Introduction:

In this Project, we will go through the full data science life cycle, from data cleaning and feature selection to machine learning. We will focus on credit modelling, a well-known data science problem that focuses on modeling a borrower's credit risk. Credit has played a key role in the economy for centuries and some form of credit has existed since the beginning of commerce. We'll be working with financial lending data from Lending Club. Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. You can read more about their marketplace here (https://www.lendingclub.com/company/about-us?)

In [104]:
import pandas as pd

In [157]:
df = pd.read_csv('/Users/user/Downloads/loans_2007.csv',  dtype='unicode')

In [163]:
df

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,Fully Paid,n,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,Charged Off,n,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,Fully Paid,n,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,Fully Paid,n,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,3000.0,60 months,12.69%,67.79,1 year,RENT,80000.0,Source Verified,Current,n,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42533,2525.0,36 months,9.33%,80.69,< 1 year,RENT,110000.0,Not Verified,Does not meet the credit policy. Status:Fully ...,n,...,82.03,May-2007,,1.0,INDIVIDUAL,,,,,
42534,6500.0,36 months,8.38%,204.84,< 1 year,NONE,,Not Verified,Does not meet the credit policy. Status:Fully ...,n,...,205.32,Aug-2007,,1.0,INDIVIDUAL,,,,,
42535,5000.0,36 months,7.75%,156.11,10+ years,MORTGAGE,70000.0,Not Verified,Does not meet the credit policy. Status:Fully ...,n,...,156.39,Feb-2015,,1.0,INDIVIDUAL,,,,,
42536,,,,,,,,,,,...,,,,,,,,,,


## Data Cleaning 

In [107]:
df.duplicated().any()

False

In [108]:
df.shape

(42538, 52)

In [219]:
df.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], 
        inplace = True , axis=1)

IndentationError: unexpected indent (1530512451.py, line 2)

In [110]:
df.shape

(42538, 44)

In [218]:
df.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], 
        inplace = True, axis=1)

IndentationError: unexpected indent (3935434766.py, line 2)

In [144]:
df.shape

(42538, 38)

In [167]:
df.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"],
        inplace = True, axis=1)

In [114]:
df.shape

(42538, 32)

In [115]:
df.head(1)

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,initial_list_status,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,Fully Paid,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


Since we're interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as a binary classification one. Let's remove all the loans that don't contain either Fully Paid or Charged Off as the loan's status. After the removal of the loan statuses, then transform the Fully Paid values to 1 for the positive case and the Charged Off values to 0 for the negative case. While there are a few different ways to transform all of the values in a column, we'll use the Dataframe method replace. According to the documentation, we can pass the replace method a nested mapping dictionary in the following format:

In [168]:
df.loan_status.value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

In [146]:
df = df[(df['loan_status'] == 'Fully Paid') | (df['loan_status']== 'Charged Off')]

In [148]:
status_replace = {
    "loan_status" : {
        'Fully Paid': 1,
        'Charged Off': 0
    }
}

In [169]:
df = df.replace(status_replace)

let's look for any columns that contain only one unique value and remove them. These columns won't be useful for the model since they don't add any information to each loan application. In addition, removing these columns will reduce the number of columns we'll need to explore in the future.

We'll need to compute the number of unique values in each column and drop the columns that contain only one unique value. While the Series method unique returns the unique values in a column, it also counts the Pandas missing value object nan as a value:

In [170]:
orig_columns = df.columns
drop_columns = []
for col in orig_columns:
    col_series = df[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
df = df.drop(drop_columns,  axis=1)

In [121]:
df

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,...,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,last_credit_pull_d,pub_rec_bankruptcies
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,1,credit_card,...,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,Jun-2016,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,0,car,...,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,Sep-2013,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,1,small_business,...,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,Jun-2016,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,1,other,...,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,Apr-2016,0.0
5,5000.0,36 months,7.90%,156.46,3 years,RENT,36000.0,Source Verified,1,wedding,...,0.0,Nov-2004,3.0,9.0,0.0,7963.0,28.3%,12.0,Jan-2016,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39781,2500.0,36 months,8.07%,78.42,4 years,MORTGAGE,110000.0,Not Verified,1,home_improvement,...,0.0,Nov-1990,0.0,13.0,0.0,7274.0,13.1%,40.0,Jun-2010,
39782,8500.0,36 months,10.28%,275.38,3 years,RENT,18000.0,Not Verified,1,credit_card,...,1.0,Dec-1986,1.0,6.0,0.0,8847.0,26.9%,9.0,Jul-2010,
39783,5000.0,36 months,8.07%,156.84,< 1 year,MORTGAGE,100000.0,Not Verified,1,debt_consolidation,...,0.0,Oct-1998,0.0,11.0,0.0,9698.0,19.4%,20.0,Jun-2007,
39784,5000.0,36 months,7.43%,155.38,< 1 year,MORTGAGE,200000.0,Not Verified,1,other,...,0.0,Nov-1988,0.0,17.0,0.0,85607.0,0.7%,26.0,Jun-2007,


## Feature Engineering 
we'll prepare the data for machine learning by focusing on handling missing values, converting categorical columns to numeric columns, and removing any other extraneous columns we encounter throughout this process.

Let's start by computing the number of missing values and come up with a strategy for handling them. Then, we'll focus on the categorical columns.

In [171]:
# Use the isnull and sum methods to return the number of null values in each column. Assign the resulting Series object to null_counts
null_counts = df.isnull().sum()

In [172]:
null_counts[null_counts>0]

loan_amnt                  3
term                       3
int_rate                   3
installment                3
emp_length              1115
home_ownership             3
annual_inc                 7
verification_status        3
loan_status                3
pymnt_plan                 3
purpose                    3
title                     16
addr_state                 3
dti                        3
delinq_2yrs               32
earliest_cr_line          32
inq_last_6mths            32
open_acc                  32
pub_rec                   32
revol_bal                  3
revol_util                93
total_acc                 32
last_credit_pull_d         7
acc_now_delinq            32
delinq_amnt               32
pub_rec_bankruptcies    1368
tax_liens                108
dtype: int64

While most of the columns have no missing values, two columns have fifty or less rows with missing values, and two columns, emp_length and pub_rec_bankruptcies, contain a relatively high number of missing values.

Domain knowledge tells us that employment length is frequently used in assessing how risky a potential borrower is, so we'll keep this column despite its relatively large number of missing values.

Let's inspect the values of the column pub_rec_bankruptcies

In [173]:
print(df.pub_rec_bankruptcies.value_counts(normalize=True, dropna=False))

0.0    0.924256
1.0    0.043396
NaN    0.032159
2.0    0.000188
Name: pub_rec_bankruptcies, dtype: float64


We see that this column offers very little variability, nearly 94% of values are in the same category. It probably won't have much predictive value. Let's drop it. In addition, we'll remove the remaining rows containing null values.

In [216]:
df = df.drop('pub_rec_bankruptcies', axis = 1, inplace = True)

In [217]:
df = df.dropna(axis =0)

In [166]:
df.shape

(42538, 38)

In [177]:
df

In [179]:
loan = pd.read_csv('/Users/user/Downloads/filtered_loans_2007.csv')

In [180]:
df

In [181]:
loan

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,...,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,last_credit_pull_d,pub_rec_bankruptcies
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,1,credit_card,...,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,Jun-2016,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,0,car,...,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,Sep-2013,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,1,small_business,...,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,Jun-2016,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,1,other,...,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,Apr-2016,0.0
4,5000.0,36 months,7.90%,156.46,3 years,RENT,36000.0,Source Verified,1,wedding,...,0.0,Nov-2004,3.0,9.0,0.0,7963.0,28.3%,12.0,Jan-2016,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38765,2500.0,36 months,8.07%,78.42,4 years,MORTGAGE,110000.0,Not Verified,1,home_improvement,...,0.0,Nov-1990,0.0,13.0,0.0,7274.0,13.1%,40.0,Jun-2010,
38766,8500.0,36 months,10.28%,275.38,3 years,RENT,18000.0,Not Verified,1,credit_card,...,1.0,Dec-1986,1.0,6.0,0.0,8847.0,26.9%,9.0,Jul-2010,
38767,5000.0,36 months,8.07%,156.84,< 1 year,MORTGAGE,100000.0,Not Verified,1,debt_consolidation,...,0.0,Oct-1998,0.0,11.0,0.0,9698.0,19.4%,20.0,Jun-2007,
38768,5000.0,36 months,7.43%,155.38,< 1 year,MORTGAGE,200000.0,Not Verified,1,other,...,0.0,Nov-1988,0.0,17.0,0.0,85607.0,0.7%,26.0,Jun-2007,


In [182]:
# Use the isnull and sum methods to return the number of null values in each column. Assign the resulting Series object to null_counts
null_counts = loan.isnull().sum()

In [183]:
null_counts[null_counts>0]

emp_length              1036
title                     11
revol_util                50
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64

While most of the columns have no missing values, two columns have fifty or less rows with missing values, and two columns, emp_length and pub_rec_bankruptcies, contain a relatively high number of missing values.
Domain knowledge tells us that employment length is frequently used in assessing how risky a potential borrower is, so we'll keep this column despite its relatively large number of missing values.
Let's inspect the values of the column pub_rec_bankruptcies

In [185]:
loan = loan.drop('pub_rec_bankruptcies', axis = 1)

In [186]:
loan = loan.dropna(axis = 0)

In [187]:
loan.dtypes.value_counts()

object     11
float64    10
int64       1
dtype: int64

While the numerical columns can be used natively with scikit-learn, the object columns that contain text need to be converted to numerical data types. Let's return a new dataframe containing just the object columns so we can explore them in more depth

In [189]:
df_objects = loan.select_dtypes(include =['object'])

In [190]:
df_objects.head(1)

Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,revol_util,last_credit_pull_d
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,83.7%,Jun-2016


Some of the columns seem like they represent categorical values, but we should confirm by checking the number of unique values in those columns:

In [192]:
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']

for c in cols:
    print(loan[c].value_counts())

RENT        18112
MORTGAGE    16686
OWN          2778
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
Not Verified       16281
Verified           11856
Source Verified     9538
Name: verification_status, dtype: int64
10+ years    8545
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64
 36 months    28234
 60 months     9441
Name: term, dtype: int64
CA    6776
NY    3614
FL    2704
TX    2613
NJ    1776
IL    1447
PA    1442
VA    1347
GA    1323
MA    1272
OH    1149
MD    1008
AZ     807
WA     788
CO     748
NC     729
CT     711
MI     678
MO     648
MN     581
NV     466
SC     454
WI     427
OR     422
LA     420
AL     420
KY     311
OK     285
KS     249
UT     249
AR     229
DC     209
RI     194
NM     180
WV     164
HI     162
NH     157
DE     110
MT      77
WY      76
AK      76
SD      60
VT  

The home_ownership, verification_status, emp_length, term, and addr_state columns all contain multiple discrete values. We should clean the emp_length column and treat it as a numerical one since the values have ordering (2 years of employment is less than 8 years).



It seems like the purpose and title columns do contain overlapping information, but we'll keep the purpose column since it contains a few discrete values. In addition, the title column has data quality issues since many of the values are repeated with slight modifications (e.g. Debt Consolidation and Debt Consolidation Loan and debt consolidation).

In [198]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
#loan = loan.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loan["int_rate"] = loan["int_rate"].astype("float")
loan["revol_util"] = loan["revol_util"].str.rstrip("%").astype("float")
loan = loan.replace(mapping_dict)

Let's now encode the home_ownership, verification_status, purpose, and term columns as dummy variables so we can use them in our model. We first need to use the Pandas get_dummies method to return a new Dataframe containing a new column for each dummy variable:

In [199]:
cat_columns = ["home_ownership", "verification_status", "purpose", "term"]
dummy_df = pd.get_dummies(loan[cat_columns])
loan = pd.concat([loan, dummy_df], axis=1)
loan = loan.drop(cat_columns, axis=1)

we performed the last amount of data preparation necessary to start training machine learning models. We converted all of the columns to numerical values because those are the only type of value scikit-learn can work with. Next, we'll experiment with training models and evaluating accuracy using cross-validation.

Our goal is to generate features from data, which we can feed into a machine learning algorithm. The algorithm will make predictions about whether or not a loan will be paid off on time, which is contained in the loan_status column of the clean dataset.

As we prepared the data, we removed columns that had data leakage issues, contained redundant information, or required additional processing to turn into useful features. We cleaned features that had formatting issues and converted categorical columns to dummy variables.

In the last section, we noticed that there's a class imbalance in our target column, loan_status. There are about 6 times as many loans that were paid off on time (positive case, label of 1) than those that weren't (negative case, label of 0). Imbalances can cause issues with many machine learning algorithms, where they appear to have high accuracy, but actually aren't learning from the training data. Due to its potential to cause issues, we need to keep the class imbalance in mind as we build machine learning models.

In [202]:
loan.isnull().sum().any()

False

An error metric will help us figure out when our model is performing well, and when it's performing poorly. To tie error metrics all the way back to the original question we wanted to answer, let's say we're using a machine learning model to predict whether or not we should fund a loan on the Lending Club platform. Our objective in this is to make money -- we want to fund enough loans that are paid off on time to offset our losses from loans that aren't paid off. An error metric will help us determine if our algorithm will make us money or lose us money.

In this case, we're primarily concerned with false positives and false negatives. Both of these are different types of misclassifications. With a false positive, we predict that a loan will be paid off on time, but it actually isn't. This costs us money, since we fund loans that lose us money. With a false negative, we predict that a loan won't be paid off on time, but it actually would be paid off on time. This loses us potential money, since we didn't fund a loan that actually would have been paid off.

Since we're viewing this problem from the standpoint of a conservative investor, we need to treat false positives differently than false negatives. A conservative investor would want to minimize risk and avoid false positives as much as possible. They'd be more secure with missing out on opportunities (false negatives) than they would be with funding a risky loan (false positives).

Let's calculate false positives and true positives in Python. We can use multiple conditionals, separated by a & to select items in a NumPy array that meet certain conditions. For instance, if we had an array called predictions, we could select items in predictions that equal 1 and where items in loans["loan_status"] in the same position also equal 1 using this:

Find the number of true negatives.
Find the number of items where predictions is 0, and the corresponding entry in loans["loan_status"] is also 0.
Assign the result to tn.

In order to fit the machine learning models, we'll use the Scikit-learn library. Although we've built our own implementations of algorithms in earlier lessons, it's easier and faster to use algorithms that someone else has already written and tuned for high performance.

A good first algorithm to apply to binary classification problems is logistic regression, for the following reasons:
    
it's quick to train and we can iterate more quickly,
it's less prone to overfitting than more complex models like decision trees,
it's easy to interpret.

In [204]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
cols = loan.columns
train_cols = cols.drop("loan_status")
features = loan[train_cols]
target = loan["loan_status"]
lr.fit(features, target)
predictions = lr.predict(features)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


While we generated predictions in the above, those predictions were overfit. They were overfit because we generated predictions using the same data that we trained our model on. When we use this to evaluate an error, we get an unrealistically high depiction of how accurate the algorithm is, because it already "knows" the correct answers. This is like asking someone to memorize a bunch of physics equations, then asking them to plug numbers into the equations. They can tell you the right answer, but they can't explain a concept that they haven't already memorized an equation for.

In [215]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
lr = LogisticRegression()
predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loan["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loan["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loan["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loan["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)
print(tpr)
print(fpr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9982788844621514
0.9980806142034548


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In order to get a realistic depiction of the accuracy of the model, let's perform k-fold cross validation. We can use the cross_val_predict() function from the sklearn.model_selection package

Unfortunately, even though we're not using accuracy as an error metric, the classifier is, and it isn't accounting for the imbalance in the classes. There are a few ways to get a classifier to correct for imbalanced classes. The two main ways are:

Use oversampling and undersampling to ensure that the classifier gets input that has a balanced number of each class.
Tell the classifier to penalize misclassifications of the less prevalent class more than the other class.

In [214]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
lr = LogisticRegression(class_weight="balanced")
predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loan["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loan["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loan["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loan["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.5438406374501992
0.5299424184261037


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


We significantly improved the false positive rate in the last screen by balancing the classes, which reduced true positive rate. Our true positive rate is now around 66%, and our false positive rate is around 39%. From a conservative investor's standpoint, it's reassuring that the false positive rate is lower, because it means that we'll be able to do a better job at avoiding bad loans than if we funded everything. However, we'd only decide to fund 66% of the total loans (true positive rate), so we'd immediately reject a good amount of loans.

We can try to lower the false positive rate further by assigning a harsher penalty for misclassifying the negative class. While setting class_weight to balanced will automatically set a penalty based on the number of 1s and 0s in the column, we can also set a manual penalty. In the last screen, the penalty scikit-learn imposed for misclassifying a 0 would have been around 5.89 (since there are 5.89 times as many 1s as 0s).

Let's try a more complex algorithm, random forest. We learned about random forests in a previous lesson and constructed our own model. Random forests are able to work with nonlinear data and learn complex conditionals. Logistic regressions are only able to work with linear data. Training a random forest algorithm may enable more accuracy due to columns that correlate nonlinearly with loan_status

In [212]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
rf = RandomForestClassifier(class_weight="balanced", random_state=1)
predictions = cross_val_predict(rf, features, target, cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loan["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.`
tp_filter = (predictions == 1) & (loan["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loan["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loan["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)

0.9956653386454183
0.9975047984644914


# conclusion

Ultimately, our best model had a false positive rate of nearly 9%, and a true positive rate of nearly 24%. For a conservative investor, this means that they make money as long as the interest rate is high enough to offset the losses from 9% of borrowers defaulting. In addition, the pool of 24% of borrowers must be large enough to make enough interest money to offset the losses.

If we had randomly picked loans to fund, borrowers would have defaulted on 14.5% of them, and our model is better than that, although we're excluding more loans than a random strategy would. Given this, there's still quite a bit of room to improve:

We can tweak the penalties further.
We can try models other than a random forest and logistic regression.
We can use some of the columns we discarded to generate better features.
We can ensemble multiple models to get more accurate predictions.
We can tune the parameters of the algorithm to achieve higher performance.