<a href="https://colab.research.google.com/github/JudeDihan/Predicting-Good-LendingClub-Investments/blob/master/LendingClub_Predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Introduction and Data Cleaning

### Introduction

Lending Club (www.lendingclub.com) is an American peer-to-peer lending company. People could submit an application for a loan, and after Lending Club decides if they could pay back the loan, they let interested investors know and invite them to invest in the loan. 
The investors are given information about the person asking for th loan, and based on that information, they can decide whether or not it is a worthy investment. Investors gain profit from borrowers that pay the amount in full within the agreed time duration.

This project attempts to help the investors make financially safe decisions when it comes to choosing an investment. This is done using the data released from Lending Club for the years 2007 through 2011 in harmony with Machine Learning models. 

The dataset and it's dictionary can be found here: https://data.world/jaypeedevlin/lending-club-loan-data-2007-11

### Importing the libraries and dataset.

In [1]:
import pandas as pd
import numpy as np

In [2]:
loans_0711 = pd.read_csv("lending_club_loans.csv", skiprows=1, low_memory=False)
print('Shape of DataFrame =',loans_0711.shape)

Shape of DataFrame = (42538, 115)


In [3]:
pd.set_option('max_columns', 150)
pd.set_option('max_rows', 150)

Let's take a brief look at the dataset

In [4]:
loans_0711.head(3)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_il_6m,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,Borrower added on 12/22/11 > I need to upgra...,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,735.0,739.0,1.0,,,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,,Sep-2016,744.0,740.0,0.0,,1.0,INDIVIDUAL,,,,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,https://lendingclub.com/browse/loanDetail.acti...,Borrower added on 12/22/11 > I plan to use t...,car,bike,309xx,GA,1.0,0.0,Apr-1999,740.0,744.0,5.0,,,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,,Sep-2016,499.0,0.0,0.0,,1.0,INDIVIDUAL,,,,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,735.0,739.0,2.0,,,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,,Sep-2016,719.0,715.0,0.0,,1.0,INDIVIDUAL,,,,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,


### Data Cleaning

Let's get rid of any columns with over 50% missing values as it would only have a negative effect on our ML model

In [5]:
null_percentage = (loans_0711.isnull().sum())/len(loans_0711)*100
over_50_cols = null_percentage[null_percentage > 50].index.tolist()
loans_0711 = loans_0711.drop(over_50_cols, axis=1)
loans_0711.shape

(42538, 58)

Let's get rid of the "url" and "desc" columns as well, since they make no sense in a Machine Learning Model.

In [6]:
remove_cols = ["desc", "url"]
loans_0711 = loans_0711.drop(remove_cols, axis=1)
loans_0711.shape

(42538, 56)

**Dropping Columns**

Now let's remove the columns which either...

1.   Has randomly generated values
2.   Leaks data about the prediction
3.   Requiers other data to be useful
4.   Contains redundant information

For indepth clarification, please see the data dictionary here: https://data.world/jaypeedevlin/lending-club-loan-data-2007-11



In [7]:
loans_0711 = loans_0711.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
print(loans_0711.shape)
loans_0711 = loans_0711.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
print(loans_0711.shape)
loans_0711 = loans_0711.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
print(loans_0711.shape)

(42538, 48)
(42538, 42)
(42538, 36)


We now have reduced our columns number to 36 columns. We belive these columns have quality data that can help our ML model make good predictions

**Picking out the Target Column**

In order for our ML model to make a prediction, it requires a "Target Column". Looking at our list of columns, we must pick a column that accurately describes the status of the loan; paid, defaulted or charged-off. Since the only column that describes this well is the "loan_status" column, let's pick that column as our Target Column.

In [8]:
print(loans_0711["loan_status"].value_counts().sort_values(ascending=False))

Fully Paid                                             33586
Charged Off                                             5653
Does not meet the credit policy. Status:Fully Paid      1988
Does not meet the credit policy. Status:Charged Off      761
Current                                                  513
In Grace Period                                           16
Late (31-120 days)                                        12
Late (16-30 days)                                          5
Default                                                    1
Name: loan_status, dtype: int64


Let's get rid of the rows that do not contain "Fully Paid" or "Charged off" from the "loan_status" column.
Then, let's code the "Fully Paid" catergory as 1, and the "Charged Off" catergory as 2. This will benefit our ML model, as almost all ML models only understand numerical data.

In this project, we are looking from the investor's perspective. Since only the "Fully Paid" and "Charged Off" labels describe the final outcome of the loan, having the other labels could mislead the ML model.

In [9]:
loans_0711 = loans_0711[(loans_0711["loan_status"] == "Fully Paid") | (loans_0711["loan_status"] == "Charged Off")]
replace_code = {"loan_status":{"Fully Paid": 1, "Charged Off": 0}}
loans_0711 = loans_0711.replace(replace_code)

In [10]:
loans_0711["loan_status"].head()

0    1
1    0
2    1
3    1
5    1
Name: loan_status, dtype: int64

**Columns with only one unique value.**

Let's see if there are any columns with only one unique value in them. These columns are redundant in our model.

In [11]:
columns_list = loans_0711.columns.tolist()
remove_cols = []

for col in columns_list:
  unique_vals = loans_0711[col].dropna().unique()
  
  if len(unique_vals) == 1:
    remove_cols.append(col)

loans_0711[remove_cols].head()

Unnamed: 0,initial_list_status,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,tax_liens
0,f,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0
1,f,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0
2,f,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0
3,f,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0
5,f,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0


In [12]:
loans_0711 = loans_0711.drop(remove_cols, axis=1)
print(loans_0711.shape)

(39239, 28)


Let's also get rid or the "pymnt_plan" and the columns containing "fico" as they seem to have misleading values

In [13]:
remove_cols = ['pymnt_plan','fico_range_low','fico_range_high','last_fico_range_high','last_fico_range_low']
loans_0711 = loans_0711.drop(remove_cols, axis=1)
print(loans_0711.shape)

(39239, 23)


Great! We now were able to reduce our set of columns from 115 to 23. We believe that these 23 columns have a direct relatioship with the "loan_status" column, and hence will be able to make good predictions

# 2. Feature Preparation

In this section, we will be dealing with preparing the columns in a suitable way to feed into the Machine Learning algorithm. This includes turning the non-numerical columns in to numerical columns as well.

### Checking the class imbalance

In [14]:
print(loans_0711['loan_status'].value_counts())
print('#'*20)
zero_count_pct = (loans_0711['loan_status']).sum()/(loans_0711['loan_status']== 0).sum()
print(zero_count_pct)

1    33586
0     5653
Name: loan_status, dtype: int64
####################
5.941270122059084


From the above code output, we should also note that we have 6 times more data for loans that were fully paid, than for loans that were "charged-off". This must be kept in mind when developing the ML models, as this causes class imbalances. 

### Handling missing values

Now, let's take a look at the columns with missing values, and see if columns with a large number of missing values can be dropped.

In [15]:
print(loans_0711.isnull().sum().sort_values(ascending=False))

emp_length              1057
pub_rec_bankruptcies     697
revol_util                50
title                     11
last_credit_pull_d         2
purpose                    0
term                       0
int_rate                   0
installment                0
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
total_acc                  0
loan_amnt                  0
dtype: int64


"emp_length" relates to employment length. Since this is a very important factor when it comes to knowing if a person can payback the loan, we must keep this column as it is.

Columns less that 50 missing values doesnt really cause a problem as well. So let us now look at the "pub_rec_bankruptcies".

In [16]:
loans_0711["pub_rec_bankruptcies"].value_counts()

0.0    36872
1.0     1665
2.0        5
Name: pub_rec_bankruptcies, dtype: int64

In [17]:
# Finding the % of people who were not bankrupt, and was able to pay the loans or not
not_bankrupt = loans_0711[loans_0711["pub_rec_bankruptcies"]==0]
print(not_bankrupt["loan_status"].value_counts(normalize=True))


print("*"*20)

# Finding the % of people who have been bankrupt, and was able to pay the loans or not
bankrupt = loans_0711[loans_0711["pub_rec_bankruptcies"]>0]
print(bankrupt["loan_status"].value_counts(normalize=True))

print("*"*20)

# Finding the % of people who has no information under this field, and was able to pay the loans or not
no_info = loans_0711[loans_0711["pub_rec_bankruptcies"].isnull()]
print(no_info["loan_status"].value_counts(normalize=True))




1    0.859948
0    0.140052
Name: loan_status, dtype: float64
********************
1    0.777844
0    0.222156
Name: loan_status, dtype: float64
********************
1    0.830703
0    0.169297
Name: loan_status, dtype: float64


While there seems to be a not an extremly direct correlation between the two sets of people above, the people who had been bankrupt at some point of the time has a lesser chance of paying the loan in full on time (about 7% less). However, a large percentage of people did not mention their bankruptcy history. And when we take all the classes of people (who did not mention bankcruptcy history, not bankcrupt, has been bankcrupt), the variability among all three classes is negligable. So we can actually get rid of this column as well. 

In [18]:
loans_0711 = loans_0711.drop("pub_rec_bankruptcies", axis=1)
loans_0711.shape

(39239, 22)

### Column dtypes and converting of text columns to numerical

Let's take a look at what the column types we have with us now

In [19]:
print(loans_0711.dtypes.sort_values())

loan_status              int64
loan_amnt              float64
revol_bal              float64
installment            float64
pub_rec                float64
open_acc               float64
annual_inc             float64
inq_last_6mths         float64
total_acc              float64
delinq_2yrs            float64
dti                    float64
revol_util              object
earliest_cr_line        object
title                   object
purpose                 object
verification_status     object
home_ownership          object
emp_length              object
int_rate                object
term                    object
addr_state              object
last_credit_pull_d      object
dtype: object


Since we will be using ScikitLearn, most of their algorithms expect the columns to be numerical. Let's take a look at our text columns and see if we can convert them to numerical columns. 

In [20]:
text_df = loans_0711.select_dtypes(include="object")
print(text_df.shape)
text_df.head()


(39239, 11)


Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,revol_util,last_credit_pull_d
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,83.7%,Sep-2016
1,60 months,15.27%,< 1 year,RENT,Source Verified,car,bike,GA,Apr-1999,9.4%,Sep-2016
2,36 months,15.96%,10+ years,RENT,Not Verified,small_business,real estate business,IL,Nov-2001,98.5%,Sep-2016
3,36 months,13.49%,10+ years,RENT,Source Verified,other,personel,CA,Feb-1996,21%,Apr-2016
5,36 months,7.90%,3 years,RENT,Source Verified,wedding,My wedding loan I promise to pay back,AZ,Nov-2004,28.3%,Jan-2016


We can see that the columns "int_rate" and "revol_util" both are actually having numerical values but are taken as objects because of the "%" sign. Let's get that fixed.

In [21]:
# Deleting rows with missing values in revol_util (less than 0.12% of data)
print(loans_0711["revol_util"].isnull().sum())
drop_index = loans_0711[loans_0711["revol_util"].isnull()].index
loans_0711 = loans_0711.drop(index=drop_index)
print(loans_0711["revol_util"].isnull().sum())


50
0


In [22]:
loans_0711["int_rate"] = loans_0711["int_rate"].str.replace("%","").astype("float")
loans_0711["revol_util"] = loans_0711["revol_util"].str.replace("%","").astype("float")
text_df = loans_0711.select_dtypes(include="object")
print(text_df.shape)
text_df.head()

(39189, 9)


Unnamed: 0,term,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,last_credit_pull_d
0,36 months,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,Sep-2016
1,60 months,< 1 year,RENT,Source Verified,car,bike,GA,Apr-1999,Sep-2016
2,36 months,10+ years,RENT,Not Verified,small_business,real estate business,IL,Nov-2001,Sep-2016
3,36 months,10+ years,RENT,Source Verified,other,personel,CA,Feb-1996,Apr-2016
5,36 months,3 years,RENT,Source Verified,wedding,My wedding loan I promise to pay back,AZ,Nov-2004,Jan-2016


Let's explore the unique value counts of the columns that seem like they contain categorical values.

In [23]:
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    print(loans_0711[c].value_counts())

RENT        18682
MORTGAGE    17385
OWN          3023
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
Not Verified       16816
Verified           12516
Source Verified     9857
Name: verification_status, dtype: int64
10+ years    8716
< 1 year     4544
2 years      4344
3 years      4050
4 years      3387
5 years      3246
1 year       3208
6 years      2199
7 years      1739
8 years      1457
9 years      1245
Name: emp_length, dtype: int64
 36 months    29049
 60 months    10140
Name: term, dtype: int64
CA    7022
NY    3758
FL    2833
TX    2694
NJ    1825
IL    1513
PA    1493
VA    1388
GA    1381
MA    1322
OH    1198
MD    1039
AZ     864
WA     830
CO     778
NC     772
CT     738
MI     718
MO     677
MN     609
NV     488
SC     469
WI     447
OR     441
AL     441
LA     432
KY     319
OK     295
KS     264
UT     255
AR     241
DC     211
RI     197
NM     187
WV     174
HI     170
NH     169
DE     113
MT      84
WY      83
AK      79
SD      61
VT  

The columns mentioned above all have multiple discrete values. It is good practice to clean up the "emp_length" column and treat it as a numerical column, since there is an order to its data. 

Both columns "title" and "purpose" convey the same message. Since the "purpose" column contains few discrete values, we will opt to keep that column.

The "addr_state" column has 50 discrete values. Converting this to dummy variables would add 50 more columns to the ML model dataset. Let's for now drop this column.

In [24]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        np.nan: 0
    }
}

In [25]:
loans_0711 = loans_0711.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loans_0711 = loans_0711.replace(mapping_dict)

In [26]:
text_df = loans_0711.select_dtypes(include="object")
print(text_df.shape)
text_df.head()

(39189, 4)


Unnamed: 0,term,home_ownership,verification_status,purpose
0,36 months,RENT,Verified,credit_card
1,60 months,RENT,Source Verified,car
2,36 months,RENT,Not Verified,small_business
3,36 months,RENT,Source Verified,other
5,36 months,RENT,Source Verified,wedding


Let's now convert the above four columns into catergorical columns, so we can use it in our ML Model

In [27]:
text_cols = text_df.columns.tolist()
dummy_df = pd.get_dummies(loans_0711[text_cols])

Now, let's connect the "dummy_df" to our main dataframe ("loans_0711") and let's drop the "text_cols" columns so we only have numerical columns

In [28]:
loans_0711 = pd.concat([loans_0711,dummy_df], axis=1)
loans_0711 = loans_0711.drop(text_cols, axis=1)
print(loans_0711.shape)

(39189, 38)


In [29]:
loans_0711.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39189 entries, 0 to 39785
Data columns (total 38 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   loan_amnt                            39189 non-null  float64
 1   int_rate                             39189 non-null  float64
 2   installment                          39189 non-null  float64
 3   emp_length                           39189 non-null  int64  
 4   annual_inc                           39189 non-null  float64
 5   loan_status                          39189 non-null  int64  
 6   dti                                  39189 non-null  float64
 7   delinq_2yrs                          39189 non-null  float64
 8   inq_last_6mths                       39189 non-null  float64
 9   open_acc                             39189 non-null  float64
 10  pub_rec                              39189 non-null  float64
 11  revol_bal                   

Now we only have numerical data in our dataset. We can start our next step, which is Machine Learning.

# 3. Making Predictions

### Logistic Regression

Let's start our modeling with Logistic Regression. Since we are dealing with binary classification, this will be a good choice. 
Logistic Regression is also quick to train, and can iterate quite quickly. It's also less prone to overfitting than other more complex models.

In order for our Logistic Regression model to have a good training data, let's use **K-Fold Validation**. Keep in mind, that our dataset has six times more 1s in the target column than 0s. 

In [30]:
cols = loans_0711.columns.tolist()
cols.remove("loan_status")

x_train = loans_0711[cols]
y_train = loans_0711["loan_status"]

**Importing required libraries.**

In [31]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

**Fitting the model and making predictions**

In [32]:
lr_regressor = LogisticRegression(max_iter=200)
predictions = cross_val_predict(lr_regressor, x_train, y_train, cv=3)

In [33]:
print(pd.Series(predictions).value_counts())
print(pd.Series(predictions).value_counts(normalize=True))


1    39145
0       44
dtype: int64
1    0.998877
0    0.001123
dtype: float64


**Error Metrics**

As mentioned above, there is a significant class imbalance in our dataset. To see how this plays in our predictions, let us consider the **True Positive Rate** and the **False Positive Rates** in our predicted values.
Let's build a function for this so we can call it anytime we want.

In [34]:
# Takes in the predictions coming out from the cross_val_predict function.
def true_positive_rates(predictions):

  predictions = pd.Series(predictions)

  # False Positives
  fp_filter = ((predictions == 1) & (loans_0711["loan_status"]== 0))
  fp = len(predictions[fp_filter])
  print('False Positives =', fp) 

  # True Positives
  tp_filter = ((predictions == 1) & (loans_0711["loan_status"]== 1))
  tp = len(predictions[tp_filter])
  print('True Positives =', tp) 

  # False Negatives
  fn_filter = ((predictions == 0) & (loans_0711["loan_status"]== 1))
  fn = len(predictions[fn_filter])
  print('False Negatives =', fn) 

  # True Negatives
  tn_filter = ((predictions == 0) & (loans_0711["loan_status"]== 0))
  tn = len(predictions[tn_filter])
  print('True Negatives =', tn) 

  print(" ")
  print("-"*35)
  print(" ")

  # True Positive Rate
  tpr = tp / (tp+fn)
  print("True Positive Rate =", tpr*100)

  # False Positive Rate
  fpr = fp / (fp+tn)
  print("False Positive Rate =", fpr*100)

  # Accuracy
  accuracy = (tp + tn)/(tp + tn + fp + fn)*100
  print("Accuracy =", accuracy)


In [35]:
predictions = true_positive_rates(predictions)

False Positives = 5522
True Positives = 33026
False Negatives = 34
True Negatives = 10
 
-----------------------------------
 
True Positive Rate = 99.89715668481549
False Positive Rate = 99.81923355025307
Accuracy = 85.60323383084577


**Tuning the Logistic Regression for a balanced class weight**

Our model has relatively high accuracy of 85%. However, it also has a False Positive Rate of almost 100%. This can be very problematic from the Investor's viewpoint. We need to reduce the False Positive Rate if we are to make conservative investing decisions. However, it comes with a price. Usually, when you try to decrease the FPR, the TPR and accuracy tends to decrease as well. 

A significant reason for this model to show a high FPR and TPR  is because it has 6 times more data relating to "Fully Paid" than the "Charged-off" labels. This is a dataset limitation. One way to reduce this limitation is to penalize the classifier. Let's try the "balanced" penalizer on our model, which asks the algorithm to give more balanced attention to the minority predicting label as it gives for the majority. 

In [36]:
lr_regressor = LogisticRegression(max_iter=200, class_weight='balanced')
predictions = cross_val_predict(lr_regressor, x_train, y_train, cv=3)

predictions = true_positive_rates(predictions)

False Positives = 2730
True Positives = 16867
False Negatives = 16193
True Negatives = 2802
 
-----------------------------------
 
True Positive Rate = 51.019358741681785
False Positive Rate = 49.34924078091106
Accuracy = 50.9665215588723


As we expected, both the TPR and the FPR dropped by almost half. The accuracy has also dropped. This is a trade-off that is non escapable. 
However, an investor would prefer this outcome than the previous, since his/her risk of investing in a loan that does will not be paid back is cut down in half. 

**Tuning the Logistic Regression with a manual penalty**

Let's see if us manually setting a penalty would improve our model.

In [37]:
penalty = {0: 12, 1: 1}

lr_regressor = LogisticRegression(max_iter=200, class_weight=penalty)
predictions = cross_val_predict(lr_regressor, x_train, y_train, cv=3)

predictions = true_positive_rates(predictions)

False Positives = 602
True Positives = 3669
False Negatives = 29391
True Negatives = 4930
 
-----------------------------------
 
True Positive Rate = 11.098003629764065
False Positive Rate = 10.882140274765003
Accuracy = 22.281820066334994


While FPR is now down to 10% it also brought down the TPR to the same amount. And in this model, only 22% of the predictions are correct. This is not a good model for the investors. 

Let's see if using a Random Forrest Classifier make any improvements on our model.

### Random Forrest Regression

In [38]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rf = RandomForestClassifier(class_weight="balanced", random_state=1)
predictions = cross_val_predict(rf, x_train, y_train, cv=3)
predictions = true_positive_rates(predictions)

False Positives = 5514
True Positives = 32968
False Negatives = 92
True Negatives = 18
 
-----------------------------------
 
True Positive Rate = 99.72171808832427
False Positive Rate = 99.67462039045553
Accuracy = 85.47367330016584


This is similar to our first Logistic Classifier. While it's accuracy is the highest, the chances of investing in loans that would not be paid, if invested, is almost 100%

### Verdict

The significant class imbalance (6:1) in our dataset seems to have a big impact on our metrics and accuracies. Since this is dataset limitation, the only solutin itself is to bring in more data for the minority class.

That said, we can still choose to use the Linear Regression model that gave out a 50% accuracy and a TPR and FPR close to 50% as it is the most balanced outcome we were able to get, as far as conservative investment methods are concerned. The investor may opt for the Linear Regression model with around 85% accuracy, but in the event of a bad positive prediction, he/she has a 100% chance of losing his money. 

DISCLAIMER: THIS PREDICTING MODEL IS NOT TO BE USED TO MAKE INVESTMENT DECISIONS AS IT IS DONE FOR A PERSONNAL PROJECT. 



Special thanks to Dataquest.io for the inspiration for this project.