# Credit Modelling

In this project we will focus on credit modelling which focuses on modelling a borrower's [credit risk](https://en.wikipedia.org/wiki/Credit_risk).We'll be working with financial lending data from [Lending Club](https://www.lendingclub.com/). Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. You can read more about their marketplace [here](https://www.lendingclub.com/public/how-peer-lending-works.action).

Each borrower completes a comprehensive application, providing their past financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score using past historical data and their own data science process to assign an interest rate to the borrower. The interest rate is the percent in addition to the requested loan amount the borrower has to pay back. You can read more about the interest rate that Lending Club assigns [here](https://www.lendingclub.com/public/borrower-rates-and-fees.action).

A higher interest rate means that the borrower is a risk and more unlikely to pay back the loan. While a lower interest rate means that the borrower has a good credit history and is more likely to pay back the loan. The interest rates range from 5.32% all the way to 30.99% and each borrower is given a [grade](https://www.lendingclub.com/public/rates-and-fees.action) according to the interest rate they were assigned. If the borrower accepts the interest rate, then the loan is listed on the Lending Club marketplace.

Investors are primarily interested in receiving a return on their investments. Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower's credit score, the purpose for the loan, and other information from the application. Once they're ready to back a loan, they select the amount of money they want to fund. Once a loan's requested amount is fully funded, the borrower receives the money they requested minus the [origination fee](https://help.lendingclub.com/hc/en-us/articles/214501207-What-is-the-origination-fee-) that Lending Club charges.

The borrower will make monthly payments back to Lending Club either over 36 months or over 60 months. Lending Club redistributes these payments to the investors. This means that investors don't have to wait until the full amount is paid off before they see a return in money. If a loan is fully paid off on time, the investors make a return which corresponds to the interest rate the borrower had to pay in addition to the requested amount. Many loans aren't completely paid off on time and some borrowers default on the loan.

Below is a diagram that sums up the process:

![image](http://cdn.biblemoneymatters.com/wp-content/uploads/2009/08/how-social-lending-works.jpg)


While Lending Club has to be extremely savvy and rigorous with their credit modelling, investors on Lending Club need to be equally as savvy about determining which loans are more likely to be paid off. At first, you may wonder why investors put money into anything but low interest loans. The incentive investors have to back higher interest loans is, well, the higher interest! If investors believe the borrower can pay back the loan, even if he or she has a weak financial history, then investors can make more money through the larger additional amount the borrower has to pay.

Most investors use a portfolio strategy to invest small amounts in many loans, with healthy mixes of low, medium, and interest loans. In this course, we'll focus on the mindset of a conservative investor who only wants to invest in the loans that have a good chance of being paid off on time. To do that, we'll need to first understand the features in the dataset and then experiment with building machine learning models that reliably predict if a loan will be paid off or not.

# Defining the objective and gathering the data

Lending Club releases data for all of the approved and declined loan applications periodically on their website. You can select different year ranges to download the datasets (in CSV format) for both approved and declined loans. 

A data dictionary for the same is available [here](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit).The LoanStats sheet describes the approved loans datasets and the RejectStats describes the rejected loans datasets. Since rejected applications don't appear on the Lending Club marketplace and aren't available for investment, we'll be focusing on approved loans.

The approved loans datasets contain information on current loans, completed loans, and defaulted loans.We need to build a predictive model which will be able to predict if the borrower will pay off the loan on time or not.

n this project, we will focus on approved loans data from 2007 to 2011, since a good number of the loans have already finished. In the datasets for later years, many of the loans are current and still being paid off.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


# Import the data

data=pd.read_csv("loans_2007.csv")

#Drop duplicate rows if any

data.drop_duplicates(inplace=True)

data.head()



  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [2]:
data.shape

(42538, 52)

In [3]:
data.dtypes

id                             object
member_id                     float64
loan_amnt                     float64
funded_amnt                   float64
funded_amnt_inv               float64
term                           object
int_rate                       object
installment                   float64
grade                          object
sub_grade                      object
emp_title                      object
emp_length                     object
home_ownership                 object
annual_inc                    float64
verification_status            object
issue_d                        object
loan_status                    object
pymnt_plan                     object
purpose                        object
title                          object
zip_code                       object
addr_state                     object
dti                           float64
delinq_2yrs                   float64
earliest_cr_line               object
inq_last_6mths                float64
open_acc    

# Data Cleaning and Preparation

Now that we have the data read we will first need to clean the dataset before modelling.For data cleaning purposes we will do the following steps:

1. By making use of the data dictionary remove any redundant columns that do not affect the target column.For example an id assigned by Lending Club will not help to determine the credit risk.
2. Disclose information after the loan has been funded.We need to determine if we want to approve loan for the borrower.Hence this will not be useful.
3. Columns which have redundant information
4. Columns which are poorly formatted and require cleaning before they can be used.
5. Require more data or a lot of processing before they can be used as a feature.


#### Removing Redundant features

After analyzing the columns we can say that the following columns should be removed from analysis:

| Column                  	| Reason for removal                                                            	|
|-------------------------	|-------------------------------------------------------------------------------	|
| id                      	| generated for unique identification                                           	|
| member_id               	| generated for unique identification                                           	|
| funded_amnt             	| available after loan is sanctioned                                            	|
| funded_amnt_inv         	| available after loan is sanctioned                                            	|
| grade                   	| redundant since based on interest rate                                        	|
| sub_grade               	| redundant since based on interest rate                                        	|
| emp_title               	| requires other data and processsing to become useful                          	|
| issue_d                 	| available after loan is sanctioned                                            	|
| zip_code                	| redundant with add_state column because only first three digits are   visible 	|
| out_prncp               	| available after loan is sanctioned                                            	|
| out_prncp_inv           	| available after loan is sanctioned                                            	|
| total_pymnt             	| available after loan is sanctioned                                            	|
| total_pymnt_inv         	| available after loan is sanctioned                                            	|
| total_rec_prncp         	| available after loan is sanctioned                                            	|
| total_rec_int           	| available after loan is sanctioned                                            	|
| total_rec_late_fee      	| available after loan is sanctioned                                            	|
| recoveries              	| available after loan is sanctioned                                            	|
| collection_recovery_fee 	| available after loan is sanctioned                                            	|
| last_pymnt_d            	| available after loan is sanctioned                                            	|
| last_pymnt_amnt         	| available after loan is sanctioned                                            	|

grade and sub_grade are assigned basis the interest rate.Since interest rate is continuous and these two are categorical columns we will retain interest rate and drop the rest of the columns.

In [4]:
# Drop the above columns

data_clean=data.drop(["id","member_id","funded_amnt","funded_amnt_inv","grade","sub_grade","emp_title","issue_d","zip_code","out_prncp",
                     "out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee",
                     "recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt"],axis=1).copy()

In [5]:
data_clean.shape

(42538, 32)

#### Identifying and cleaning the target column

After exploring the columns we know that the *loan_status* column will be our target column. Let's explore this column.

In [6]:
data_clean["loan_status"].unique()

array(['Fully Paid', 'Charged Off', 'Current', 'In Grace Period',
       'Late (31-120 days)', 'Late (16-30 days)', 'Default', nan,
       'Does not meet the credit policy. Status:Fully Paid',
       'Does not meet the credit policy. Status:Charged Off'],
      dtype=object)

As we can see above the column contains information on whether the loan was fully paid,has delayed payments or was defaulted by the borrower.

In [7]:
data_clean["loan_status"].value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

There are 8 different values to the target column. We can get more information on these columns from the [Lending Club website](https://help.lendingclub.com/hc/en-us/articles/215488038-What-do-the-different-Note-statuses-mean-). The does not meet credit policy is not available on the site but we can search for it on the internet.

Below is a explaination of the values:


|                     Loan   Status                     	|                                                                        Meaning                                                                        	|
|:-----------------------------------------------------:	|:-----------------------------------------------------------------------------------------------------------------------------------------------------:	|
| Fully Paid                                            	| Loan has been fully paid off.                                                                                                                         	|
| Charged Off                                           	| Loan for which there is no longer a   reasonable expectation of further payments.                                                                     	|
| Does not meet the credit policy.   Status:Fully Paid  	| While the loan was paid off, the loan   application today would no longer meet the credit policy and wouldn't be   approved on to the marketplace.    	|
| Does not meet the credit policy.   Status:Charged Off 	| While the loan was charged off, the loan   application today would no longer meet the credit policy and wouldn't be   approved on to the marketplace. 	|
| In Grace Period                                       	| The loan is past due but still in the grace   period of 15 days.                                                                                      	|
| Late (16-30 days)                                     	| Loan hasn't been paid in 16 to 30 days   (late on the current payment).                                                                               	|
| Late (31-120 days)                                    	| Loan hasn't been paid in 31 to 120 days   (late on the current payment).                                                                              	|
| Current                                               	| Loan is up to date on current payments.                                                                                                               	|
| Default                                               	| Loan is defaulted on and no payment has   been made for more than 121 days.   



From the investor's perspective, we're interested in trying to predict whether loans will be paid off on time. Only the Fully Paid and Charged Off values describe the final outcome of the loan. The other values describe loans that are still ongoing and where the jury is still out on if the borrower will pay back the loan on time or not. While the Default status resembles the Charged Off status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while default ones have a small chance.

Since we are interested only in these two values of Fully Paid and Charged Off treate this problem as a bunary classification problem. We will remove all the rows which do not contain either of these two values. Once that is done we will code the Fully Paid rows as 1 and Charged Off rows as 0.

In [8]:
# Retaining only fully paid and charged off columns
data_clean=data_clean[(data_clean["loan_status"]=="Fully Paid")|(data_clean["loan_status"]=="Charged Off")]

In [9]:
# Converting the fully paid and charged off columns to numeric

mapping_dict={"Fully Paid":1,"Charged Off":0}

data_clean["loan_status"]=data_clean["loan_status"].replace(mapping_dict)

#### Removing features with one single value

Features which have one single value will not really add any value to the model.Hence we will be getting rid of them.

In [10]:
# We can use unique to count the number of unique values in the particular column


def uniq_val_cnts(feature):
    '''
    Function will take input as the colum'''
    feature=feature.dropna()
    '''
    Removing na because unique method counts nan as a unique value'''
    uni_val_cnts=len(feature.unique())
    ''' Count the number of unique values in the column'''
    return uni_val_cnts


for col in data_clean.columns:
    ''' Loop through the columns in data and then use the uniq_val_cnts function'''
    feature_col=data_clean[col]
    cnts=uniq_val_cnts(feature_col)
    if cnts==1:
        data_clean.drop(col,inplace=True,axis=1)
        
        
    

In [11]:
data_clean.shape

(38770, 23)

#### Dealing with missing values

In [12]:
# Identifying the missing values

null_value_counts=data_clean.isnull().sum()

null_value_counts[null_value_counts>0]

emp_length              1036
title                     11
revol_util                50
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64

As we can see above the emp_length and pub_rec_bankruptcies contain relatively higher missing values compared to the other columns.emp_length is the employment tenure of the borrower and is a significant variable while assessing loan credibility.
We will further inspect pub_rec_bankruptcies column.For the rest of the columns since the number of missing values is low we will not drop the rows where there are any missing values present.

In [13]:
data_clean["pub_rec_bankruptcies"].value_counts(normalize=True,dropna=False)

0.0    0.939438
1.0    0.042456
NaN    0.017978
2.0    0.000129
Name: pub_rec_bankruptcies, dtype: float64

As we can see above the pub_rec_bankruptcies feature has has one value present in ~94% of the data. Since this column is not value adding and also has missing values we will be getting rid of this column.

In [14]:
# Drop the pub_rec_bankruptcies column
data_clean.drop("pub_rec_bankruptcies",axis=1,inplace=True)

# Drop rows with missing values
data_clean.dropna(inplace=True)

#### Converting categorical columns to numeric

In [15]:
# Print the counts for each dtype

data_clean.dtypes.value_counts()

object     11
float64    10
int64       1
dtype: int64

From the above we can see that we have 11 numeric columns and 11 object columns. For modelling purposes we will need to convert these columns to numeric so that they can be used in our model.

In [16]:
# Sorting only the object dtype columns
object_dtype_cols=data_clean.select_dtypes(include=['object'])
object_dtype_cols.head()

Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,revol_util,last_credit_pull_d
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,83.7%,Jun-2016
1,60 months,15.27%,< 1 year,RENT,Source Verified,car,bike,GA,Apr-1999,9.4%,Sep-2013
2,36 months,15.96%,10+ years,RENT,Not Verified,small_business,real estate business,IL,Nov-2001,98.5%,Jun-2016
3,36 months,13.49%,10+ years,RENT,Source Verified,other,personel,CA,Feb-1996,21%,Apr-2016
5,36 months,7.90%,3 years,RENT,Source Verified,wedding,My wedding loan I promise to pay back,AZ,Nov-2004,28.3%,Jan-2016


From the above we can see that:

1. int_rate and revol_util columns are actually numeric but are stored as categorical due to % sign
2. earliest_cr_line and last_credit_pull_d are actually dates. We will need to perform good amount of feature engineering to get some information from these columns. Hence we will drop these columns.
3. The rest of the columns seem to be categorical. We will explore their unique value counts.

In [17]:
# Drop the earliest_cr_line and last_credit_pull_d columns
data_clean.drop(["earliest_cr_line","last_credit_pull_d"],axis=1,inplace=True)

In [18]:
cols=object_dtype_cols.columns.drop(["int_rate","revol_util","earliest_cr_line","last_credit_pull_d"])

for col in cols:
    val_cnt=object_dtype_cols[col].value_counts(normalize=True)
    print(col)
    print(val_cnt)
    print("----------------------")

term
 36 months    0.749409
 60 months    0.250591
Name: term, dtype: float64
----------------------
emp_length
10+ years    0.226808
< 1 year     0.119788
2 years      0.114214
3 years      0.106755
4 years      0.088998
5 years      0.084990
1 year       0.084300
6 years      0.057784
7 years      0.045494
8 years      0.038275
9 years      0.032595
Name: emp_length, dtype: float64
----------------------
home_ownership
RENT        0.480743
MORTGAGE    0.442893
OWN         0.073736
OTHER       0.002548
NONE        0.000080
Name: home_ownership, dtype: float64
----------------------
verification_status
Not Verified       0.432143
Verified           0.314691
Source Verified    0.253165
Name: verification_status, dtype: float64
----------------------
purpose
debt_consolidation    0.471161
credit_card           0.130352
other                 0.098500
home_improvement      0.074532
major_purchase        0.055289
small_business        0.045627
car                   0.038726
wedding         

From the above we can make following conclusions:

1. term and emp_length columns have some ordering associated with it and hence we will be converting those to numeric
   For emp_length we will be assuming that <1 year of experience will ve 1 year and 10+ years will be 10 years. This is general heuristic but it is not perfect.
2. purpose and title look to have similar information which is mentioned in data dictionary as well. However purpose has fewer features hence we will go ahead with that column
3. The addr_state column also has too many values which will further increase the size of our dataset.Hence we will be getting rid of this column as well.

In [19]:
# Drop the columns addr_state and title
data_clean.drop(["title","addr_state"],axis=1,inplace=True)

In [20]:

# Cleaning the term,int_rate and revol_util columns

def rem_chars(column,character):
    '''
    Function will take in one column and specified character as a input and then strip the specified character from the value''' 
    data_clean[column]=data_clean[column].str.strip(character)
    '''
    Convert the column to float'''
    data_clean[column]=data_clean[column].astype("float")
    
rem_chars("term","months")
rem_chars("int_rate","%")
rem_chars("revol_util","%")


In [21]:
# Cleaning the emp_length columns
# Below less than 1 year and 1 year will be the same as 1 year
import re
data_clean["emp_length"]=data_clean["emp_length"].str.extract(r"(\d+)",expand=False)
data_clean["emp_length"].value_counts()


10    8545
1     7689
2     4303
3     4022
4     3353
5     3202
6     2177
7     1714
8     1442
9     1228
Name: emp_length, dtype: int64

In [22]:
# Encoding the categorical varibles

# Get the dummy variables
dummy_cols=pd.get_dummies(data_clean[["home_ownership","verification_status","purpose"]])

# Concat them to the original dataframe
data_clean=pd.concat([dummy_cols,data_clean],axis=1)


# Drop the original columns
#Also drop one column among the dummy variable columns or else it will show problem of multicollinearity
data_clean.drop(["home_ownership","verification_status","purpose",'home_ownership_NONE','verification_status_Not Verified',
                'purpose_other'],axis=1,inplace=True)

In [23]:
data_clean.columns

Index(['home_ownership_MORTGAGE', 'home_ownership_OTHER', 'home_ownership_OWN',
       'home_ownership_RENT', 'verification_status_Source Verified',
       'verification_status_Verified', 'purpose_car', 'purpose_credit_card',
       'purpose_debt_consolidation', 'purpose_educational',
       'purpose_home_improvement', 'purpose_house', 'purpose_major_purchase',
       'purpose_medical', 'purpose_moving', 'purpose_renewable_energy',
       'purpose_small_business', 'purpose_vacation', 'purpose_wedding',
       'loan_amnt', 'term', 'int_rate', 'installment', 'emp_length',
       'annual_inc', 'loan_status', 'dti', 'delinq_2yrs', 'inq_last_6mths',
       'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc'],
      dtype='object')

# Making Predictions

Now that we have the data ready in a clean and model ready format we can start making our predictions.As we have already mentioned that this is a binary classification problem.We can start with a logistic regression model.

However before we start making any predictions let's divide the features and the target column seperately so that it will be easier to use.


In [24]:
# Divide features and target data
X=data_clean.drop("loan_status",axis=1) # features
Y=data_clean["loan_status"]             # target

In [25]:
data_clean.dtypes

home_ownership_MORTGAGE                  uint8
home_ownership_OTHER                     uint8
home_ownership_OWN                       uint8
home_ownership_RENT                      uint8
verification_status_Source Verified      uint8
verification_status_Verified             uint8
purpose_car                              uint8
purpose_credit_card                      uint8
purpose_debt_consolidation               uint8
purpose_educational                      uint8
purpose_home_improvement                 uint8
purpose_house                            uint8
purpose_major_purchase                   uint8
purpose_medical                          uint8
purpose_moving                           uint8
purpose_renewable_energy                 uint8
purpose_small_business                   uint8
purpose_vacation                         uint8
purpose_wedding                          uint8
loan_amnt                              float64
term                                   float64
int_rate     

### Logistic Regression Model

In [26]:
# Building the model 
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
# Using cross validation scores to validate the model scores

def cross_validate(X,Y,k):
    '''
    Function takes in input features,target,number of folds'''
    kf=KFold(n_splits=k,shuffle=True,random_state=10)
    fold_accuracies=[]
    for train_index,test_index in kf.split(data_clean):
        sc=StandardScaler()
        X=pd.DataFrame(sc.fit_transform(X))
        X_train,X_test=X.iloc[train_index],X.iloc[test_index]
        Y_train,Y_test=Y.iloc[train_index],Y.iloc[test_index]
        lr=LogisticRegression()
        model=lr.fit(X_train,Y_train)
        pred=model.predict(X_test)
        acc=accuracy_score(pred,Y_test)
        fold_accuracies.append(acc)
    return fold_accuracies
 
    

In [27]:
cross_validate(X,Y,4)

[0.8551863255122625,
 0.8587960505361504,
 0.8563541777258732,
 0.8567636440857932]

#### Class Imbalance Problem

As we can see above the accuracy of the model is close to ~85%. However one thing which we need to check is for class imbalance which is the proportions of 1's and 0's in the target column. 

In [28]:
data_clean["loan_status"].value_counts(normalize=True)

1    0.856961
0    0.143039
Name: loan_status, dtype: float64

As we can see above ~85% of our target column has 1 and the rest 14% have 0. 1 will mean the borrower has paid off the loans in time while 0 means the borrower is a defaulter. With such a high proportion of 1's in our data our model can get biased towards predicting 1.Even in the case our model randomly predicts 1 for every single row it will still give the accuracy value as 85% which is misleading.

For our problem statement instead of predicting correctly the borrowers who will not default we are interested to predict people who will default since that will be a huge loss for the investors. Hence we should optimize for the below parameters:
1. high [recall](https://en.wikipedia.org/wiki/Precision_and_recall#Recall) (True Positive Rate) 
2. low [fallout](https://en.wikipedia.org/wiki/Information_retrieval#Fall-out) (False Positive Rate)

True Positive Rate is the percentage of loans which were funded as per model and did not default in actual.
False Postive Rate is the percentage of loans which were not funded as per model but defaulted in actual.

Generally if we reduce False Positive Rate it will reduce True Positive Rate as well. This is because it will if we want to reduce risk of false positives we will not want to sanction risky loan aaplications. 

Now that we know the above we can calculate the relevant metrics for our model and then look at the model score.

In [29]:
# Using cross validation scores to validate the model scores with TPR and FPR

def cross_validate_tpr_fpr(X,Y,k):
    '''
    Function takes in input features,target,number of folds'''
    kf=KFold(n_splits=k,shuffle=True,random_state=10)
    tpr_metrics=[]
    fpr_metrics=[]
    for train_index,test_index in kf.split(data_clean):
        sc=StandardScaler()
        X=pd.DataFrame(sc.fit_transform(X))
        X_train,X_test=X.iloc[train_index],X.iloc[test_index]
        Y_train,Y_test=Y.iloc[train_index],Y.iloc[test_index]
        lr=LogisticRegression()
        model=lr.fit(X_train,Y_train)
        predictions=model.predict(X_test)
        
        # False Positives
        fp_filter = (predictions == 1) & (Y_test== 0)
        fp = len(predictions[fp_filter])

        # True positives.
        tp_filter = (predictions == 1) & (Y_test== 1)
        tp = len(predictions[tp_filter])

        # False negatives.
        fn_filter = (predictions == 0) & (Y_test== 1)
        fn = len(predictions[fn_filter])

        # True negatives
        tn_filter = (predictions == 0) & (Y_test== 0)
        tn = len(predictions[tn_filter])
        
        # Rates
        tpr = tp  / (tp + fn)
        fpr = fp  / (fp + tn)
        
        tpr_metrics.append(tpr)
        fpr_metrics.append(fpr)
    print("tpr_metrics")
    print(tpr_metrics)
    print("------------------")
    print("fpr_metrics")
    print(fpr_metrics)

In [30]:
cross_validate_tpr_fpr(X,Y,4)

tpr_metrics
[0.99603026919737, 0.9970330077883546, 0.9970219630227075, 0.9962857496595271]
------------------
fpr_metrics
[0.9808541973490427, 0.9819548872180451, 0.9772058823529411, 0.9835943325876212]


As we can see above the model has exactly behaved the same way it would if it randomly assigned 1's to every row. If we would have gone by accuracy metric our model was supposed to be ~85% accurate. However if we look at the False Positive rates the model has predicted ~98% values as non defaulters who in reality defaulted the loan. Hence due to the class imbalance problem or model is not able to predict the defaulters well which is our priority.

#### Dealing with the class imbalance problem

1. Use oversampling and undersampling to preserve the ratio of both the classes in the data
2. Penalize the classifier for incorrectly predicting the less relevant class more than the other class.

The dowside of the first technique is that:

1. Throw out a large proportion of the data to preserve the ratio i.e. delete rows with 1's.
2. Copy rows multiple times i.e. copy the rows with 0's multiple times to preserve the ratio.
3. Generate fake data i.e generate additional new rows for 0's to equalize the ratio.

Unfortunately none of the above technique is easy to implement. Hence we will conside the penalizing the classifier to improve the False Positive Rate. This can be done by setting the class_weight parameter to balanced.The penalty is set to be inversely proportional to the class frequencies.

In [31]:
def cross_validate_imblance(X,Y,k):
    '''
    Function takes in input features,target,number of folds'''
    kf=KFold(n_splits=k,shuffle=True,random_state=10)
    tpr_metrics=[]
    fpr_metrics=[]
    for train_index,test_index in kf.split(data_clean):
        sc=StandardScaler()
        X=pd.DataFrame(sc.fit_transform(X))
        X_train,X_test=X.iloc[train_index],X.iloc[test_index]
        Y_train,Y_test=Y.iloc[train_index],Y.iloc[test_index]
        lr=LogisticRegression(class_weight='balanced')
        model=lr.fit(X_train,Y_train)
        predictions=model.predict(X_test)
        
        # False Positives
        fp_filter = (predictions == 1) & (Y_test== 0)
        fp = len(predictions[fp_filter])

        # True positives.
        tp_filter = (predictions == 1) & (Y_test== 1)
        tp = len(predictions[tp_filter])

        # False negatives.
        fn_filter = (predictions == 0) & (Y_test== 1)
        fn = len(predictions[fn_filter])

        # True negatives
        tn_filter = (predictions == 0) & (Y_test== 0)
        tn = len(predictions[tn_filter])
        
        # Rates
        tpr = tp  / (tp + fn)
        fpr = fp  / (fp + tn)
        
        tpr_metrics.append(tpr)
        fpr_metrics.append(fpr)
    print("tpr_metrics")
    print(tpr_metrics)
    print("------------------")
    print("fpr_metrics")
    print(fpr_metrics)

cross_validate_imblance(X,Y,4)

tpr_metrics
[0.6537650415581193, 0.6702929904809, 0.6648467551805435, 0.6657174693574347]
------------------
fpr_metrics
[0.35125184094256257, 0.37218045112781956, 0.37720588235294117, 0.37658463832960476]


Above we can see that we have significantly reduced the False Positive Rate to ~38%.However our True Positive Rate has been decreased to ~60%. From a conservative investor stand point it is good that the fpr is lower because we woud do a good job predicting bad loans than fund them all.

We can attempt to improve the False Positive Rate further by increasing the penalty. Currently the penalty is assigned basis the frequency of the 1's and 0's(It would be around 5.89 since there are 5.89 times 1's as many 0's).However we can manually assign the penalties at a higher rate to further improve the False Positive Rate.

In [32]:
# Manually assigning the penalties

penalty={0:10,1:1}

def cross_validate_imblance_incr_penalty(X,Y,k):
    '''
    Function takes in input features,target,number of folds'''
    kf=KFold(n_splits=k,shuffle=True,random_state=10)
    tpr_metrics=[]
    fpr_metrics=[]
    for train_index,test_index in kf.split(data_clean):
        sc=StandardScaler()
        X=pd.DataFrame(sc.fit_transform(X))
        X_train,X_test=X.iloc[train_index],X.iloc[test_index]
        Y_train,Y_test=Y.iloc[train_index],Y.iloc[test_index]
        lr=LogisticRegression(class_weight=penalty)
        model=lr.fit(X_train,Y_train)
        predictions=model.predict(X_test)
        
        # False Positives
        fp_filter = (predictions == 1) & (Y_test== 0)
        fp = len(predictions[fp_filter])

        # True positives.
        tp_filter = (predictions == 1) & (Y_test== 1)
        tp = len(predictions[tp_filter])

        # False negatives.
        fn_filter = (predictions == 0) & (Y_test== 1)
        fn = len(predictions[fn_filter])

        # True negatives
        tn_filter = (predictions == 0) & (Y_test== 0)
        tn = len(predictions[tn_filter])
        
        # Rates
        tpr = tp  / (tp + fn)
        fpr = fp  / (fp + tn)
        
        tpr_metrics.append(tpr)
        fpr_metrics.append(fpr)
    print("tpr_metrics")
    print(tpr_metrics)
    print("------------------")
    print("fpr_metrics")
    print(fpr_metrics)

cross_validate_imblance_incr_penalty(X,Y,4)

tpr_metrics
[0.3974692966133234, 0.39201384596365435, 0.40612979277826033, 0.4051009038009162]
------------------
fpr_metrics
[0.14285714285714285, 0.14887218045112782, 0.1588235294117647, 0.1558538404175988]


As we can see manually assigning the penalties has further reduced our False Positive Rate to ~14%. However note that this has come at the cost of reduction of true positive rate as well. While we are reducing the chances of a loan being defaulted we are ultimately making less money by rejecting potential non defaulters as well.Hence we need to be mindful of this tradeoff as well.

We can further try looking at the feature importance and consider only those features relevant to our model.

#### Feature Importance using Recursive Feature Elimination

Till now we have used all the features present in the model. Further we can try selecting the most relevant features and try building the model using these features only. To do so we can use the RFECV - Recursive Feature Elimination module from python.

In [33]:
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split


#Scaling X
sc=StandardScaler()
X_scaled=pd.DataFrame(sc.fit_transform(X))

#Dividing the data into train and test
X_train,X_test,Y_train,Y_test=train_test_split(X_scaled,Y,test_size=0.2)

# Using RFECV to select the best features
lr= LogisticRegression(class_weight='balanced')
selector = RFECV(lr,cv=10)  # cv is no of folds for cross validaton
selector.fit(X_train,Y_train)

# Building the model using the best features given by RFECV
model_RFECV=lr.fit(X_train[X_train.columns[selector.support_]],Y_train) #support will give index of cols selected as best features

predictions_RFECV=model_RFECV.predict(X_test[X_test.columns[selector.support_]]) 

# False Positives
fp_filter = (predictions_RFECV == 1) & (Y_test== 0)
fp = len(predictions_RFECV[fp_filter])

# True positives.
tp_filter = (predictions_RFECV == 1) & (Y_test== 1)
tp = len(predictions_RFECV[tp_filter])

# False negatives.
fn_filter = (predictions_RFECV == 0) & (Y_test== 1)
fn = len(predictions_RFECV[fn_filter])

# True negatives
tn_filter = (predictions_RFECV == 0) & (Y_test== 0)
tn = len(predictions_RFECV[tn_filter])

# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)


print("tpr")
print(tpr)
print("------------------")
print("fpr")
print(fpr)


tpr
0.6573275862068966
------------------
fpr
0.38209817131857554


#### Oversampling

We can see above that we do not really have a siginificant improvement in the TPR and FPR rates. Since we have the 0 class in lower proportion we can try oversampling this class and create additional data and try building the model.

To oversample we will be using the SMOTE technique.SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority class and computing the k-nearest neighbors for this point. The synthetic points are added between the chosen point and its neighbors.


In [34]:
from imblearn.over_sampling import SMOTE

In [35]:
from imblearn.over_sampling import SMOTE

smote = SMOTE()

# fit predictor and target variable
X_smote, Y_smote = smote.fit_resample(X,Y)

print(X.shape)
print(Y.shape)

(37675, 33)
(37675,)


In [36]:
Y_smote.value_counts()

0    32286
1    32286
Name: loan_status, dtype: int64

As we can see above we now have equal counts of 0 and 1 in the data. Let's try building the model on this data. We can directly use the cross_validate_tpr_fpr function we created in the above cells.

In [37]:
cross_validate_tpr_fpr(X_smote,Y_smote,4)

tpr_metrics
[0.99603026919737, 0.9970330077883546, 0.9970219630227075, 0.9961619413148446]
------------------
fpr_metrics
[0.9808541973490427, 0.9819548872180451, 0.9772058823529411, 0.9835943325876212]


#### Conclusion for Logistic Regression model

As we can see above the tpr is ~67% and fpr is ~37%. The model even after selecting the best features has not really shown any improvement over the initial model we build while dealing with class imbalance. While the above model can be considered good enough but we still are loosing out on potential customers with our true positive rate being slightly on the lower side.

We can further try other models and check if they are giving better accuracy compared to the Logistic Regresison model.

### Random Forest Model

Random Forests are known to give better accuracy compared to other models. However they also tend to overfit as well. We will verify our results using cross validation so as to look for the occurence of overfiiting. 

Initially we will fit a Random Forest model using the default parameters. Later on we can try using hyper parameter tuning to look for any improvement.

In [38]:

from sklearn.ensemble import RandomForestClassifier

def cross_validate_tpr_fpr_rf(X,Y,k):
    '''
    Function takes in input features,target,number of folds'''
    kf=KFold(n_splits=k,shuffle=True,random_state=10)
    tpr_metrics=[]
    fpr_metrics=[]
    for train_index,test_index in kf.split(data_clean):
        X_train,X_test=X.iloc[train_index],X.iloc[test_index]
        Y_train,Y_test=Y.iloc[train_index],Y.iloc[test_index]
        rf=RandomForestClassifier()
        model=rf.fit(X_train,Y_train)
        predictions=model.predict(X_test)
        
        # False Positives
        fp_filter = (predictions == 1) & (Y_test== 0)
        fp = len(predictions[fp_filter])

        # True positives.
        tp_filter = (predictions == 1) & (Y_test== 1)
        tp = len(predictions[tp_filter])

        # False negatives.
        fn_filter = (predictions == 0) & (Y_test== 1)
        fn = len(predictions[fn_filter])

        # True negatives
        tn_filter = (predictions == 0) & (Y_test== 0)
        tn = len(predictions[tn_filter])
        
        # Rates
        tpr = tp  / (tp + fn)
        fpr = fp  / (fp + tn)
        
        tpr_metrics.append(tpr)
        fpr_metrics.append(fpr)
    print("tpr_metrics")
    print(tpr_metrics)
    print("------------------")
    print("fpr_metrics")
    print(fpr_metrics)
    
cross_validate_tpr_fpr_rf(X,Y,4)

tpr_metrics
[0.9981391886862672, 0.9967857584373842, 0.9968978781486537, 0.9977714497957162]
------------------
fpr_metrics
[0.9867452135493373, 0.9864661654135338, 0.9742647058823529, 0.9888143176733781]


As we can see above Random Forest is also showing up the same issue similar to Logistic Regression model. It is giving a very good True Positive Rate ~99% and a very high False Positive Rate ~98%. We can try to hypertune the paramters to check if that improves the accuracy of the model.

For hyperparameter tuning we can use the RandomizedSearchCV library in python which will reandomly check the parameters we will be passing in the grid and throw out the best model for us.

In [39]:
from sklearn.model_selection import RandomizedSearchCV

# Create the different values for features

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 5)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 5)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf
              }

print(random_grid)

{'n_estimators': [200, 650, 1100, 1550, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 35, 60, 85, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}


In [1]:
# Use the above grid to search for best parameters
rf=RandomForestClassifier()

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 10, 
                               cv = 4, verbose=2, random_state=42, n_jobs = -1)
            # n_iter is no of combinations we want and cv is no of folds for cross validation

#Dividing the data into train and test
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)

rf_random.fit(X_train,Y_train)

NameError: name 'RandomForestClassifier' is not defined

In [41]:
# Get the model with the best paramters
best_rf_model=rf_random.best_estimator_

predictions_best_rf=best_rf_model.predict(X_test) 

# False Positives
fp_filter = (predictions_best_rf == 1) & (Y_test== 0)
fp = len(predictions_best_rf[fp_filter])

# True positives.
tp_filter = (predictions_best_rf == 1) & (Y_test== 1)
tp = len(predictions_best_rf[tp_filter])

# False negatives.
fn_filter = (predictions_best_rf == 0) & (Y_test== 1)
fn = len(predictions_best_rf[fn_filter])

# True negatives
tn_filter = (predictions_best_rf == 0) & (Y_test== 0)
tn = len(predictions_best_rf[tn_filter])

# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)


print("tpr")
print(tpr)
print("------------------")
print("fpr")
print(fpr)


tpr
0.9987596899224807
------------------
fpr
0.9953917050691244


#### Conclusion for RandomForestClassifier

As we can see above the random forest classifier does not seem to do a good job with lowering the False Positive Rate even after hyper parameter tuning.

We need to try some other model and check if there is any improvement in the model. Futher we will try using support vector machines and check if there is any improvement in the model. 

### Support Vector Machines

SVM is a supervised algorithm used to classify a two group problem. Further we will try to implement the SVM algorithm below to check if it improves the accuracy of the model.

In [42]:
# First we will scale the data and split it into train and test
# We have already imported the required libraries above.

#Scaling X
sc=StandardScaler()
X_scaled=pd.DataFrame(sc.fit_transform(X))

#Dividing the data into train and test
X_train,X_test,Y_train,Y_test=train_test_split(X_scaled,Y,test_size=0.2)

from sklearn.svm import SVC
sv=SVC(kernel="rbf",random_state=0,class_weight='balanced')  # class weight balanced to account for class imbalance
svc_model=sv.fit(X_train,Y_train)


In [43]:
predictions_svm=svc_model.predict(X_test) 

# False Positives
fp_filter = (predictions_svm == 1) & (Y_test== 0)
fp = len(predictions_svm[fp_filter])

# True positives.
tp_filter = (predictions_svm == 1) & (Y_test== 1)
tp = len(predictions_svm[tp_filter])

# False negatives.
fn_filter = (predictions_svm == 0) & (Y_test== 1)
fn = len(predictions_svm[fn_filter])

# True negatives
tn_filter = (predictions_svm == 0) & (Y_test== 0)
tn = len(predictions_svm[tn_filter])

# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)


print("tpr")
print(tpr)
print("------------------")
print("fpr")
print(fpr)


tpr
0.6626786824114357
------------------
fpr
0.3958143767060964
