# Credit Modelling

In this project we will focus on credit modelling, which is concerned with modelling a borrower's [credit risk](https://en.wikipedia.org/wiki/Credit_risk). We'll be working with financial lending data from [Lending Club](https://www.lendingclub.com/). Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. You can read more about their marketplace [here](https://www.lendingclub.com/public/how-peer-lending-works.action).

Each borrower completes a comprehensive application, providing their past financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score using past historical data and their own data science process to assign an interest rate to the borrower. The interest rate determines how much the borrower has to pay back in addition to the requested loan amount. You can read more about the interest rate that Lending Club assigns [here](https://www.lendingclub.com/public/borrower-rates-and-fees.action).

A higher interest rate means that the borrower is considered riskier and less likely to pay back the loan, while a lower interest rate means that the borrower has a good credit history and is more likely to pay back the loan. The interest rates range from 5.32% all the way to 30.99%, and each borrower is given a [grade](https://www.lendingclub.com/public/rates-and-fees.action) according to the interest rate they were assigned. If the borrower accepts the interest rate, then the loan is listed on the Lending Club marketplace.

Investors are primarily interested in receiving a return on their investments. Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower's credit score, the purpose for the loan, and other information from the application. Once they're ready to back a loan, they select the amount of money they want to fund. Once a loan's requested amount is fully funded, the borrower receives the money they requested minus the [origination fee](https://help.lendingclub.com/hc/en-us/articles/214501207-What-is-the-origination-fee-) that Lending Club charges.

The borrower will make monthly payments back to Lending Club either over 36 months or over 60 months. Lending Club redistributes these payments to the investors. This means that investors don't have to wait until the full amount is paid off before they see a return in money. If a loan is fully paid off on time, the investors make a return which corresponds to the interest rate the borrower had to pay in addition to the requested amount. Many loans aren't completely paid off on time and some borrowers default on the loan.
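To make the monthly payments concrete, here is a minimal sketch of the standard amortization formula for a fixed-rate loan. It assumes that Lending Club's installments follow this formula (the text above does not state it), and the function name `monthly_installment` is purely illustrative. The inputs come from the first row of the dataset preview shown later in this project: a loan of 5,000 over 36 months at 10.65%.

```python
def monthly_installment(principal, annual_rate, n_months):
    """Fixed monthly payment under the standard amortization formula.

    Assumption: equal monthly payments with interest compounding monthly.
    """
    monthly_rate = annual_rate / 12
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -n_months)


# Values taken from the first row of the dataset preview
payment = monthly_installment(5000, 0.1065, 36)
total_repaid = payment * 36

print(round(payment, 2))       # monthly payment
print(round(total_repaid, 2))  # total paid back if every installment is made on time
```

Under these assumptions the payment comes to about 162.87, which agrees with the `installment` value recorded for that loan. The difference between `total_repaid` and the 5,000 principal is the interest that gets redistributed to the investors.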

Below is a diagram that sums up the process:

![image](http://cdn.biblemoneymatters.com/wp-content/uploads/2009/08/how-social-lending-works.jpg)


While Lending Club has to be extremely savvy and rigorous with their credit modelling, investors on Lending Club need to be equally savvy about determining which loans are more likely to be paid off. At first, you may wonder why investors put money into anything but low interest loans. The incentive investors have to back higher interest loans is, well, the higher interest! If investors believe the borrower can pay back the loan, even if he or she has a weak financial history, then investors can make more money through the larger additional amount the borrower has to pay.

Most investors use a portfolio strategy to invest small amounts in many loans, with healthy mixes of low, medium, and high interest loans. In this project, we'll focus on the mindset of a conservative investor who only wants to invest in the loans that have a good chance of being paid off on time. To do that, we'll need to first understand the features in the dataset and then experiment with building machine learning models that reliably predict if a loan will be paid off or not.

# Defining the objective and gathering the data

Lending Club releases data for all of the approved and declined loan applications periodically on their website. You can select different year ranges to download the datasets (in CSV format) for both approved and declined loans. 

A data dictionary for these datasets is available [here](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit). The LoanStats sheet describes the approved loans datasets and the RejectStats sheet describes the rejected loans datasets. Since rejected applications don't appear on the Lending Club marketplace and aren't available for investment, we'll be focusing on approved loans.

The approved loans datasets contain information on current loans, completed loans, and defaulted loans. We need to build a predictive model that predicts whether the borrower will pay off the loan on time.

In this project, we will focus on approved loans data from 2007 to 2011, since a good number of the loans have already finished. In the datasets for later years, many of the loans are current and still being paid off.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


# Import the data

data=pd.read_csv("loans_2007.csv")

#Drop duplicate rows if any

data.drop_duplicates(inplace=True)

data.head()





Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [2]:
data.shape

(42538, 52)

In [3]:
data.dtypes

id                             object
member_id                     float64
loan_amnt                     float64
funded_amnt                   float64
funded_amnt_inv               float64
term                           object
int_rate                       object
installment                   float64
grade                          object
sub_grade                      object
emp_title                      object
emp_length                     object
home_ownership                 object
annual_inc                    float64
verification_status            object
issue_d                        object
loan_status                    object
pymnt_plan                     object
purpose                        object
title                          object
zip_code                       object
addr_state                     object
dti                           float64
delinq_2yrs                   float64
earliest_cr_line               object
inq_last_6mths                float64
open_acc    

# Data Cleaning and Preparation

Now that the data has been read in, we need to clean the dataset before modelling. Using the data dictionary, we will look for columns that fall into any of the following categories:

1. Columns that do not affect the target column. For example, an id assigned by Lending Club will not help to determine the credit risk.
2. Columns that disclose information from after the loan has been funded. We need to decide up front whether to approve the loan for the borrower, so this information will not be useful.
3. Columns that contain redundant information.
4. Columns that are poorly formatted and require cleaning before they can be used.
5. Columns that require more data or a lot of processing before they can be used as a feature.


#### Removing Redundant features

After analyzing the columns we can say that the following columns should be removed from analysis:

| Column                  	| Reason for removal                                                            	|
|-------------------------	|-------------------------------------------------------------------------------	|
| id                      	| generated for unique identification                                           	|
| member_id               	| generated for unique identification                                           	|
| funded_amnt             	| available after loan is sanctioned                                            	|
| funded_amnt_inv         	| available after loan is sanctioned                                            	|
| grade                   	| redundant since based on interest rate                                        	|
| sub_grade               	| redundant since based on interest rate                                        	|
| emp_title               	| requires other data and processing to become useful                           	|
| issue_d                 	| available after loan is sanctioned                                            	|
| zip_code                	| redundant with addr_state column because only first three digits are visible 	|
| out_prncp               	| available after loan is sanctioned                                            	|
| out_prncp_inv           	| available after loan is sanctioned                                            	|
| total_pymnt             	| available after loan is sanctioned                                            	|
| total_pymnt_inv         	| available after loan is sanctioned                                            	|
| total_rec_prncp         	| available after loan is sanctioned                                            	|
| total_rec_int           	| available after loan is sanctioned                                            	|
| total_rec_late_fee      	| available after loan is sanctioned                                            	|
| recoveries              	| available after loan is sanctioned                                            	|
| collection_recovery_fee 	| available after loan is sanctioned                                            	|
| last_pymnt_d            	| available after loan is sanctioned                                            	|
| last_pymnt_amnt         	| available after loan is sanctioned                                            	|

grade and sub_grade are assigned based on the interest rate. Since the interest rate is continuous and these two columns are categorical, we will retain the interest rate and drop grade and sub_grade.
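One way to see this redundancy is to convert the interest rate to a number and look at its range within each grade. The sketch below is illustrative only: it uses a toy frame built from the five preview rows shown earlier rather than the full dataset, and the names `sample` and `int_rate_num` are made up for this example.

```python
import pandas as pd

# Toy frame copied from the five preview rows (not the full dataset)
sample = pd.DataFrame({
    "int_rate": ["10.65%", "15.27%", "15.96%", "13.49%", "12.69%"],
    "grade": ["B", "C", "C", "C", "B"],
})

# int_rate is stored as text such as "10.65%", so strip the sign before casting
sample["int_rate_num"] = sample["int_rate"].str.rstrip("%").astype(float)

# Range of interest rates observed within each grade
rate_range_by_grade = sample.groupby("grade")["int_rate_num"].agg(["min", "max"])
print(rate_range_by_grade)
```

In these five rows the grade B rates all sit below the grade C rates. Running the same aggregation on the full data would show whether the grades really partition the interest rate range as described.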

In [4]:
# Drop the above columns

data_clean=data.drop(["id","member_id","funded_amnt","funded_amnt_inv","grade","sub_grade","emp_title","issue_d","zip_code","out_prncp",
                     "out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee",
                     "recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt"],axis=1).copy()

In [5]:
data_clean.shape

(42538, 32)

#### Identifying and cleaning the target column

After exploring the columns we know that the *loan_status* column will be our target column. Let's explore this column.

In [6]:
data_clean["loan_status"].unique()

array(['Fully Paid', 'Charged Off', 'Current', 'In Grace Period',
       'Late (31-120 days)', 'Late (16-30 days)', 'Default', nan,
       'Does not meet the credit policy. Status:Fully Paid',
       'Does not meet the credit policy. Status:Charged Off'],
      dtype=object)

As we can see above, the column contains information on whether the loan was fully paid, has delayed payments, or was defaulted on by the borrower.

In [7]:
data_clean["loan_status"].value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

There are 9 different values in the target column. We can get more information on these values from the [Lending Club website](https://help.lendingclub.com/hc/en-us/articles/215488038-What-do-the-different-Note-statuses-mean-). The "Does not meet the credit policy" status is not explained on the site, but its meaning can be found by searching online.

Below is an explanation of the values:


|                     Loan   Status                     	|                                                                        Meaning                                                                        	|
|:-----------------------------------------------------:	|:-----------------------------------------------------------------------------------------------------------------------------------------------------:	|
| Fully Paid                                            	| Loan has been fully paid off.                                                                                                                         	|
| Charged Off                                           	| Loan for which there is no longer a   reasonable expectation of further payments.                                                                     	|
| Does not meet the credit policy.   Status:Fully Paid  	| While the loan was paid off, the loan   application today would no longer meet the credit policy and wouldn't be   approved on to the marketplace.    	|
| Does not meet the credit policy.   Status:Charged Off 	| While the loan was charged off, the loan   application today would no longer meet the credit policy and wouldn't be   approved on to the marketplace. 	|
| In Grace Period                                       	| The loan is past due but still in the grace   period of 15 days.                                                                                      	|
| Late (16-30 days)                                     	| Loan hasn't been paid in 16 to 30 days   (late on the current payment).                                                                               	|
| Late (31-120 days)                                    	| Loan hasn't been paid in 31 to 120 days   (late on the current payment).                                                                              	|
| Current                                               	| Loan is up to date on current payments.                                                                                                               	|
| Default                                               	| Loan is defaulted on and no payment has been made for more than 121 days.                                                                             	|



From the investor's perspective, we're interested in trying to predict whether loans will be paid off on time. Only the Fully Paid and Charged Off values describe the final outcome of the loan. The other values describe loans that are still ongoing and where the jury is still out on if the borrower will pay back the loan on time or not. While the Default status resembles the Charged Off status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while default ones have a small chance.

Since we are interested only in the two values Fully Paid and Charged Off, we can treat this as a binary classification problem. We will remove all the rows which do not contain either of these two values. Once that is done we will code the Fully Paid rows as 1 and Charged Off rows as 0.

In [8]:
# Retain only the rows whose status is Fully Paid or Charged Off
# (.copy() gives an independent frame so later assignments do not act on a slice)
data_clean=data_clean[data_clean["loan_status"].isin(["Fully Paid","Charged Off"])].copy()

In [9]:
# Convert the Fully Paid and Charged Off values to numeric labels

mapping_dict={"Fully Paid":1,"Charged Off":0}

# map() is sufficient here because only these two values remain after filtering
data_clean["loan_status"]=data_clean["loan_status"].map(mapping_dict)
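As a quick sanity check on the encoded target, the share of each class can be worked out from the counts reported by `value_counts()` earlier. The snippet below is only arithmetic on those published counts, not a new computation on the data.

```python
# Counts copied from the loan_status value_counts() output shown earlier
counts = {"Fully Paid": 33136, "Charged Off": 5634}

total = sum(counts.values())
shares = {status: n / total for status, n in counts.items()}

for status, share in shares.items():
    print(f"{status}: {share:.1%}")
```

Roughly 85% of the remaining loans are fully paid and 15% are charged off, so the two classes are far from balanced. That is worth keeping in mind when a model is evaluated later.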

#### Removing features with one single value

Features that contain only a single value carry no information for the model, so we will remove them.

In [10]:
# Count the distinct non-null values in each column and drop columns with only one


def uniq_val_cnts(feature):
    '''Return the number of unique non-null values in a column.'''
    # Drop NaN first because unique() counts NaN as a distinct value
    feature=feature.dropna()
    return len(feature.unique())


# Collect the single-valued columns first, then drop them in one call
single_value_cols=[col for col in data_clean.columns if uniq_val_cnts(data_clean[col])==1]
data_clean=data_clean.drop(single_value_cols,axis=1)

In [11]:
data_clean.shape

(38770, 23)
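Missing values have to be removed before counting because `unique()` treats NaN as a value of its own. The small sketch below illustrates this on a made-up column (`col` is not part of the dataset) and shows that pandas' `Series.nunique()` excludes NaN by default, so it could serve as an alternative to dropping NaN manually.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one real value plus a missing entry
col = pd.Series([1.0, 1.0, np.nan])

n_with_nan = len(col.unique())              # NaN is counted as a distinct value
n_after_dropna = len(col.dropna().unique())
n_nunique = col.nunique()                   # dropna=True by default

print(n_with_nan, n_after_dropna, n_nunique)
```

Without the NaN handling, this column would look like it has two values and would be kept by mistake.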

#### Dealing with missing values

In [12]:
# Identifying the missing values

null_value_counts=data_clean.isnull().sum()

null_value_counts[null_value_counts>0]

emp_length              1036
title                     11
revol_util                50
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64

As we can see above, emp_length and pub_rec_bankruptcies contain relatively more missing values than the other columns. emp_length is the employment tenure of the borrower and is a significant variable when assessing loan credibility, so we will keep it.
We will inspect the pub_rec_bankruptcies column further. For the remaining columns, the number of missing values is low, so we will drop the rows that contain missing values rather than dropping the columns.
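To put the counts above in proportion, they can be divided by the number of rows left after filtering. The sketch below does this using only the numbers already printed in this notebook. On the actual DataFrame, `data_clean.isnull().mean()` would give the same shares directly.

```python
# Rows remaining after filtering, from data_clean.shape above
n_rows = 38770

# Missing-value counts copied from the output above
null_counts = {
    "emp_length": 1036,
    "title": 11,
    "revol_util": 50,
    "last_credit_pull_d": 2,
    "pub_rec_bankruptcies": 697,
}

missing_share = {col: n / n_rows for col, n in null_counts.items()}

for col, share in missing_share.items():
    print(f"{col}: {share:.2%}")
```

Even the two largest gaps affect under 3% of the rows, which is why dropping rows is a reasonable option here.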

In [13]:
data_clean["pub_rec_bankruptcies"].value_counts(normalize=True,dropna=False)

0.0    0.939438
1.0    0.042456
NaN    0.017978
2.0    0.000129
Name: pub_rec_bankruptcies, dtype: float64

As we can see above, the pub_rec_bankruptcies feature takes a single value (0) in ~94% of the rows. Since this column adds little information and also has missing values, we will drop it.

In [14]:
# Drop the pub_rec_bankruptcies column
data_clean.drop("pub_rec_bankruptcies",axis=1,inplace=True)

# Drop rows with missing values
data_clean.dropna(inplace=True)