Introduction to the data

Lending Club releases data for all of the approved and declined loan applications periodically on their website. You can select a few different year ranges to download the datasets (in CSV format) for both approved and declined loans.

You'll also find a data dictionary (in XLS format) towards the bottom of the page, which contains information on the different column names. We recommend downloading the data dictionary so you can refer to it whenever you want to learn more about what a column represents in the datasets. Here's a link to the data dictionary file hosted on Google Drive.

Before diving into the datasets themselves, let's get familiar with the data dictionary. The LoanStats sheet describes the approved loans datasets and the RejectStats sheet describes the rejected loans datasets. Since rejected applications don't appear on the Lending Club marketplace and aren't available for investment, we'll focus on the approved loans data only.

The approved loans datasets contain information on current loans, completed loans, and defaulted loans. Let's now define the problem statement for this machine learning project:

    Can we build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not?

Before we can start doing machine learning, we need to define which features we want to use and which column represents the target we want to predict. Let's start by reading in the dataset and exploring it.

In [1]:
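import pandas as pd

# Load the pre-cleaned dataset (built with the commented-out steps below).
loans_2007 = pd.read_csv("loans_2007.csv")

# How loans_2007.csv was generated from the raw Lending Club file:
# import pandas as pd
# loans_2007 = pd.read_csv("LoanStats3a.csv", skiprows=1)
# half_count = len(loans_2007) / 2
# # Keep only columns with at least half_count non-null values.
# loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
# loans_2007 = loans_2007.drop(['desc_url'], axis=1)
# loans_2007.to_csv('loans_2007.csv', index=False)

# drop_duplicates() returns a new DataFrame, so assign the result back.
loans_2007 = loans_2007.drop_duplicates()
print(loans_2007.iloc[0])   # first row of the dataset
print(loans_2007.shape[1])  # number of columns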


id                                1077501
member_id                      1.2966e+06
loan_amnt                            5000
funded_amnt                          5000
funded_amnt_inv                      4975
term                            36 months
int_rate                           10.65%
installment                        162.87
grade                                   B
sub_grade                              B2
emp_title                             NaN
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
issue_d                          Dec-2011
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
zip_code                            860xx
addr_state                             AZ
dti                                 27.65
delinq_2yrs                       

The DataFrame contains many columns, and exploring them all at once can be cumbersome. Let's break up the columns into 3 groups of 18 columns and use the data dictionary to become familiar with what each column represents. As you work through each feature, pay attention to any features that:

    leak information from the future (after the loan has already been funded)
    don't affect a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending Club)
    are formatted poorly and need to be cleaned up
    require more data or a lot of processing to turn into a useful feature
    contain redundant information

We need to especially pay attention to data leakage, since it can cause our model to overfit. This is because the model would be using data about the target column that wouldn't be available when we're using the model on future loans. We encourage you to spend as much time as you need to understand each column, because a poor understanding could cause you to make mistakes in the data analysis and modeling process. As you go through the dictionary, keep in mind that we need to select one of the columns as the target column we want to use when we move on to the machine learning phase.

In this screen and the next few screens, let's focus only on the columns that we need to remove from consideration. Then, we can circle back and further dissect the columns we decided to keep.

To make this process easier, we created a table that contains the name, data type, first row's value, and description from the data dictionary for the first 18 columns.
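A table like the one described above could be assembled along the following lines. This is a minimal sketch, not the code that produced the original table: `toy_loans` is a small stand-in for `loans_2007` using values from the first row printed earlier, `preview_table` is a hypothetical helper, and the description strings are placeholders for text you would copy from the LoanStats sheet of the data dictionary.

```python
import pandas as pd

# Toy stand-in for loans_2007 (values taken from the first row shown earlier).
toy_loans = pd.DataFrame({
    "id": [1077501],
    "loan_amnt": [5000],
    "term": ["36 months"],
    "int_rate": ["10.65%"],
})

# Placeholder descriptions; the real text would come from the data dictionary.
descriptions = {
    "id": "(description from the data dictionary)",
    "loan_amnt": "(description from the data dictionary)",
    "term": "(description from the data dictionary)",
    "int_rate": "(description from the data dictionary)",
}

def preview_table(df, descriptions, start=0, stop=18):
    """Return one row per column: name, dtype, first row's value, description."""
    cols = list(df.columns[start:stop])
    return pd.DataFrame({
        "name": cols,
        "dtype": [str(df[c].dtype) for c in cols],
        "first_value": [df[c].iloc[0] for c in cols],
        "description": [descriptions.get(c, "") for c in cols],
    })

table = preview_table(toy_loans, descriptions)
print(table)
```

Calling `preview_table(loans_2007, descriptions, 18, 36)` would then cover the second group of 18 columns, assuming `descriptions` has been filled in for those names.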

In [2]:
loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv",
                              "grade", "sub_grade", "emp_title", "issue_d"], axis=1)

In [3]:
loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", 
                              "total_pymnt_inv", "total_rec_prncp"], axis=1)

In [4]:
loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", 
                              "last_pymnt_d", "last_pymnt_amnt"], axis=1)
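One way to keep the reasoning behind these drops visible is to group the column names by the reason they are removed and drop them in one call. The sketch below runs on a toy frame with made-up values, and the reason labels are our own illustrative reading of the criteria listed earlier, not a mapping given by the data dictionary.

```python
import pandas as pd

# Toy frame with a few of the column names used above (values are made up).
toy_loans = pd.DataFrame({
    "id": [1, 2],
    "loan_amnt": [5000, 2500],
    "funded_amnt": [5000, 2500],
    "grade": ["B", "C"],
    "int_rate": ["10.65%", "15.27%"],
    "total_pymnt": [5861.07, 1008.71],
})

# Assumed grouping of columns by removal reason (labels are illustrative).
columns_to_drop = {
    "identifier": ["id", "member_id"],
    "leaks information from the future": ["funded_amnt", "funded_amnt_inv", "total_pymnt"],
    "redundant": ["grade", "sub_grade"],
}

# Flatten the groups into a single list of column names.
to_drop = [col for cols in columns_to_drop.values() for col in cols]

before = toy_loans.shape[1]
# errors="ignore" skips names that are absent from this toy frame.
toy_loans = toy_loans.drop(columns=to_drop, errors="ignore")
after = toy_loans.shape[1]

print(before, "->", after)
print(list(toy_loans.columns))
```

Printing the column count before and after the drop is a cheap check that nothing unexpected was removed.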

Target column

Just by becoming familiar with the columns in the dataset, we were able to reduce the number of columns from 52 to 32. We now need to decide on a target column that we want to use for modeling.

We should use the loan_status column, since it's the only column that directly describes whether a loan was paid off on time, had delayed payments, or was defaulted on by the borrower. Currently, this column contains text values, which we need to convert to numerical values before training a model. Let's explore the different values in this column and come up with a strategy for converting them.

In [5]:
loans_2007['loan_status'].value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

Let's remove all the loans whose status is neither Fully Paid nor Charged Off, and then transform the Fully Paid values to 1 for the positive case and the Charged Off values to 0 for the negative case.

In [6]:
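# Keep only loans that have reached a final outcome.
loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]

# Encode the target: 1 = Fully Paid (positive case), 0 = Charged Off (negative case).
mapping_dict = {"loan_status": {"Fully Paid": 1, "Charged Off": 0}}
loans_2007 = loans_2007.replace(mapping_dict)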
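The same filter-then-encode step can also be written with `isin` and `Series.map`, which some readers find easier to scan. This is an alternative sketch on a toy frame (`toy_loans` and its values are made up for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Toy data: two statuses we keep and two we discard (values are made up).
toy_loans = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid", "Default"],
    "loan_amnt": [5000, 2500, 2400, 10000, 3000],
})

# Keep only rows whose status is one of the two final outcomes.
keep = toy_loans["loan_status"].isin(["Fully Paid", "Charged Off"])
filtered = toy_loans[keep].copy()

# Map the text labels to 1 (positive case) and 0 (negative case).
filtered["loan_status"] = filtered["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})

print(filtered["loan_status"].tolist())
```

Unlike `replace`, `map` turns any label missing from the dictionary into NaN, so filtering first matters here.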

Removing single value columns

Columns that contain only one unique value (ignoring nulls) won't help in prediction, so let's find and drop them.

In [7]:
drop_columns = []
for col in loans_2007.columns:
    # Ignore nulls so that a column with one real value plus NaN still counts as single-valued.
    non_null = loans_2007[col].dropna()
    unique_non_null = non_null.unique()
    num_true_unique = len(unique_non_null)
    if num_true_unique == 1:
        drop_columns.append(col)
print(drop_columns)
loans_2007 = loans_2007.drop(drop_columns, axis=1)

['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
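The loop above can be condensed with `DataFrame.nunique`, which counts distinct values per column and skips nulls when `dropna=True`. The sketch below uses a toy frame with made-up values; `all_missing` is a hypothetical column added to show that an entirely null column has zero unique values and is therefore not flagged, matching the behavior of the loop.

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up); all_missing is a hypothetical all-null column.
toy_loans = pd.DataFrame({
    "pymnt_plan": ["n", "n", "n"],
    "policy_code": [1.0, np.nan, 1.0],
    "loan_amnt": [5000, 2500, 2400],
    "all_missing": [np.nan, np.nan, np.nan],
})

# Count distinct non-null values in every column.
unique_counts = toy_loans.nunique(dropna=True)

# Columns with exactly one distinct non-null value carry no signal.
single_value_cols = unique_counts[unique_counts == 1].index.tolist()
print(single_value_cols)

toy_loans = toy_loans.drop(columns=single_value_cols)
print(list(toy_loans.columns))
```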
