# Data Cleaning

## Date: OCT 10, 2023

---------------------------------------------


## Introduction

This notebook cleans the LendingClub accepted-loans data. Because the dataset is large, the CSV is read in chunks and a random sample is taken from each chunk. Only fully paid and charged-off or defaulted loans are sampled, because loans that were still in progress when the data was collected have no final outcome and so hold no value for classifying the target variable. Loading a random sample of the relevant loans is more efficient than loading the entire dataset, while keeping the distribution approximately the same. The samples are then merged and used as the working dataset for the rest of the project. Unnecessary and leaky features are removed, and the remaining features are formatted and have their null values removed. Finally, the dataframe's size is reduced as much as possible and the result is exported as two files: one for the models and one for EDA.

### Table-of-contents

1. [Introduction](#Introduction)
   - [Table-of-contents](#Table-of-contents)
   - [Import-Librarys](#Import-Librarys)
   - [Data Dictionary](#Data-Dictionary)
   - [Define-Functions](#Define-Functions)
   - [Load in the data](#Load-the-data)
2. [Data Cleaning](#Cleaning)
   - [Initial Exploration](#Initial-Exploration)
   - [Feature Pruning](#Feature-Pruning)
   - [Feature Engineering](#Feature-Engineering)
   - [Dataframe Null Values](#Dataframe-Null-Values)
3. [Dataframe optimization](#Dataframe-optimization)
4. [Feature Engineering](#Feature-Engineering)
5. [Conclusion](#Conclusion)


### Import-Librarys

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import pdcast as pdc

from pathlib import Path

### Data-Dictionary

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [3]:
try:
    data_destination = Path('../Data/Lending Club Data Dictionary Approved.csv')
    dict_df = pd.read_csv(data_destination, encoding='ISO-8859-1')
    display(dict_df.iloc[:,0:2])
except FileNotFoundError as e:
    print(e.args[1])
    print('Check file location')

Unnamed: 0,LoanStatNew,Description
0,acc_now_delinq,The number of accounts on which the borrower is now delinquent.
1,acc_open_past_24mths,Number of trades opened in past 24 months.
2,addr_state,The state provided by the borrower in the loan application
3,all_util,Balance to credit limit on all trades
4,annual_inc,The self-reported annual income provided by the borrower during registration.
5,annual_inc_joint,The combined self-reported annual income provided by the co-borrowers during registration
6,application_type,Indicates whether the loan is an individual application or a joint application with two co-borrowers
7,avg_cur_bal,Average current balance of all accounts
8,bc_open_to_buy,Total open to buy on revolving bankcards.
9,bc_util,Ratio of total current balance to high credit/credit limit for all bankcard accounts.


#### Define-Functions

In [4]:
def map_emp_length(employment_length:str):
    '''
    Takes in employment length and returns a number for mapping

    :param employment_length: The employment length to be mapped
    :type employment_length: str

    :return: The number the employment length should be mapped to
    :type return: int or float
    '''
    # Leave missing or non-string values (e.g. NaN) untouched
    if not isinstance(employment_length, str):
        return employment_length
    if employment_length == '< 1 year':
        return 0.5
    elif employment_length == '10+ years':
        return 10
    # 'year' also matches 'years'
    elif 'year' in employment_length:
        return int(employment_length.split()[0])
    elif employment_length == '0':
        return 0
    else:
        return employment_length

When the dataset was first loaded, Pandas raised a DtypeWarning about mixed datatypes in several columns. Setting `low_memory=False` while reading the CSV in chunks makes Pandas read each chunk in full before inferring the data types. Once the script that scrapes the data dictionary is finished, the data dictionary can be passed in instead of relying on Pandas' type inference. The `mixed_data_types` function is still called as a sanity check.
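
As a rough illustration of passing explicit types instead of relying on inference, the sketch below reads a small in-memory CSV with the `dtype` argument of `pd.read_csv`. The toy data and the `column_dtypes` mapping are assumptions made for this example. They are not the project's real data dictionary.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file. The column names exist in the
# dataset, but the values and dtype choices are illustrative assumptions.
csv_text = "id,term,loan_amnt\n1,36 months,1000\nabc,60 months,2500\n"

# Hypothetical dtype mapping, e.g. one derived from the data dictionary
column_dtypes = {"id": "string", "term": "string", "loan_amnt": "float64"}

# With explicit dtypes, every chunk is parsed the same way
with pd.read_csv(io.StringIO(csv_text), dtype=column_dtypes, chunksize=1) as reader:
    chunks = list(reader)

df = pd.concat(chunks, ignore_index=True)
print(df.dtypes)
```

Because the types are fixed up front, a value such as `abc` in `id` cannot cause one chunk to be parsed as numbers and another as strings.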

In [5]:
def mixed_data_types(df:pd.DataFrame) -> bool:
    '''
    Takes in a dataframe and checks for columns with mixed data types
    If none are found return False, else True
    
    :param df: The dataframe to be checked
    :type df: obj
    :return bool: True if found, false if none were found
    :type return: bool
    '''
    
    #loop through each column
    for column in df:

        #drop NaN first (it is a float and would count as an extra type), then get unique data types
        unique_types = df[column].dropna(inplace=False).apply(type).unique()

        #if there are more than 1 datatype in a column
        if unique_types.size > 1:
            return True
    
    return False
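
To make the check concrete, here is a small self-contained variant run on a toy dataframe. The helper name `columns_with_mixed_types` and the toy data are hypothetical. Unlike the function above, this sketch returns the offending column names, which can help when debugging.

```python
import numpy as np
import pandas as pd

def columns_with_mixed_types(df: pd.DataFrame) -> list:
    """Return the names of columns holding more than one Python type (NaN ignored)."""
    mixed = []
    for column in df:
        # NaN is a float, so drop it before collecting the types present
        types_present = df[column].dropna().apply(type).unique()
        if len(types_present) > 1:
            mixed.append(column)
    return mixed

toy_df = pd.DataFrame({
    "clean": [1.0, 2.0, np.nan],
    "mixed": pd.Series([1, "36 months", np.nan], dtype=object),
})
print(columns_with_mixed_types(toy_df))  # ['mixed']
```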

#### Load the data

Due to the size of the dataset, the data is read and pre-processed in chunks. Each chunk is checked for mixed data types, filtered down to completed loans, and randomly sampled; the samples are then combined. The combined sample is intended to be representative of the whole dataset and is used for the rest of the project.

In [6]:
# Adjust chunk_size depending how much memory you have, and sample size for how large of a final dataset you want
chunk_size = 5*100000
sample_size =  100000
random_state = 11

assert sample_size < chunk_size, f"Cannot take a sample of {sample_size} rows out of {chunk_size} rows"

print(f'Chunk size: {chunk_size} rows')
print(f'Rows to be sampled: {sample_size} rows')

sampled_dataframes = []
try:
    
    # Path to the data. Should be under Data/
    data_destination = Path('../Data/accepted_2007_to_2018Q4.csv')

    # Split the csv into chunks and iterate over each chunk
    # Set low_memory to false to force pandas to load entire columns before guessing data type
    with pd.read_csv(data_destination, chunksize=chunk_size, low_memory = False) as reader:
        for count, chunk in enumerate(reader):
            if mixed_data_types(df=chunk):
                raise Exception("Mixed data types found")

            # List of finished loan statuses
            finished_loan_status = ['Fully Paid',
                                    'Charged Off',
                                    'Does not meet the credit policy. Status:Fully Paid',
                                    'Does not meet the credit policy. Status:Charged Off',
                                    'Default']
                        
            # Filter the dataframe for loans that are finished or null
            filtered_chunk = chunk.loc[chunk['loan_status'].isin(finished_loan_status) | chunk['loan_status'].isnull()]
            
            # Sample the filtered df
            sampled_df = filtered_chunk.sample(n=sample_size, random_state=random_state)
            sampled_dataframes.append(sampled_df)
            
            print(f"{count} sampled dataframe shape: {sampled_df.shape}")
        print('Finished')

except FileNotFoundError as e:
    print(e.args[1])
    print('Check file name and location')
    
except Exception as e:
    # Generic exceptions may carry a single argument, so print the exception itself
    print(e)

Chunk size: 500000 rows
Rows to be sampled: 100000 rows
0 sampled dataframe shape: (100000, 151)
1 sampled dataframe shape: (100000, 151)
2 sampled dataframe shape: (100000, 151)
3 sampled dataframe shape: (100000, 151)
4 sampled dataframe shape: (100000, 151)
Finished
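
One edge case worth noting: `DataFrame.sample(n=...)` raises a `ValueError` when a filtered chunk holds fewer rows than `n` (with the default `replace=False`). The run above did not hit this, but a smaller chunk size or a stricter filter could. A minimal sketch of a guard, using toy in-memory data and illustrative sizes rather than the real file:

```python
import io
import pandas as pd

# Toy data standing in for the real CSV: 6 loans, read 3 rows at a time
csv_text = (
    "loan_amnt,loan_status\n"
    "1000,Fully Paid\n"
    "2000,Current\n"
    "3000,Charged Off\n"
    "4000,Current\n"
    "5000,Current\n"
    "6000,Fully Paid\n"
)
finished = ["Fully Paid", "Charged Off", "Default"]
sample_size = 2  # rows requested per chunk (illustrative)

samples = []
with pd.read_csv(io.StringIO(csv_text), chunksize=3) as reader:
    for chunk in reader:
        filtered = chunk[chunk["loan_status"].isin(finished)]
        # Never ask for more rows than the filtered chunk holds
        n = min(sample_size, len(filtered))
        samples.append(filtered.sample(n=n, random_state=11))

sample_df = pd.concat(samples)
print(sample_df.shape)  # (3, 2)
```

The second toy chunk contains only one finished loan, so the guard takes one row from it instead of failing.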


No column contains mixed datatypes, so the random samples can be combined into a single sample dataframe. This sample will be used as the working dataset.

In [7]:
sample_accepted_df = pd.concat(sampled_dataframes, ignore_index=False)

&nbsp;

## Cleaning

### Initial Exploration

***Display the first 5 rows*** 

In [8]:
sample_accepted_df.head(5)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_fico_range_low,sec_app_fico_range_high,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
50867,65733661,,23000.0,23000.0,23000.0,36 months,12.05,764.48,C,C1,Owner,3 years,RENT,50000.0,Source Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=65733661,,debt_consolidation,Debt consolidation,331xx,FL,13.9,0.0,Dec-2005,700.0,704.0,1.0,61.0,,8.0,0.0,9976.0,45.6,12.0,w,0.0,0.0,27505.815541,27505.82,23000.0,4505.82,0.0,0.0,0.0,Dec-2018,764.42,,Jan-2019,709.0,705.0,0.0,61.0,1.0,Individual,,,,0.0,0.0,11973.0,,,,,,,,,,,,21900.0,,,,2.0,1710.0,11924.0,45.6,0.0,0.0,110.0,119.0,5.0,5.0,0.0,5.0,74.0,5.0,74.0,2.0,4.0,4.0,4.0,6.0,5.0,5.0,7.0,4.0,8.0,0.0,0.0,0.0,1.0,81.8,0.0,0.0,0.0,28207.0,11973.0,21900.0,6307.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
373353,40959934,,35000.0,35000.0,35000.0,60 months,17.57,880.61,D,D4,Global Account Manager,2 years,RENT,110000.0,Verified,Feb-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=40959934,,debt_consolidation,Debt consolidation,774xx,TX,31.63,0.0,May-1994,660.0,664.0,0.0,38.0,,27.0,0.0,50450.0,75.6,38.0,f,0.0,0.0,51912.964799,51912.96,35000.0,16912.96,0.0,0.0,0.0,Feb-2019,10609.7,,Aug-2017,669.0,665.0,0.0,38.0,1.0,Individual,,,,0.0,541.0,304808.0,,,,,,,,,,,,66700.0,,,,3.0,11289.0,9169.0,81.6,0.0,0.0,138.0,193.0,11.0,3.0,1.0,47.0,,12.0,,0.0,7.0,10.0,8.0,13.0,17.0,14.0,19.0,10.0,27.0,0.0,0.0,0.0,3.0,91.9,50.0,0.0,0.0,336948.0,215616.0,49900.0,134250.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
185185,56007123,,33100.0,33100.0,33100.0,36 months,13.99,1131.12,C,C4,Shipping and Logistics,1 year,MORTGAGE,72000.0,Source Verified,Aug-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=56007123,,debt_consolidation,Debt consolidation,786xx,TX,12.13,0.0,Dec-1994,750.0,754.0,2.0,,,10.0,0.0,16993.0,37.6,29.0,w,0.0,0.0,37798.29,37798.29,33100.0,4698.29,0.0,0.0,0.0,Oct-2016,23215.34,,Oct-2016,704.0,700.0,0.0,,1.0,Individual,,,,0.0,89.0,33172.0,,,,,,,,,,,,45200.0,,,,3.0,3686.0,8802.0,64.8,0.0,0.0,143.0,247.0,10.0,10.0,0.0,43.0,,5.0,,0.0,3.0,4.0,3.0,9.0,7.0,9.0,22.0,4.0,10.0,0.0,0.0,0.0,1.0,100.0,33.3,0.0,0.0,67371.0,33172.0,25000.0,22171.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
33164,66630831,,1000.0,1000.0,1000.0,36 months,11.22,32.85,B,B5,Sergeant,10+ years,RENT,40000.0,Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=66630831,,vacation,Vacation,330xx,FL,18.27,0.0,Dec-2004,695.0,699.0,1.0,,,9.0,0.0,12175.0,39.8,13.0,w,0.0,0.0,1133.624827,1133.62,1000.0,133.62,0.0,0.0,0.0,Jun-2017,375.79,,Mar-2019,514.0,510.0,0.0,,1.0,Individual,,,,0.0,0.0,25333.0,,,,,,,,,,,,30600.0,,,,5.0,2815.0,17325.0,41.3,0.0,0.0,19.0,132.0,1.0,1.0,0.0,1.0,,1.0,,0.0,4.0,4.0,6.0,8.0,2.0,7.0,11.0,4.0,9.0,0.0,0.0,0.0,3.0,100.0,16.7,0.0,0.0,46125.0,25333.0,29500.0,15525.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
368586,41099404,,4400.0,4400.0,4400.0,36 months,12.29,146.76,C,C1,Housekeeping Lead / Purchasing,3 years,RENT,34000.0,Verified,Feb-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=41099404,,debt_consolidation,Debt consolidation,577xx,SD,8.86,0.0,Oct-2000,690.0,694.0,0.0,,,6.0,0.0,10915.0,78.0,17.0,f,0.0,0.0,5280.063823,5280.06,4400.0,880.06,0.0,0.0,0.0,Feb-2018,146.46,,Feb-2018,674.0,670.0,0.0,,1.0,Individual,,,,0.0,0.0,10915.0,,,,,,,,,,,,14000.0,,,,0.0,2183.0,1985.0,84.6,0.0,0.0,123.0,172.0,29.0,29.0,0.0,33.0,,,,0.0,4.0,4.0,4.0,9.0,3.0,6.0,14.0,4.0,6.0,0.0,0.0,0.0,0.0,100.0,100.0,0.0,0.0,14000.0,10915.0,12900.0,0.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,


***Dataframe shape***

In [9]:
rows, columns = sample_accepted_df.shape
print(f'Dataframe rows: {rows}')
print(f'Dataframe columns: {columns}')

Dataframe rows: 500000
Dataframe columns: 151


***Dataframe info***

In [10]:
sample_accepted_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500000 entries, 50867 to 2071622
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 579.8+ MB


Of the 151 columns, 113 are float64 and 38 are objects. The dataframe takes up approximately 580 MB.
Note:
- Every numeric column is float64 and every other column is object. Changing these datatypes later can save memory and decrease computation time.
- There is no datetime column, so the date fields are currently stored as objects.
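
As a hedged sketch of the kind of conversions these notes point to, the example below parses a month-year string column into datetimes, downcasts a float column, and converts a low-cardinality string column to a categorical. It uses plain pandas on toy data; the notebook's own optimization step (which imports `pdcast`) may work differently.

```python
import pandas as pd

# Toy rows modelled on the first few loans shown above
toy_df = pd.DataFrame({
    "loan_amnt": [23000.0, 35000.0, 1000.0],
    "term": ["36 months", "60 months", "36 months"],
    "issue_d": ["Dec-2015", "Feb-2015", "Dec-2015"],
})

# Month-year strings such as 'Dec-2015' parse with an explicit format
toy_df["issue_d"] = pd.to_datetime(toy_df["issue_d"], format="%b-%Y")

# Downcast floats to the smallest float type that holds the values
toy_df["loan_amnt"] = pd.to_numeric(toy_df["loan_amnt"], downcast="float")

# Low-cardinality strings are usually cheaper as categoricals
toy_df["term"] = toy_df["term"].astype("category")

print(toy_df.dtypes)
```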

***Describe Dataframe***

In [11]:
sample_accepted_df.describe()

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,annual_inc_joint,dti_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_fico_range_low,sec_app_fico_range_high,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,deferral_term,hardship_amount,hardship_length,hardship_dpd,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,settlement_amount,settlement_percentage,settlement_term
count,0.0,499979.0,499979.0,499979.0,499979.0,499979.0,499977.0,499775.0,499968.0,499979.0,499979.0,499968.0,249429.0,87595.0,499968.0,499968.0,499979.0,499589.0,499968.0,499979.0,499979.0,499979.0,499979.0,499979.0,499979.0,499979.0,499979.0,499979.0,499979.0,499979.0,499979.0,499937.0,132999.0,499979.0,13712.0,13710.0,499968.0,478686.0,478686.0,271247.0,271247.0,271247.0,271247.0,264099.0,271247.0,234897.0,271247.0,271247.0,271247.0,271221.0,478686.0,271247.0,271247.0,271247.0,484924.0,478677.0,479425.0,479124.0,499937.0,499968.0,464881.0,478686.0,478686.0,478686.0,484924.0,479800.0,118418.0,438581.0,167810.0,478686.0,478686.0,478686.0,482303.0,478686.0,478686.0,478686.0,478686.0,478686.0,482303.0,458884.0,478686.0,478686.0,478686.0,478636.0,479306.0,499550.0,499949.0,478686.0,484924.0,484924.0,478686.0,10132.0,10133.0,10133.0,10133.0,10133.0,10133.0,9942.0,10133.0,10133.0,10133.0,10133.0,3621.0,2441.0,2441.0,2441.0,2441.0,1529.0,2441.0,2441.0,12888.0,12888.0,12888.0
mean,,14351.216601,14344.161255,14320.336357,13.376596,437.486043,76710.67,18.341618,0.320655,696.66044,700.660592,0.649706,34.321911,70.880358,11.586888,0.220104,16012.36,50.620056,24.792857,0.410202,0.410202,14464.588357,14439.874481,11969.671534,2247.591116,1.663389,245.662322,41.082691,5806.612793,676.828471,659.151394,0.018194,43.85527,1.0,117149.7,18.925354,0.005094,259.1667,142034.6,1.050213,2.793074,0.77863,1.75578,19.387608,35820.62,71.446042,1.423743,3.000737,5477.764517,58.046121,32983.95,1.099006,1.604361,2.316269,4.769896,13615.051745,10612.631953,58.495924,0.008885,18.332135,125.405467,180.279047,13.056344,7.752449,1.640129,23.542705,39.396274,6.661271,35.670908,0.518923,3.606805,5.572258,4.723707,7.946612,8.606907,8.238693,14.396429,5.512516,11.616187,0.000843,0.003411,0.089443,2.213589,94.093066,43.502846,0.139169,0.052663,175919.8,49903.54,21847.91,42729.52,31332.992696,665.931116,669.931116,0.753873,1.668114,11.328531,57.102826,2.973354,12.872298,0.064542,0.086944,36.460922,3.0,160.790098,3.0,14.01188,448.05932,11888.327054,193.17474,5138.769217,47.9167,13.879112
std,,8848.659132,8845.450081,8849.887219,4.909537,266.890831,74964.99,12.129001,0.889171,32.29301,32.293711,0.944267,21.902428,26.328297,5.528704,0.615178,22076.14,24.71912,12.026529,86.913731,86.913731,10352.301958,10351.93059,8988.21826,2483.88337,11.19051,955.659036,166.878407,7383.346534,81.355583,134.408802,0.151789,21.410147,0.0,60081.0,7.698941,0.078768,13508.88,158159.3,1.210044,2.983575,0.996272,1.690883,24.543063,42657.31,22.865883,1.578954,2.70155,5572.388855,20.989683,33585.04,1.59027,2.818583,2.54497,3.233913,16528.370021,15793.251067,28.656971,0.106895,964.250763,52.447051,94.777343,16.32297,8.600967,1.963966,30.709759,22.716797,5.828987,22.42478,1.354475,2.255307,3.305085,2.978037,4.768101,7.399445,4.597487,8.106795,3.206907,5.533984,0.030493,0.062124,0.51575,1.874267,8.854968,36.107364,0.38102,0.415262,178339.6,47837.23,21811.59,43491.52,26591.13827,46.076296,46.076296,1.090739,1.859156,6.55936,26.373974,3.223556,8.40273,0.484846,0.445223,24.279961,0.0,137.988481,0.0,9.765879,383.737777,7883.114894,208.897422,3734.172876,6.775596,8.075571
min,,500.0,500.0,0.0,5.31,4.93,0.0,0.0,0.0,610.0,614.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-5.1e-09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,11000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,540.0,544.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.47,3.0,0.0,4.41,174.15,0.01,107.0,0.45,0.0
25%,,7500.0,7500.0,7500.0,9.8,243.75,46000.0,11.76,0.0,670.0,674.0,0.0,16.0,54.0,8.0,0.0,5749.0,31.9,16.0,0.0,0.0,6513.171022,6494.55,5000.0,690.37,0.0,0.0,0.0,433.54,629.0,625.0,0.0,27.0,1.0,80000.0,13.34,0.0,0.0,29470.25,0.0,1.0,0.0,1.0,6.0,9613.0,58.0,0.0,1.0,2143.0,44.0,14100.0,0.0,0.0,1.0,2.0,3113.0,1576.0,36.1,0.0,0.0,97.0,117.0,4.0,3.0,0.0,6.0,20.0,2.0,17.0,0.0,2.0,3.0,3.0,5.0,4.0,5.0,9.0,3.0,8.0,0.0,0.0,0.0,1.0,91.3,0.0,0.0,0.0,50212.0,20847.0,7800.0,15000.0,13906.0,640.0,644.0,0.0,0.0,7.0,38.0,1.0,7.0,0.0,0.0,14.0,3.0,59.53,3.0,5.0,164.58,5669.43,41.01,2288.5875,45.0,8.0
50%,,12000.0,12000.0,12000.0,12.79,371.59,65000.0,17.62,0.0,690.0,694.0,0.0,31.0,72.0,11.0,0.0,10871.0,50.7,23.0,0.0,0.0,11712.538765,11687.43,10000.0,1456.78,0.0,0.0,0.0,2591.91,694.0,690.0,0.0,44.0,1.0,107000.0,18.54,0.0,0.0,80845.0,1.0,2.0,1.0,1.0,12.0,24058.0,75.0,1.0,2.0,4150.0,59.0,24200.0,1.0,0.0,2.0,4.0,7474.0,4998.0,61.2,0.0,0.0,129.0,163.0,8.0,5.0,1.0,13.0,37.0,5.0,33.0,0.0,3.0,5.0,4.0,7.0,7.0,7.0,13.0,5.0,11.0,0.0,0.0,0.0,2.0,97.9,40.0,0.0,0.0,113814.5,37461.0,15300.0,32244.0,24688.5,665.0,669.0,0.0,1.0,10.0,59.05,2.0,11.0,0.0,0.0,35.0,3.0,118.5,3.0,16.0,339.36,10403.83,123.01,4251.5,45.0,16.0
75%,,20000.0,20000.0,20000.0,16.02,581.09,91000.0,24.1,0.0,715.0,719.0,1.0,50.0,90.0,14.0,0.0,19376.0,69.7,32.0,0.0,0.0,20096.100096,20072.905,16925.0,2851.34,0.0,0.0,0.0,8861.67,734.0,730.0,0.0,61.0,1.0,140000.0,24.19,0.0,0.0,212503.8,2.0,3.0,1.0,3.0,22.0,46719.5,87.0,2.0,4.0,7109.0,73.0,40900.0,2.0,2.0,3.0,6.0,18877.0,13007.0,83.8,0.0,0.0,152.0,229.0,16.0,10.0,3.0,28.0,58.0,10.0,52.0,0.0,5.0,7.0,6.0,10.0,11.0,10.0,19.0,7.0,14.0,0.0,0.0,0.0,3.0,100.0,75.0,0.0,0.0,254610.8,63007.0,28400.0,57499.0,40304.0,690.0,694.0,1.0,3.0,15.0,78.4,4.0,17.0,0.0,0.0,56.0,3.0,219.63,3.0,22.0,615.18,16484.96,277.33,7017.6025,50.0,18.0
max,,40000.0,40000.0,40000.0,30.99,1719.83,9550000.0,999.0,30.0,845.0,850.0,32.0,202.0,124.0,90.0,86.0,1743266.0,366.6,169.0,28222.01,28222.01,63296.877917,63296.88,40000.0,28192.5,1188.83,39444.37,6687.6228,42192.05,850.0,845.0,14.0,202.0,1.0,1837000.0,69.49,14.0,9152545.0,8000078.0,17.0,48.0,25.0,51.0,511.0,1547285.0,353.0,26.0,53.0,776843.0,184.0,2013133.0,48.0,79.0,49.0,56.0,958084.0,559912.0,339.6,8.0,249925.0,649.0,818.0,438.0,314.0,37.0,639.0,202.0,25.0,202.0,51.0,33.0,51.0,63.0,70.0,138.0,83.0,128.0,44.0,90.0,3.0,4.0,30.0,29.0,100.0,100.0,8.0,85.0,9999999.0,1924200.0,1105500.0,2000000.0,357135.0,840.0,844.0,6.0,18.0,82.0,182.5,35.0,92.0,16.0,16.0,132.0,3.0,943.94,3.0,32.0,2267.28,39542.45,1407.86,27850.0,184.36,118.0


Some key points:

- Loan Amount
  
    - The average loan amount is ~14,350 USD with a standard deviation of ~8,850 USD, a maximum of 40,000 USD and a minimum of 500 USD. This follows LendingClub's policies for minimum and maximum loan amounts.

- Funded amount
    - Nearly identical to the loan amount

- Funded amount by investors
    - Very similar to the funded amount

- Interest Rate
    - The interest rates are quite high, with an average of 13.4%, a minimum of 5.3% and a maximum of 31%.


   

***Null Values***

Some rows contain only NaN values apart from the `id`, which would cause issues when inspecting each column later. We will drop `id` and these NaN rows, along with other irrelevant columns, including:  
- member_id
- url for the loan
- LC policy code
- title (information is already found under purpose)
- initial_list_status (what market it was listed under)

Define a list to keep track of the columns we have dropped.

In [12]:
dropped_columns = []

In [13]:
drop_columns=['id', 'member_id', 'url', 'policy_code', 'title', 'initial_list_status']

# Append the columns to drop
dropped_columns.extend(drop_columns)

sample_accepted_df.drop(columns=drop_columns, inplace=True)

In [14]:
null_rows = sample_accepted_df.isnull().all(axis=1).sum()
print(f"Number of Null rows: {null_rows}")

Number of Null rows: 21


In [15]:
# Drop rows that are all Nan
sample_accepted_df.dropna(how='all', inplace=True)

In [16]:
null_rows = sample_accepted_df.isnull().all(axis=1).sum()
print(f"Number of Null rows: {null_rows}")

Number of Null rows: 0


&nbsp;

---------------------------------------------

### Feature Pruning

Exclude any leaky features, irrelevant features, and any features that were not present in the original loan application.

#### ***Irrelevant columns***

***Secondary Applicants Information***

The columns for the secondary applicants are largely nulls, so we will drop them.

In [17]:
nulls_percent = (sample_accepted_df['sec_app_mort_acc'].isnull().sum()/sample_accepted_df.shape[0]*100)
print('Percentage of null rows for secondary applicants: ', nulls_percent.round(2), '%')

Percentage of null rows for secondary applicants:  97.97 %
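
The check above looks at a single column. A short sketch of the same calculation for every secondary-applicant column at once, on toy data (the values are made up; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0],
    "sec_app_mort_acc": [np.nan, np.nan, np.nan, 1.0],
    "sec_app_open_acc": [np.nan, np.nan, np.nan, 5.0],
})

# isnull().mean() gives the fraction of nulls per column
null_pct = toy_df.isnull().mean().mul(100).round(2)

# Restrict the report to the secondary-applicant columns
sec_app_cols = [c for c in toy_df.columns if c.startswith("sec_app_")]
print(null_pct[sec_app_cols])
```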


In [18]:
sample_accepted_df['application_type'].value_counts()

application_type
Individual    486267
Joint App      13712
Name: count, dtype: int64

In [19]:
# Get the index of the loans where application_type is a joint application
rows_to_remove = sample_accepted_df.loc[sample_accepted_df['application_type'] == 'Joint App'].index

# Drop the loans
sample_accepted_df.drop(rows_to_remove, inplace=True)

# Drop the related columns
drop_columns = ['revol_bal_joint', 'sec_app_fico_range_low', 
                'sec_app_fico_range_high', 'sec_app_earliest_cr_line',
                'sec_app_inq_last_6mths', 'sec_app_mort_acc',
                'sec_app_open_acc', 'sec_app_revol_util', 
                'sec_app_open_act_il', 'sec_app_num_rev_accts', 
                'sec_app_chargeoff_within_12_mths', 'sec_app_collections_12_mths_ex_med',
                'sec_app_mths_since_last_major_derog',
                'verification_status_joint', 'dti_joint',
                'annual_inc_joint']

# Append the columns to drop
dropped_columns.extend(drop_columns)
sample_accepted_df.drop(columns=drop_columns, inplace=True)

The `application_type` flag is kept for now since it has no nulls, and joint applications are typically used when the primary applicant has a poor or limited credit history. Note, however, that with the joint applications removed above, every remaining loan is `Individual`, so the flag no longer varies within this sample.
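
Row filters such as the joint-application removal above can leave a column with a single remaining value. A minimal sketch of how such constant columns could be detected, using toy data:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "application_type": ["Individual", "Joint App", "Individual"],
    "loan_amnt": [1000.0, 2000.0, 3000.0],
})

# Mimic the row filter: keep only individual applications
toy_df = toy_df[toy_df["application_type"] != "Joint App"]

# A column with a single distinct value carries no information for a model
n_unique = toy_df.nunique(dropna=False)
constant_columns = n_unique[n_unique == 1].index.tolist()
print(constant_columns)  # ['application_type']
```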

***Hardship Loans***

Hardship loans add 15 columns of complexity, are largely nulls, and leak the loan outcome. We will drop these columns and loans if they exist in our dataset, and limit our analysis to non-hardship loans.

In [20]:
# Fetch the value counts for the hardship flags
hardships = sample_accepted_df['hardship_flag'].value_counts()
display(hardships)

# If there are loans with the yes hardship flag
if 'Y' in hardships:
    #get the count of hardship loans by label rather than by position
    yes_hardship_count = hardships['Y']
    print(f'The hardship loans represent only {(yes_hardship_count/sample_accepted_df.shape[0])*100}% of the dataset')

    #get the index of the hardship loans
    rows_to_remove = sample_accepted_df.loc[sample_accepted_df['hardship_flag'] == 'Y'].index

    #drop the loans
    sample_accepted_df.drop(rows_to_remove, inplace=True)

    #check the rows have been dropped
    assert sample_accepted_df['hardship_flag'].value_counts().shape[0] == 1
    print('Hardship loans have been dropped')

else:
    print('There are no hardship loans.')
    
drop_columns = ['hardship_flag', 'hardship_type',
                'hardship_reason', 'hardship_status',
                'hardship_amount', 'hardship_start_date',
                'hardship_end_date', 'deferral_term',
                'hardship_length', 'hardship_dpd',
                'hardship_loan_status', 'payment_plan_start_date',
                'orig_projected_additional_accrued_interest', 'hardship_payoff_balance_amount',
                'hardship_last_payment_amount']

# Append the columns to drop
dropped_columns.extend(drop_columns)
sample_accepted_df.drop(columns = drop_columns, inplace=True)
print('Hardship columns have been dropped')

hardship_flag
N    486267
Name: count, dtype: int64

There are no hardship loans.
Hardship columns have been dropped


***Employment Title***

In [21]:
unique_emp_titles = sample_accepted_df['emp_title'].nunique()
print(f'Number of unique employment titles: {unique_emp_titles}')

Number of unique employment titles: 160993


There are too many unique employment titles to attempt any sort of grouping or encoding for now. In the future we could use NLP or an external API to group the employment titles.
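
For reference, a much simpler alternative to NLP is frequency-based bucketing: normalise the titles, keep the most common ones, and label the rest `other`. This is only a sketch of that swapped-in technique on toy data, not what the notebook does; `top_k` is an arbitrary illustrative choice.

```python
import pandas as pd

titles = pd.Series(["Teacher", "teacher ", "Owner", "Sergeant", "OWNER", "owner", None])

# Normalise case and whitespace so trivially different spellings collapse
normalised = titles.str.strip().str.lower()

# Keep the top_k most frequent titles
top_k = 2
frequent = normalised.value_counts().nlargest(top_k).index

# Replace rarer titles with 'other', leaving missing values missing
grouped = normalised.where(normalised.isin(frequent) | normalised.isna(), "other")
print(grouped.value_counts())
```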

In [22]:
# Append the columns to drop
dropped_columns.append('emp_title')

sample_accepted_df.drop(columns = 'emp_title', inplace=True)

***Loan Status***

Any current loans have already been dropped when reading in the data. We can now finish grouping the completed loans.

More information on the loan statuses can be found here:  
https://www.lendingclub.com/help/investing-faq/what-do-the-different-note-statuses-mean  
https://www.fintechnexus.com/policy-code-2-loans-lending-club/

In [23]:
sample_accepted_df['loan_status'].value_counts()

loan_status
Fully Paid                                             385260
Charged Off                                            100177
Does not meet the credit policy. Status:Fully Paid        582
Does not meet the credit policy. Status:Charged Off       230
Default                                                    18
Name: count, dtype: int64

"Does not meet the credit policy" means the loan was made under an earlier credit policy and does not meet the current one. This has no effect on the loans themselves, so these loans can be grouped with their counterparts. Charged off and defaulted loans can also be grouped together.

In [24]:
status_mapping = {
    "Fully Paid": "Fully Paid",
    "Does not meet the credit policy. Status:Fully Paid": "Fully Paid",
    "Does not meet the credit policy. Status:Charged Off": "Charged Off/Default",
    "Charged Off": "Charged Off/Default",
    "Default": "Charged Off/Default",
}

# Map the loans
sample_accepted_df['loan_status'] = sample_accepted_df['loan_status'].map(status_mapping)

Check the mapping has worked:

In [25]:
sample_accepted_df['loan_status'].value_counts()

loan_status
Fully Paid             385842
Charged Off/Default    100425
Name: count, dtype: int64
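
A caveat when using `Series.map` with a dict: any value missing from the dict becomes NaN. The counts above show every status was covered here, but a quick check like the following sketch can catch gaps. The toy series deliberately includes a status outside the (shortened) mapping.

```python
import pandas as pd

# Shortened version of the mapping, for illustration only
status_mapping = {
    "Fully Paid": "Fully Paid",
    "Charged Off": "Charged Off/Default",
    "Default": "Charged Off/Default",
}

statuses = pd.Series(["Fully Paid", "Default", "Late (31-120 days)"])
mapped = statuses.map(status_mapping)

# Values missing from the dict become NaN after .map(), so list them
unmapped = statuses[mapped.isna() & statuses.notna()].unique().tolist()
print(unmapped)  # ['Late (31-120 days)']
```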

***State / Zip Code***

We have two geographical features. We will drop both of them later, as they add too much complexity to the model. In the future, however, we could use a third-party API to introduce mean or median income data by region, allowing us to capture some of that geographical information. These features are kept for now for the sake of EDA and will be dropped afterwards.
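
A rough sketch of what such an enrichment could look like, assuming a lookup table keyed by state is available. The `state_income` table and its `median_income` values are placeholders made up for this example. They are not real income figures.

```python
import pandas as pd

loans = pd.DataFrame({
    "addr_state": ["FL", "TX", "TX", "SD"],
    "loan_amnt": [23000.0, 35000.0, 33100.0, 4400.0],
})

# Placeholder lookup table: the income values are made up for illustration
state_income = pd.DataFrame({
    "addr_state": ["FL", "TX"],
    "median_income": [1.0, 2.0],
})

# Left join keeps every loan; states missing from the lookup get NaN.
# validate guards against duplicate states in the lookup table.
enriched = loans.merge(state_income, on="addr_state", how="left", validate="many_to_one")
print(enriched)
```

After a join like this, the high-cardinality `addr_state` column could be dropped while a single numeric regional feature remains.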

In [26]:
display(sample_accepted_df['addr_state'].value_counts())
print('-'*20)
display(sample_accepted_df['zip_code'].value_counts())

addr_state
CA    70894
TX    39945
NY    39754
FL    35386
IL    18586
NJ    17365
PA    16388
GA    15735
OH    15706
NC    13744
VA    13507
MI    12848
AZ    11935
MD    11289
MA    11202
CO    10749
WA    10323
MN     8614
IN     7935
MO     7623
NV     7459
TN     7451
CT     7115
WI     6190
AL     5995
OR     5832
SC     5721
LA     5514
KY     4678
OK     4327
KS     3927
AR     3622
UT     3518
NM     2602
HI     2475
MS     2468
NH     2406
RI     2101
NE     1498
WV     1495
DE     1418
MT     1359
DC     1249
AK     1096
WY     1000
SD      975
VT      955
ME      810
ID      788
ND      691
IA        4
Name: count, dtype: int64

--------------------


zip_code
945xx    5448
750xx    5246
112xx    4962
606xx    4470
300xx    4438
331xx    4105
891xx    3963
070xx    3848
900xx    3835
770xx    3810
100xx    3766
330xx    3738
104xx    3526
917xx    3502
117xx    3364
852xx    3179
921xx    3023
925xx    2966
913xx    2868
926xx    2833
334xx    2701
481xx    2597
853xx    2546
956xx    2466
601xx    2457
760xx    2432
604xx    2429
113xx    2392
080xx    2373
021xx    2358
928xx    2349
773xx    2342
802xx    2335
301xx    2320
600xx    2304
207xx    2304
920xx    2302
850xx    2252
980xx    2242
774xx    2160
201xx    2115
782xx    2106
923xx    2100
480xx    2034
114xx    2022
902xx    2009
800xx    2008
951xx    1981
212xx    1962
941xx    1956
554xx    1932
333xx    1924
088xx    1903
275xx    1897
953xx    1893
328xx    1886
553xx    1882
327xx    1855
752xx    1845
775xx    1827
840xx    1819
958xx    1795
786xx    1778
605xx    1755
890xx    1750
967xx    1742
787xx    1740
303xx    1730
302xx    1713
940xx    1686
282xx    16

In [27]:
#drop_columns = ['addr_state', 'zip_code']

# append the columns to drop
#dropped_columns.extend(drop_columns)
#sample_accepted_df.drop(columns = drop_columns, inplace=True)

***Description***

In [28]:
unique_desc_titles = sample_accepted_df['desc'].nunique()
print(f'Number of unique descriptions: {unique_desc_titles}')

Number of unique descriptions: 38325


There are too many unique descriptions to create dummy variables, so we can drop this column.

In [29]:
drop_columns = ['desc']

# Append the columns to drop
dropped_columns.extend(drop_columns)
sample_accepted_df.drop(columns = drop_columns, inplace=True)

#### Leaky columns

Remove any columns that can leak the outcome of the application, i.e., any data that originates after a loan has been funded or rejected.

In [30]:
print('Columns dropped so far: ')
print(dropped_columns)

Columns dropped so far: 
['id', 'member_id', 'url', 'policy_code', 'title', 'initial_list_status', 'revol_bal_joint', 'sec_app_fico_range_low', 'sec_app_fico_range_high', 'sec_app_earliest_cr_line', 'sec_app_inq_last_6mths', 'sec_app_mort_acc', 'sec_app_open_acc', 'sec_app_revol_util', 'sec_app_open_act_il', 'sec_app_num_rev_accts', 'sec_app_chargeoff_within_12_mths', 'sec_app_collections_12_mths_ex_med', 'sec_app_mths_since_last_major_derog', 'verification_status_joint', 'dti_joint', 'annual_inc_joint', 'hardship_flag', 'hardship_type', 'hardship_reason', 'hardship_status', 'hardship_amount', 'hardship_start_date', 'hardship_end_date', 'deferral_term', 'hardship_length', 'hardship_dpd', 'hardship_loan_status', 'payment_plan_start_date', 'orig_projected_additional_accrued_interest', 'hardship_payoff_balance_amount', 'hardship_last_payment_amount', 'emp_title', 'desc']


***Loan Grade***

Loan grade is calculated after the loan is given, so we can drop both `grade` and `sub_grade`.

In [31]:
drop_columns = ['grade','sub_grade']

# Append the columns to drop
dropped_columns.extend(drop_columns)
sample_accepted_df.drop(columns=drop_columns, inplace=True)

***Other features to drop***

We can remove any columns that:  
- describe payments made toward the loan

In [32]:
drop_columns =  ['total_pymnt', 'total_rec_prncp',
                 'total_rec_int', 'total_rec_late_fee',
                 'last_pymnt_d', 'last_pymnt_amnt', 
                 'next_pymnt_d', 'total_pymnt_inv']

# Append the columns to drop
dropped_columns.extend(drop_columns)
sample_accepted_df.drop(columns =drop_columns, inplace=True)

- describe debt collection or recovery

In [33]:
# 'collection_recovery_fee' was listed twice; each column only needs to appear once
drop_columns = ['collection_recovery_fee', 'recoveries']

dropped_columns.extend(drop_columns)
sample_accepted_df.drop(columns =drop_columns, inplace=True)

- loan attributes post acceptance

In [34]:
drop_columns=['out_prncp', 'out_prncp_inv',
              'pymnt_plan', 'disbursement_method',
              'last_credit_pull_d',
              'debt_settlement_flag_date', 'settlement_term',
              'num_tl_120dpd_2m', 'num_tl_30dpd']

dropped_columns.extend(drop_columns)
sample_accepted_df.drop(columns=drop_columns, inplace=True)

- any settlement information

In [35]:
drop_columns=['debt_settlement_flag', 'settlement_status',
              'settlement_date', 'settlement_amount',
              'settlement_percentage']

dropped_columns.extend(drop_columns)
sample_accepted_df.drop(columns=drop_columns, inplace=True)

- other columns

In [36]:
drop_columns = ['max_bal_bc', 'open_rv_24m',
                'open_rv_12m', 'inq_fi',
                'total_bal_il', 'inq_last_12m',
                'open_il_24m', 'open_il_12m',
                'open_act_il', 'total_cu_tl',
                'open_acc_6m', 'il_util','mths_since_rcnt_il',
                'all_util']

dropped_columns.extend(drop_columns)
sample_accepted_df.drop(columns=drop_columns, inplace=True)
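
The cells above repeat the same two steps: extend `dropped_columns`, then drop from the dataframe. A small helper could wrap that pattern. This is only a sketch; `drop_and_record` is a hypothetical name and is not part of the notebook's Define-Functions section.

```python
import pandas as pd

def drop_and_record(df, columns, dropped_log):
    """Drop `columns` from `df` in place and append them to `dropped_log`."""
    # De-duplicate while preserving order so a repeated name is handled once.
    unique_columns = list(dict.fromkeys(columns))
    # Only drop columns that are still present, so re-running a cell is harmless.
    present = [col for col in unique_columns if col in df.columns]
    df.drop(columns=present, inplace=True)
    dropped_log.extend(col for col in present if col not in dropped_log)
    return df

# Toy data for illustration only.
toy_df = pd.DataFrame({
    "recoveries": [0.0],
    "collection_recovery_fee": [0.0],
    "loan_amnt": [1000.0],
})
toy_log = []
drop_and_record(
    toy_df,
    ["collection_recovery_fee", "collection_recovery_fee", "recoveries"],
    toy_log,
)
print(toy_log)
print(list(toy_df.columns))
```

Because the helper skips columns that are already gone, running the same cell twice neither raises a `KeyError` nor logs a column a second time.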

### Feature Engineering

***Term***

Convert from str to int

In [37]:
sample_accepted_df['term'].value_counts()

term
 36 months    372096
 60 months    114171
Name: count, dtype: int64

In [38]:
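# Extract the number of months (e.g. " 36 months" -> 36) and convert to int.
# A raw string avoids the invalid-escape warning for \d, and expand=False returns a Series.
sample_accepted_df['term'] = sample_accepted_df['term'].str.extract(r'(\d+)', expand=False).astype('int32')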

In [39]:
sample_accepted_df['term'].value_counts()

term
36    372096
60    114171
Name: count, dtype: int64

***Emp_Length***

Employment length is ordinal, so we map it to a number of years. `10+ years` becomes 10, and `< 1 year` becomes 0.5 so that it stays distinct from both 0 and 1, preserving that information.
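
The `map_emp_length` function applied to this column is defined earlier in the notebook and is not reproduced here. A minimal sketch of a function with the behaviour described above might look like this. The name `map_emp_length_sketch` and its handling of unexpected input are assumptions, and it expects missing values to have been replaced with the string `'0'` first.

```python
import re

def map_emp_length_sketch(value):
    """Map an `emp_length` string such as '3 years' to a number of years."""
    text = str(value).strip()
    # '< 1 year' sits between no employment (0) and one full year (1).
    if text.startswith("<"):
        return 0.5
    # '10+ years', '3 years', '1 year' and the filler '0' all carry one number.
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError(f"Unexpected emp_length value: {value!r}")
    return float(match.group())

for raw in ["10+ years", "< 1 year", "1 year", "0"]:
    print(raw, "->", map_emp_length_sketch(raw))
```

Raising on unrecognised input, rather than returning a default, makes a new category in the raw data visible instead of silently mapping it to a number.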

In [40]:
sample_accepted_df['emp_length'].value_counts()

emp_length
10+ years    161376
2 years       44649
3 years       39222
< 1 year      37298
1 year        32825
5 years       30412
4 years       29333
6 years       22774
8 years       21137
7 years       20708
9 years       18437
Name: count, dtype: int64

We will treat missing values as no employment.

In [41]:
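# Assign the result back instead of calling fillna(inplace=True) on a column selection
sample_accepted_df['emp_length'] = sample_accepted_df['emp_length'].fillna('0')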

Apply the mapping

In [42]:
sample_accepted_df['emp_length'] = sample_accepted_df['emp_length'].apply(map_emp_length)

Check employment length has been updated

In [43]:
sample_accepted_df['emp_length'].value_counts()

emp_length
10.0    161376
2.0      44649
3.0      39222
0.5      37298
1.0      32825
5.0      30412
4.0      29333
0.0      28096
6.0      22774
8.0      21137
7.0      20708
9.0      18437
Name: count, dtype: int64

### Dataframe-Null-Values

------------------------------------------

In [44]:
pd.set_option('display.max_rows', None)

We can calculate the percentage of null values in each column:

In [45]:
(sample_accepted_df.isnull().sum()/sample_accepted_df.shape[0]*100).sort_values(ascending=False)

mths_since_last_record            82.486371
mths_since_recent_bc_dlq          76.261807
mths_since_last_major_derog       73.318156
mths_since_recent_revol_delinq    66.375880
mths_since_last_delinq            50.049870
mths_since_recent_inq             12.350622
mo_sin_old_il_acct                 7.135792
pct_tl_nvr_dlq                     4.389152
avg_cur_bal                        4.380721
total_rev_hi_lim                   4.378870
tot_cur_bal                        4.378870
mo_sin_old_rev_tl_op               4.378870
mo_sin_rcnt_rev_tl_op              4.378870
mo_sin_rcnt_tl                     4.378870
num_accts_ever_120_pd              4.378870
num_actv_bc_tl                     4.378870
num_actv_rev_tl                    4.378870
num_bc_tl                          4.378870
num_il_tl                          4.378870
num_op_rev_tl                      4.378870
num_rev_accts                      4.378870
num_rev_tl_bal_gt_0                4.378870
num_tl_90g_dpd_24m              

Note that the nulls appear to fall into groups of columns that share the same percentage. We will explore these groupings.
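
One way to make such groupings explicit is to collect columns that are null on exactly the same rows. The helper below is a sketch on toy data. `null_groups` is a hypothetical name, and the column names are borrowed from the dataset only for readability.

```python
import numpy as np
import pandas as pd

def null_groups(df):
    """Group columns that are null on exactly the same set of rows."""
    groups = {}
    for col in df.columns:
        mask = df[col].isnull()
        if not mask.any():
            continue  # columns without nulls are not part of any group
        # Columns with identical null masks share one dictionary key.
        key = mask.to_numpy().tobytes()
        groups.setdefault(key, []).append(col)
    # Largest groups first.
    return sorted(groups.values(), key=len, reverse=True)

toy = pd.DataFrame({
    "tot_cur_bal": [np.nan, 10.0, np.nan, 5.0],
    "avg_cur_bal": [np.nan, 2.0, np.nan, 1.0],
    "bc_util": [50.0, np.nan, 20.0, 30.0],
    "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0],
})
groups = null_groups(toy)
print(groups)
```

Columns that land in the same group are missing together, which suggests a shared cause rather than independent gaps.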


***Explore the groupings of nulls***

We will drop the loans that have nulls in columns with less than 3% nulls. With a dataset this size, losing a few loans won't affect our analysis. Furthermore, the majority of the features within these rows are null. There is a noticeable gap at 3%, which is why we choose it as our cutoff.

In [46]:
cutoff = 3 # Percent

# Get the percentages of nulls for each column 
null_percentages = (sample_accepted_df.isnull().sum() / sample_accepted_df.shape[0]) * 100

# Get the filtered columns
filtered_columns = null_percentages[null_percentages < cutoff].index.tolist()

# Drop the loans with nulls in the filtered columns.
# dropna(inplace=True) returns None, so the result must not be assigned to a variable.
sample_accepted_df.dropna(subset=filtered_columns, inplace=True)

We will also drop any columns with more than 10% nulls.

In [47]:
column_cutoff = 10 # Percent

# Get the percentages of nulls for each column 
null_percentages = (sample_accepted_df.isnull().sum() / sample_accepted_df.shape[0]) * 100

# Get the filtered columns
filtered_columns = null_percentages[null_percentages > column_cutoff].index.tolist()
dropped_columns.extend(filtered_columns)

# Drop the filtered columns
sample_accepted_df.drop(columns=filtered_columns, inplace=True)

This leaves us with the following column nulls:

In [48]:
(sample_accepted_df.isnull().sum()/sample_accepted_df.shape[0]*100).sort_values(ascending=False)

mo_sin_old_il_acct            7.050945
pct_tl_nvr_dlq                4.301507
total_rev_hi_lim              4.291208
num_actv_rev_tl               4.291208
avg_cur_bal                   4.291208
mo_sin_old_rev_tl_op          4.291208
mo_sin_rcnt_rev_tl_op         4.291208
mo_sin_rcnt_tl                4.291208
num_accts_ever_120_pd         4.291208
num_actv_bc_tl                4.291208
num_bc_tl                     4.291208
tot_coll_amt                  4.291208
num_il_tl                     4.291208
num_op_rev_tl                 4.291208
num_rev_accts                 4.291208
num_rev_tl_bal_gt_0           4.291208
num_tl_90g_dpd_24m            4.291208
num_tl_op_past_12m            4.291208
tot_hi_cred_lim               4.291208
tot_cur_bal                   4.291208
total_il_high_credit_limit    4.291208
bc_util                       4.064217
percent_bc_gt_75              4.038469
bc_open_to_buy                4.013957
mths_since_recent_bc          3.945366
num_bc_sats              

We can now work through each grouping, starting with the smallest.

***acc_open_past_24mths***

In [49]:
null_rows = sample_accepted_df[sample_accepted_df['acc_open_past_24mths'].isnull()]
null_rows.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,purpose,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,application_type,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
1631890,4450.0,4450.0,4450.0,36,10.37,144.37,4.0,MORTGAGE,45000.0,Source Verified,Feb-2011,Fully Paid,credit_card,975xx,OR,21.89,0.0,Feb-2005,695.0,699.0,1.0,12.0,0.0,8809.0,66.7,17.0,699.0,695.0,0.0,Individual,0.0,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1618653,14000.0,14000.0,13750.0,36,7.9,438.07,4.0,RENT,82000.0,Source Verified,Sep-2011,Fully Paid,debt_consolidation,073xx,NJ,4.17,0.0,Jan-2000,740.0,744.0,0.0,6.0,0.0,13608.0,56.0,30.0,739.0,735.0,0.0,Individual,0.0,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1642767,6000.0,6000.0,6000.0,36,7.88,187.69,6.0,MORTGAGE,62000.0,Not Verified,Apr-2010,Fully Paid,debt_consolidation,123xx,NY,24.72,0.0,Jul-1995,725.0,729.0,0.0,13.0,0.0,26313.0,52.3,52.0,729.0,725.0,0.0,Individual,0.0,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1629383,15000.0,15000.0,15000.0,60,15.28,359.06,0.0,RENT,30000.0,Verified,Apr-2011,Fully Paid,other,926xx,CA,6.96,0.0,Apr-2004,725.0,729.0,0.0,3.0,0.0,118.0,54.1,13.0,699.0,695.0,0.0,Individual,0.0,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1641557,2400.0,2400.0,2400.0,36,9.88,77.31,4.0,MORTGAGE,69684.0,Not Verified,May-2010,Fully Paid,vacation,951xx,CA,5.31,0.0,Oct-2004,755.0,759.0,0.0,3.0,0.0,1899.0,38.0,9.0,714.0,710.0,0.0,Individual,0.0,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,


In [50]:
null_rows['issue_d'].value_counts()

issue_d
Feb-2012    770
Jan-2012    770
Dec-2011    715
Mar-2012    634
Oct-2011    631
Nov-2011    616
Aug-2011    614
Sep-2011    601
Jul-2011    581
Jun-2011    559
May-2011    532
Apr-2011    455
Mar-2011    451
Jan-2011    434
Dec-2010    422
Feb-2011    398
Aug-2010    365
Oct-2010    359
Sep-2010    355
Jul-2010    343
Nov-2010    326
Jun-2010    325
May-2010    284
Apr-2010    271
Mar-2010    259
Nov-2009    210
Jan-2010    205
Feb-2010    192
Dec-2009    190
Oct-2009    184
Sep-2009    157
Jul-2009    133
Jun-2009    130
May-2009    126
Aug-2009    121
Apr-2009     99
Mar-2009     99
Feb-2009     96
Jan-2009     85
Dec-2008     77
Apr-2008     72
Mar-2008     66
Nov-2008     57
Jul-2008     50
Oct-2008     41
Jun-2008     40
May-2008     33
Aug-2008     28
Sep-2008     17
Apr-2012     15
Feb-2008      4
Dec-2007      2
Jan-2008      1
Name: count, dtype: int64

Notice the issue dates of these loans. Loans made early in Lending Club's history make up the majority of the nulls in the remaining columns. This is in part due to Lending Club frequently updating its API and adding new fields, while loans that were already recorded are filled with NaN values for those fields. Since our analysis is based on the exact combination of features for a loan, it does not make sense to keep these loans, as there is no accurate way to impute the many missing values. We could remove either the associated rows or the affected features. Dropping rows may add some recency bias, as it narrows our analysis to more recent loans that may not have seen as varied economic conditions, among other factors. Given the size of our dataset, we will drop the rows.  
Example:  
https://www.fintechnexus.com/lending-club-adds-15-new-fields-and-folio-introduces-a-true-secondary-market-api/
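
To quantify how strongly the nulls cluster in early loans, the share of nulls can be computed per issue year. The sketch below uses toy rows; the 2014 and 2015 rows are invented for contrast and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "issue_d": ["Feb-2011", "Sep-2011", "Apr-2010", "Mar-2014", "Jan-2015"],
    "acc_open_past_24mths": [np.nan, np.nan, np.nan, 4.0, 2.0],
})

# `issue_d` is stored as text such as "Feb-2011"; parse it to extract the year.
issue_year = pd.to_datetime(toy["issue_d"], format="%b-%Y").dt.year

# The mean of a boolean mask is the share of nulls; scale it to a percentage.
null_pct_by_year = (
    toy["acc_open_past_24mths"].isnull().groupby(issue_year).mean() * 100
)
print(null_pct_by_year)
```

Run against `sample_accepted_df`, a table like this would show at which issue year a field starts being populated.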

In [51]:
sample_accepted_df.dropna(subset=['annual_inc', 'total_acc', 
                                  'tax_liens', 'chargeoff_within_12_mths', 
                                  'pub_rec_bankruptcies', 'total_bal_ex_mort',
                                  'tot_hi_cred_lim', 'avg_cur_bal', 
                                  'pct_tl_nvr_dlq', 'mo_sin_old_il_acct', 'bc_util', 'percent_bc_gt_75'
                                 ], inplace=True)

In [52]:
(sample_accepted_df.isnull().sum()/sample_accepted_df.shape[0]*100).sort_values(ascending=False)

loan_amnt                     0.0
num_bc_sats                   0.0
avg_cur_bal                   0.0
bc_open_to_buy                0.0
bc_util                       0.0
chargeoff_within_12_mths      0.0
delinq_amnt                   0.0
mo_sin_old_il_acct            0.0
mo_sin_old_rev_tl_op          0.0
mo_sin_rcnt_rev_tl_op         0.0
mo_sin_rcnt_tl                0.0
mort_acc                      0.0
mths_since_recent_bc          0.0
num_accts_ever_120_pd         0.0
num_actv_bc_tl                0.0
num_actv_rev_tl               0.0
num_bc_tl                     0.0
funded_amnt                   0.0
num_il_tl                     0.0
num_op_rev_tl                 0.0
num_rev_accts                 0.0
num_rev_tl_bal_gt_0           0.0
num_sats                      0.0
num_tl_90g_dpd_24m            0.0
num_tl_op_past_12m            0.0
pct_tl_nvr_dlq                0.0
percent_bc_gt_75              0.0
pub_rec_bankruptcies          0.0
tax_liens                     0.0
tot_hi_cred_li

We have no more null values

### Dataframe Optimization

The library used to optimize the dataframe (`pdcast`, imported as `pdc`) minimizes integer columns down to int8, which is not supported by parquet files. The code is left in, commented out, for reuse in case someone wants to export as csv instead.

In [53]:
#sample_accepted_df = pdc.downcast(sample_accepted_df)
#print(sample_accepted_df.info())
# Infer minimum schema for DataFrame.
#schema = pdc.infer_schema(sample_accepted_df)
#print(schema)
#sample_accepted_df.shape

Instead, we will simply downcast the datatypes to int32 and float32, as this is more than enough precision for our data.

In [54]:
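# Downcast float64 columns to float32 where possible; pd.to_numeric keeps
# float64 for columns whose values would not survive the conversion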
float_cols = sample_accepted_df.select_dtypes(include='float64').columns
for col in float_cols:
    sample_accepted_df[col] = pd.to_numeric(sample_accepted_df[col], downcast='float')

# Downcast all int columns to int32
int_cols = sample_accepted_df.select_dtypes(include='int64').columns
for col in int_cols:
    sample_accepted_df[col] = sample_accepted_df[col].astype('int32')
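
To see what the downcast buys, memory usage can be compared before and after. The sketch below wraps the same two loops in a hypothetical helper, `downcast_numeric`, and runs it on toy data. It returns a copy instead of modifying the frame in place, which is a choice made for this example only.

```python
import pandas as pd

def downcast_numeric(df):
    """Return a copy with float64 -> float32 (where possible) and int64 -> int32."""
    out = df.copy()
    for col in out.select_dtypes(include="float64").columns:
        # to_numeric only downcasts when the values survive the conversion.
        out[col] = pd.to_numeric(out[col], downcast="float")
    for col in out.select_dtypes(include="int64").columns:
        out[col] = out[col].astype("int32")
    return out

# Toy data for illustration only.
toy = pd.DataFrame({
    "loan_amnt": [1000.0, 2500.0, 4450.0],
    "term": [36, 60, 36],
})
before = toy.memory_usage(deep=True).sum()
small = downcast_numeric(toy)
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"Memory: {before} bytes -> {after} bytes")
```

Casting to `int32` assumes every integer value fits in 32 bits, which holds for a column such as `term` but should be checked for any column with very large values.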


In [55]:
sample_accepted_df.dtypes

loan_amnt                     float32
funded_amnt                   float32
funded_amnt_inv               float64
term                            int32
int_rate                      float32
installment                   float32
emp_length                    float32
home_ownership                 object
annual_inc                    float64
verification_status            object
issue_d                        object
loan_status                    object
purpose                        object
zip_code                       object
addr_state                     object
dti                           float32
delinq_2yrs                   float32
earliest_cr_line               object
fico_range_low                float32
fico_range_high               float32
inq_last_6mths                float32
open_acc                      float32
pub_rec                       float32
revol_bal                     float32
revol_util                    float32
total_acc                     float32
last_fico_ra

### Export Dataframe

***Export the dataframe for EDA***

In [56]:
export_destination = Path('../Cleaned_Data/eda_cleaned')
sample_accepted_df.to_parquet(export_destination)
print('Cleaned data to be used for EDA has been exported')

Cleaned data to be used for EDA has been exported


***Export the dataframe for Models***

Drop any leaky or unwanted columns that were kept for EDA.

In [57]:
# Drop leaky columns / unwanted columns
drop_columns=['funded_amnt', 'funded_amnt_inv', 'fico_range_low', 'fico_range_high', 'last_fico_range_high', 'last_fico_range_low']
dropped_columns.extend(drop_columns)
sample_accepted_df.drop(columns = drop_columns, inplace=True)

# Drop categorical columns with too many categories for one hot encoding
drop_columns=['issue_d', 'earliest_cr_line', 'zip_code', 'addr_state']
dropped_columns.extend(drop_columns)
sample_accepted_df.drop(columns = drop_columns, inplace=True)

print('The final list of columns dropped : ')
print(dropped_columns)

The final list of columns dropped : 
['id', 'member_id', 'url', 'policy_code', 'title', 'initial_list_status', 'revol_bal_joint', 'sec_app_fico_range_low', 'sec_app_fico_range_high', 'sec_app_earliest_cr_line', 'sec_app_inq_last_6mths', 'sec_app_mort_acc', 'sec_app_open_acc', 'sec_app_revol_util', 'sec_app_open_act_il', 'sec_app_num_rev_accts', 'sec_app_chargeoff_within_12_mths', 'sec_app_collections_12_mths_ex_med', 'sec_app_mths_since_last_major_derog', 'verification_status_joint', 'dti_joint', 'annual_inc_joint', 'hardship_flag', 'hardship_type', 'hardship_reason', 'hardship_status', 'hardship_amount', 'hardship_start_date', 'hardship_end_date', 'deferral_term', 'hardship_length', 'hardship_dpd', 'hardship_loan_status', 'payment_plan_start_date', 'orig_projected_additional_accrued_interest', 'hardship_payoff_balance_amount', 'hardship_last_payment_amount', 'emp_title', 'desc', 'grade', 'sub_grade', 'total_pymnt', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'last_pymnt_

Map **Successful loans to 1**, and **Defaulted or Charged Off loans to 0** in our target column.

In [58]:
sample_accepted_df['loan_status'] = sample_accepted_df['loan_status'].apply(lambda x: 1 if x == 'Fully Paid' else 0)
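
The lambda above maps every value other than `'Fully Paid'` to 0, including any unexpected label. A stricter variant, sketched below, uses an explicit dictionary and fails on anything it does not recognise. The label `'Charged Off/Default'` is assumed from the value counts shown earlier in the notebook, and `encode_loan_status` is a hypothetical name.

```python
import pandas as pd

# Assumed labels: these should match the values actually present in `loan_status`.
STATUS_TO_TARGET = {"Fully Paid": 1, "Charged Off/Default": 0}

def encode_loan_status(status):
    """Map loan status labels to 1/0 and fail loudly on unexpected labels."""
    encoded = status.map(STATUS_TO_TARGET)
    # Labels missing from the dictionary (and nulls) come back as NaN.
    unexpected = status[encoded.isnull()].unique()
    if len(unexpected) > 0:
        raise ValueError(f"Unmapped loan_status values: {list(unexpected)}")
    return encoded.astype("int32")

# Toy data for illustration only.
toy_status = pd.Series(
    ["Fully Paid", "Charged Off/Default", "Fully Paid"], name="loan_status"
)
encoded_status = encode_loan_status(toy_status)
print(encoded_status.tolist())
```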

In [59]:
export_destination = Path('../Cleaned_Data/model_cleaned')
sample_accepted_df.to_parquet(export_destination)
print('Cleaned data to be used for modelling has been exported')

Cleaned data to be used for modelling has been exported


### Conclusion

In this notebook, we have completed a rudimentary cleaning of the Lending Club dataset. A random sample of roughly 500,000 rows was taken from the dataset and cleaned. We have dealt with the missing information stemming from changes in Lending Club's API over the years, as well as any other NaN values. Any features that could leak the outcome of the loan, were irrelevant, or added unnecessary complexity have also been dropped. Some rudimentary feature engineering has been conducted, but this will be expanded on later. Finally, the cleaned dataset is written to two parquet files, one for EDA and one for modelling.   

Note:   
- Although the leaky features were carefully reviewed, we will check the feature weights when fitting our baseline logistic regression model to confirm that no leaky features have been kept.
- We have removed the earlier loans in Lending Club's history due to API changes and the large number of null values. This restricts our dataset to a more recent timeframe, which could introduce a recency bias, given how sensitive loans are to economic conditions over this shorter period, as will be shown in EDA.
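
Before any model is fitted, a cheap first pass for leftover leaks is to look for numeric features that correlate almost perfectly with the target. This is a different and cruder technique than inspecting logistic regression weights, and is offered only as a sketch. `flag_suspicious_features`, the 0.9 threshold, and the toy values are all assumptions.

```python
import pandas as pd

def flag_suspicious_features(df, target, threshold=0.9):
    """Return numeric features whose absolute correlation with `target` exceeds `threshold`."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.drop(columns=[target]).corrwith(numeric[target]).abs()
    return corr[corr > threshold].sort_values(ascending=False)

# Toy data for illustration only: `recoveries` is only non-zero for failed loans.
toy = pd.DataFrame({
    "loan_status": [1, 1, 0, 0, 1, 0],
    "recoveries": [0.0, 0.0, 500.0, 450.0, 0.0, 520.0],
    "loan_amnt": [1000.0, 5000.0, 3000.0, 4000.0, 2000.0, 1000.0],
})
flagged = flag_suspicious_features(toy, "loan_status")
print(flagged)
```

A high correlation is a prompt to check when the feature is recorded, not proof of leakage, and a leaky feature with a non-linear relationship to the target can pass this check unnoticed.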

### Resources used:

- https://stackoverflow.com/questions/51325601/how-to-stop-my-pandas-data-table-from-being-truncated-when-printed