# Exploratory Data Analysis and Cleaning

## Author: 
## Date: OCT 10, 2023

### Table of contents

### Introduction

This notebook cleans the accepted-loans data. Because the full dataset is large, a 20,000-row sample CSV file is used to keep loading and analysis fast.

### Data Dictionary

- How much data is lost per column due to cleaning
- No data truncation subprocess; load from drive
- Explain why data is missing, etc.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from pathlib import Path

#### Load in the data

Place the sample CSV file in the `../Data/Lending_club/` directory (relative to this notebook) before running the next cell.

In [10]:
# pathlib is used to ensure compatibility across operating systems
data_destination = Path('../Data/Lending_club/sample_accepted_2007_to_2018Q4.csv')

try:
    sample_accepted_df = pd.read_csv(data_destination)
except FileNotFoundError as e:
    # Report the problem and the path that was tried
    print(e)
    print(f'Check file location: {data_destination}')
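
If the full accepted-loans file is available instead of the sample, one possible way to keep loading fast is to read only the first rows with the `nrows` parameter of `pd.read_csv`. This is a minimal sketch, not necessarily how the sample file used here was produced. The helper `load_sample` and the toy CSV are hypothetical names introduced only for illustration.

```python
from io import StringIO

import pandas as pd


def load_sample(source, n_rows=20_000):
    """Read only the first n_rows data rows of a CSV (hypothetical helper)."""
    # low_memory=False makes pandas infer each column's dtype from the whole chunk
    return pd.read_csv(source, nrows=n_rows, low_memory=False)


# Toy stand-in for the full file (assumption: the real file has a header row)
toy_csv = StringIO("id,loan_amnt\n1,3600\n2,24700\n3,20000\n")
toy_sample = load_sample(toy_csv, n_rows=2)
print(toy_sample.shape)
```

Taking the first rows is fast but not random; if the file is sorted (for example by issue date), the sample may not represent the full dataset.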


In [16]:
sample_accepted_df.head(5)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,68407277,,3600.0,3600.0,3600.0,36 months,13.99,123.03,C,C4,...,,,Cash,N,,,,,,
1,68355089,,24700.0,24700.0,24700.0,36 months,11.99,820.28,C,C1,...,,,Cash,N,,,,,,
2,68341763,,20000.0,20000.0,20000.0,60 months,10.78,432.66,B,B4,...,,,Cash,N,,,,,,
3,66310712,,35000.0,35000.0,35000.0,60 months,14.85,829.9,C,C5,...,,,Cash,N,,,,,,
4,68476807,,10400.0,10400.0,10400.0,60 months,22.45,289.91,F,F1,...,,,Cash,N,,,,,,


There are too many columns to analyze at once.

#### Separate Columns by Data Type

In [18]:
numeric_sample_accepted_df = sample_accepted_df.select_dtypes(include=['number'])
numeric_sample_accepted_df

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,...,deferral_term,hardship_amount,hardship_length,hardship_dpd,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,settlement_amount,settlement_percentage,settlement_term
0,68407277,,3600.0,3600.0,3600.0,13.99,123.03,55000.0,5.91,0.0,...,,,,,,,,,,
1,68355089,,24700.0,24700.0,24700.0,11.99,820.28,65000.0,16.06,1.0,...,,,,,,,,,,
2,68341763,,20000.0,20000.0,20000.0,10.78,432.66,63000.0,10.78,0.0,...,,,,,,,,,,
3,66310712,,35000.0,35000.0,35000.0,14.85,829.90,110000.0,17.06,0.0,...,,,,,,,,,,
4,68476807,,10400.0,10400.0,10400.0,22.45,289.91,104433.0,25.37,1.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,64360310,,12600.0,12600.0,12600.0,11.99,280.22,104000.0,15.06,1.0,...,,,,,,,,,,
19996,67475682,,29825.0,29825.0,29825.0,13.67,688.89,140000.0,11.93,0.0,...,,,,,,,,,,
19997,67245737,,6000.0,6000.0,6000.0,10.64,195.42,88000.0,24.64,1.0,...,,,,,,,,,,
19998,67265833,,19000.0,19000.0,19000.0,10.64,618.81,75000.0,12.59,0.0,...,,,,,,,,,,


In [20]:
empty_columns = numeric_sample_accepted_df.columns[numeric_sample_accepted_df.isna().all()]
empty_columns

Index(['member_id', 'revol_bal_joint', 'sec_app_fico_range_low',
       'sec_app_fico_range_high', 'sec_app_earliest_cr_line',
       'sec_app_inq_last_6mths', 'sec_app_mort_acc', 'sec_app_open_acc',
       'sec_app_revol_util', 'sec_app_open_act_il', 'sec_app_num_rev_accts',
       'sec_app_chargeoff_within_12_mths',
       'sec_app_collections_12_mths_ex_med',
       'sec_app_mths_since_last_major_derog'],
      dtype='object')
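
Columns that are entirely empty carry no information for this sample, so a natural next step is to drop them. The sketch below shows one way to do that on a small toy frame; `toy_df`, `empty_cols` and `trimmed_df` are illustrative names, and whether these columns should be dropped from the real data is a judgment call for the analysis.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the numeric columns (values are illustrative)
toy_df = pd.DataFrame({
    "loan_amnt": [3600.0, 24700.0, 20000.0],
    "member_id": [np.nan, np.nan, np.nan],   # entirely empty
    "dti": [5.91, np.nan, 10.78],            # partially empty, should be kept
})

# Same selection as in the cell above: columns where every value is missing
empty_cols = toy_df.columns[toy_df.isna().all()]

# Drop them by name; toy_df.dropna(axis=1, how="all") gives the same result
trimmed_df = toy_df.drop(columns=empty_cols)

print(list(empty_cols))
print(list(trimmed_df.columns))
```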

In [13]:
rows, cols = sample_accepted_df.shape
print('Number of Rows: ',rows)
print('Number of Columns: ', cols)

Number of Rows:  20000
Number of Columns:  151


In [14]:
sample_accepted_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Columns: 151 entries, id to settlement_term
dtypes: float64(114), int64(1), object(36)
memory usage: 23.0+ MB


In [39]:
sample_accepted_df.describe()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,...,deferral_term,hardship_amount,hardship_length,hardship_dpd,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,settlement_amount,settlement_percentage,settlement_term
count,20000.0,0.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,19999.0,20000.0,...,162.0,162.0,162.0,162.0,129.0,162.0,162.0,583.0,583.0,583.0
mean,67650010.0,,15048.7,15048.7,15042.995,12.251883,437.311208,78334.69,19.415594,0.3355,...,3.0,133.094136,3.0,14.32716,398.565814,10761.74537,181.75537,4989.886895,47.156123,13.996569
std,1904111.0,,8735.526302,8735.526302,8730.708785,4.208918,249.614324,60547.72,11.260728,0.887682,...,0.0,110.905181,0.0,10.152765,326.868419,6597.328315,180.698584,3604.108691,5.267301,7.503492
min,361774.0,,1000.0,1000.0,1000.0,5.32,30.54,0.0,0.0,0.0,...,3.0,5.28,3.0,0.0,33.84,594.07,0.06,250.0,30.0,0.0
25%,67347110.0,,8000.0,8000.0,8000.0,9.17,255.04,48000.0,12.655,0.0,...,3.0,48.7375,3.0,5.25,146.91,5861.255,48.3925,2085.93,45.0,8.0
50%,67695330.0,,14000.0,14000.0,13962.5,11.99,381.425,67000.0,18.83,0.0,...,3.0,102.855,3.0,16.0,320.52,9526.755,119.425,4178.73,45.0,14.0
75%,68242560.0,,20000.0,20000.0,20000.0,14.48,581.58,95000.0,25.64,0.0,...,3.0,182.1525,3.0,24.0,527.37,15115.545,266.66,7174.4,50.0,18.0
max,68617060.0,,35000.0,35000.0,35000.0,28.99,1354.66,3964280.0,999.0,15.0,...,3.0,629.7,3.0,30.0,1889.1,28479.59,780.05,17500.0,67.45,36.0


In [41]:
sample_accepted_df.isnull().sum()

id                           0
member_id                20000
loan_amnt                    0
funded_amnt                  0
funded_amnt_inv              0
                         ...  
settlement_status        19417
settlement_date          19417
settlement_amount        19417
settlement_percentage    19417
settlement_term          19417
Length: 151, dtype: int64
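
Raw counts are easier to compare as a share of the rows. For example, 19,417 missing values out of 20,000 rows means the settlement columns are about 97% empty. The following sketch summarizes missing values per column as both a count and a percentage; `missing_summary` is a hypothetical helper and the toy frame only mimics the column names above.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["n_missing"] > 0].sort_values(
        "pct_missing", ascending=False
    )


# Toy data (illustrative values only)
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "member_id": [np.nan] * 4,
    "settlement_term": [np.nan, np.nan, np.nan, 12.0],
})

toy_summary = missing_summary(toy_df)
print(toy_summary)
```

Running the same helper before and after cleaning would give a per-column view of how much data each cleaning step removes.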


In [42]:
object_sample_accepted_df = sample_accepted_df.select_dtypes(include=['object'])
object_sample_accepted_df

Unnamed: 0,term,grade,sub_grade,emp_title,emp_length,home_ownership,verification_status,issue_d,loan_status,pymnt_plan,...,hardship_status,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_loan_status,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date
0,36 months,C,C4,leadman,10+ years,MORTGAGE,Not Verified,Dec-2015,Fully Paid,n,...,,,,,,Cash,N,,,
1,36 months,C,C1,Engineer,10+ years,MORTGAGE,Not Verified,Dec-2015,Fully Paid,n,...,,,,,,Cash,N,,,
2,60 months,B,B4,truck driver,10+ years,MORTGAGE,Not Verified,Dec-2015,Fully Paid,n,...,,,,,,Cash,N,,,
3,60 months,C,C5,Information Systems Officer,10+ years,MORTGAGE,Source Verified,Dec-2015,Current,n,...,,,,,,Cash,N,,,
4,60 months,F,F1,Contract Specialist,3 years,MORTGAGE,Source Verified,Dec-2015,Fully Paid,n,...,,,,,,Cash,N,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,60 months,C,C1,Civil Aviation Security Specialist,10+ years,RENT,Not Verified,Dec-2015,Current,n,...,,,,,,Cash,N,,,
19996,60 months,C,C4,Sales manager,10+ years,OWN,Source Verified,Dec-2015,Charged Off,n,...,,,,,,Cash,N,,,
19997,36 months,B,B4,Teacher,8 years,RENT,Not Verified,Dec-2015,Fully Paid,n,...,,,,,,Cash,N,,,
19998,36 months,B,B4,Business Banking Specialist,3 years,MORTGAGE,Source Verified,Dec-2015,Fully Paid,n,...,,,,,,Cash,N,,,
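
Several of these object columns hold values that are really numeric or date-like, such as `term` ("36 months"), `issue_d` ("Dec-2015") and `emp_length` ("10+ years"). The sketch below shows one possible conversion on toy data. The new column names `term_months` and `emp_length_years` are assumptions made for illustration, and the regular expression keeps only the first run of digits, so a value such as "< 1 year", if present, would become 1.

```python
import pandas as pd

# Toy rows using the formats visible in the output above
toy_df = pd.DataFrame({
    "term": ["36 months", "60 months"],
    "issue_d": ["Dec-2015", "Dec-2015"],
    "emp_length": ["10+ years", "3 years"],
})

cleaned_df = toy_df.assign(
    # "36 months" -> 36
    term_months=pd.to_numeric(toy_df["term"].str.extract(r"(\d+)", expand=False)),
    # "Dec-2015" -> 2015-12-01 (month abbreviation, dash, four-digit year)
    issue_d=pd.to_datetime(toy_df["issue_d"], format="%b-%Y"),
    # "10+ years" -> 10; the "+" is discarded, which loses some information
    emp_length_years=pd.to_numeric(
        toy_df["emp_length"].str.extract(r"(\d+)", expand=False)
    ),
)

print(cleaned_df.dtypes)
```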


### Resources used:

- https://stackoverflow.com/questions/3777301/how-to-call-a-shell-script-from-python-code