Getting and Cleaning Data
====

### 01 - Introduction

Lending Club is a peer-to-peer lending company based in the United States, in which investors provide funds for potential borrowers and earn a profit that depends on the risk they take (the borrower's credit score). Lending Club provides the "bridge" between investors and borrowers.

For more information about the company, please check out the official [website](https://www.lendingclub.com).

__About the data__

Lending Club provides several [csv files](https://www.lendingclub.com/info/download-data.action) that contain complete loan data for all loans issued from 2007 through the last quarter of 2017, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. Additional features such as credit scores, number of finance inquiries, zip codes, states, and collections are included, among others.

Lending Club also provides a Data Dictionary that defines every data attribute included in the Historical data file and the In Funding data file.

In this notebook we are going to get familiar with the data, following this plan of action:
* Explore the files using the shell.
* Concatenate all files into a single dataset.
* Select the columns of interest using the Lending Club Data Dictionary in Excel format.
* Check for NAs in the dataset and decide what to do with them.

Let's get started!!!

### 02 - Explore Files

__Libraries__

In [1]:
# suppress warnings
import warnings
warnings.filterwarnings("ignore")

import os

import numpy as np
import pandas as pd

We are going to use the shell to take a first look at the data and get familiar with it:

In [2]:
raw_data_path = "../data/raw"

In [3]:
!ls -lh {raw_data_path}

total 1,4G
-rw-rw-r-- 1 juanan juanan 104M may  2 12:24 LoanStats_2016Q1.csv
-rw-rw-r-- 1 juanan juanan  76M may  2 12:42 LoanStats_2016Q2.csv
-rw-rw-r-- 1 juanan juanan  77M may  2 12:56 LoanStats_2016Q3.csv
-rw-rw-r-- 1 juanan juanan  81M may  2 13:13 LoanStats_2016Q4.csv
-rw-rw-r-- 1 juanan juanan  76M may  2 13:29 LoanStats_2017Q1.csv
-rw-rw-r-- 1 juanan juanan  83M may  2 13:47 LoanStats_2017Q2.csv
-rw-rw-r-- 1 juanan juanan  96M may  2 14:07 LoanStats_2017Q3.csv
-rw-rw-r-- 1 juanan juanan  93M may  2 14:27 LoanStats_2017Q4.csv
-rw-rw-r-- 1 juanan juanan  41M may  2 14:34 LoanStats3a.csv
-rw-rw-r-- 1 juanan juanan 155M may  2 15:04 LoanStats3b.csv
-rw-rw-r-- 1 juanan juanan 180M may  2 15:38 LoanStats3c.csv
-rw-rw-r-- 1 juanan juanan 319M may  2 16:37 LoanStats3d.csv


There are 12 files with a total disk size of 1.4 GB. All of them have been downloaded from the [Lending Club](https://www.lendingclub.com/info/download-data.action) website. The `LoanStats3a` file corresponds to the loans issued from 2007 to 2011, `LoanStats3b` to those from 2012 and 2013, and `LoanStats3c` and `LoanStats3d` to the years 2014 and 2015 respectively. From 2016 onwards, the loans are split by quarter (files named `LoanStats_YEARQUARTER.csv`).

Let's take a look at the first four lines of one of the csv files, for example the one with the loans issued in the last quarter of 2017:

In [4]:
!head -n 4 {raw_data_path}/LoanStats_2017Q4.csv

﻿"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","open_il_12m","open_il_24m","mths_since_rcnt_il","total_bal_il","il_util","open_rv_

Let's count the number of lines for each file using the shell:

In [5]:
list_dir = !ls {raw_data_path}

for file in list_dir:
    observations = !cat {raw_data_path}/{file} | wc -l
    print("{} has {} lines".format(file, observations[0]))

LoanStats_2016Q1.csv has 133892 lines
LoanStats_2016Q2.csv has 97859 lines
LoanStats_2016Q3.csv has 99125 lines
LoanStats_2016Q4.csv has 103551 lines
LoanStats_2017Q1.csv has 96784 lines
LoanStats_2017Q2.csv has 105456 lines
LoanStats_2017Q3.csv has 122706 lines
LoanStats_2017Q4.csv has 118653 lines
LoanStats3a.csv has 42543 lines
LoanStats3b.csv has 188186 lines
LoanStats3c.csv has 235634 lines
LoanStats3d.csv has 421100 lines
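
The loop above relies on IPython's `!` syntax and on Unix tools. As a rough, portable alternative, the same count can be sketched in plain Python. The helper names `count_lines` and `report_line_counts` are hypothetical, and the directory is assumed to be the same `../data/raw` folder. Like `wc -l`, this counts newline characters, so the header line is included in each total.

```python
import os


def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters in a file, reading it in binary chunks."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def report_line_counts(directory):
    """Return {file name: line count} for every .csv file in a directory."""
    counts = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(".csv"):
            counts[name] = count_lines(os.path.join(directory, name))
    return counts


raw_data_path = "../data/raw"  # assumed to match the notebook's layout
if os.path.isdir(raw_data_path):
    for name, lines in report_line_counts(raw_data_path).items():
        print("{} has {} lines".format(name, lines))
```

Reading in chunks keeps memory use flat even for the largest file, which matters little here but avoids loading 319 MB at once.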


### 03 - Concatenate all files

We are going to concatenate all the previous files into a single dataframe:

In [6]:
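# Read every raw file and collect the resulting dataframes
frames = []
for file in list_dir:
    full_path = os.path.join(raw_data_path, file)
    df = pd.read_csv(full_path, sep=",")
    frames.append(df)

# Stack all the files into a single dataframe
loans = pd.concat(frames)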
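
Once the files are stacked, it is no longer possible to tell which file a row came from, and the row index restarts at 0 for every file. The following is a minimal sketch of one way to address both points, not part of the original workflow. The function `concat_with_source` and the column `source_file` are hypothetical names, and small in-memory CSVs stand in for the real files.

```python
import io

import pandas as pd


def concat_with_source(sources):
    """Read each (name, path or file-like) pair and stack the results.

    `source_file` is a hypothetical extra column recording the origin of a row.
    """
    frames = []
    for name, source in sources:
        frame = pd.read_csv(source)
        frame["source_file"] = name
        frames.append(frame)
    # ignore_index=True gives the result a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True, sort=False)


# Toy stand-ins for two quarterly files; the second one has an extra column
q1 = io.StringIO("loan_amnt,grade\n10000,D\n35000,E\n")
q2 = io.StringIO("loan_amnt,grade,disbursement_method\n20000,B,Cash\n")

toy = concat_with_source([("toy_Q1.csv", q1), ("toy_Q2.csv", q2)])
print(toy)
```

Columns that exist in only some of the inputs are filled with NaN for the rows that lack them, which is one reason a concatenated dataframe can end up with many sparse columns.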

Let's take a look at the first 10 rows and check the shape of the raw dataframe:

In [7]:
loans.head(10)

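| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | 10000.0 | 10000.0 | 10000.0 | 60 months | 19.53% | 262.34 | D | D5 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | 35000.0 | 35000.0 | 35000.0 | 60 months | 20.75% | 941.96 | E | E2 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | 20000.0 | 20000.0 | 20000.0 | 60 months | 9.16% | 416.73 | B | B2 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | 17475.0 | 17475.0 | 17475.0 | 60 months | 11.47% | 384.06 | B | B5 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | 8000.0 | 8000.0 | 8000.0 | 36 months | 9.16% | 255.0 | B | B2 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | 14400.0 | 14400.0 | 14400.0 | 36 months | 10.75% | 469.74 | B | B4 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | 18000.0 | 18000.0 | 18000.0 | 60 months | 11.99% | 400.31 | C | C1 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | NaN | NaN | 5800.0 | 5800.0 | 5800.0 | 36 months | 11.47% | 191.18 | B | B5 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | NaN | NaN | 12500.0 | 12500.0 | 12500.0 | 60 months | 14.46% | 293.85 | C | C4 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | NaN | NaN | 3000.0 | 3000.0 | 3000.0 | 36 months | 7.39% | 93.17 | A | A4 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |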


In [8]:
loans.shape

(1765451, 145)

We have 145 variables and 1,765,451 observations in the dataframe. However, the head of the data suggests that many columns are filled with NaN values, which we will have to deal with.

### 04 - Select columns of interest

To select the columns of interest we are going to use the __data dictionary__ in Excel format that Lending Club provides on its website.

In [9]:
columns_of_interest = ["funded_amnt_inv", "term", "issue_d", "installment", "int_rate", 
                       "grade", "emp_title", "emp_length", "annual_inc", "title",
                       "dti", "home_ownership", "zip_code", "addr_state", "last_pymnt_amnt",  
                       "total_pymnt_inv", "total_rec_late_fee", "application_type", "total_acc", "loan_status"]

The meanings of the variables that have been considered most important in the data dictionary are shown below:

`funded_amnt_inv`: The total amount committed by investors for that loan at that point in time.

`term`:	The number of payments on the loan. Values are in months and can be either 36 or 60.

`issue_d`:	The month which the loan was funded

`installment`:	The monthly payment owed by the borrower if the loan originates.

`int_rate`:	Interest Rate on the loan

`grade`:	LC assigned loan grade

`emp_title`: The job title supplied by the Borrower when applying for the loan.

`emp_length`:	Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years. 

`annual_inc`:	The self-reported annual income provided by the borrower during registration.

`title`:	The loan title provided by the borrower

`dti`:	A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

`home_ownership`:	The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER

`zip_code`:	The first 3 numbers of the zip code provided by the borrower in the loan application.

`addr_state`:	The state provided by the borrower in the loan application

`last_pymnt_amnt`:	Last total payment amount received

`total_pymnt_inv`:	Payments received to date for portion of total amount funded by investors

`total_rec_late_fee`:	Late fees received to date

`application_type`: Indicates whether the loan is an individual application or a joint application with two co-borrowers

`total_acc`: The total number of credit lines currently in the borrower's credit file

`loan_status`:	Current status of the loan


In [10]:
loans = loans[columns_of_interest]

The new dataset, a little bit cleaner:

In [11]:
loans.head()

| | funded_amnt_inv | term | issue_d | installment | int_rate | grade | emp_title | emp_length | annual_inc | title | dti | home_ownership | zip_code | addr_state | last_pymnt_amnt | total_pymnt_inv | total_rec_late_fee | application_type | total_acc | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000.0 | 60 months | Mar-2016 | 262.34 | 19.53% | D | lpn/charge nurse | 4 years | 52000.0 | Other | 15.0 | OWN | 317xx | GA | 9575.49 | 11127.83 | 0.0 | Individual | 12.0 | Fully Paid |
| 1 | 35000.0 | 60 months | Mar-2016 | 941.96 | 20.75% | E | Coiler | 3 years | 85000.0 | Debt consolidation | 24.98 | MORTGAGE | 144xx | NY | 509.1 | 37226.47 | 0.0 | Individual | 19.0 | Fully Paid |
| 2 | 20000.0 | 60 months | Mar-2016 | 416.73 | 9.16% | B | Reliability Engineer | 1 year | 77000.0 | Home improvement | 13.75 | MORTGAGE | 606xx | IL | 416.73 | 9147.7 | 0.0 | Individual | 19.0 | Current |
| 3 | 17475.0 | 60 months | Mar-2016 | 384.06 | 11.47% | B | NaN | NaN | 41682.0 | Debt consolidation | 30.06 | MORTGAGE | 796xx | TX | 384.06 | 8432.61 | 0.0 | Individual | 18.0 | Current |
| 4 | 8000.0 | 36 months | Mar-2016 | 255.0 | 9.16% | B | Technician | 10+ years | 72000.0 | Debt consolidation | 22.63 | RENT | 217xx | MD | 255.0 | 5601.86 | 0.0 | Individual | 12.0 | Current |


In [12]:
loans.shape

(1765451, 20)

We now have a total of 20 columns in the dataset, which seems to be a good number for starting to study the data more deeply.
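
Loading all 145 columns only to discard most of them uses more memory than necessary. As a hedged alternative to the approach taken here, the `usecols` parameter of `pd.read_csv` restricts parsing to the listed columns. The sketch below uses a toy in-memory CSV and a shortened column list named `wanted`, both of which are illustrative assumptions.

```python
import io

import pandas as pd

wanted = ["term", "int_rate", "loan_status"]  # shortened list for the toy example

toy_csv = io.StringIO(
    "id,term,int_rate,grade,loan_status\n"
    ",60 months,19.53%,D,Fully Paid\n"
    ",36 months,9.16%,B,Current\n"
)

# usecols makes read_csv parse only the listed columns
subset = pd.read_csv(toy_csv, usecols=wanted)
print(subset.columns.tolist())
print(subset.shape)
```

One caveat: `read_csv` raises an error when a name in `usecols` is absent from a file, so this only works if every raw file contains all the selected columns.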

### 05 - Check for NAs in the dataset

Now that we have trimmed the dataset, we are going to calculate the fraction of NAs per feature:

In [13]:
na_count = loans.isnull().sum() / loans.shape[0]

na_count

funded_amnt_inv       0.000014
term                  0.000014
issue_d               0.000014
installment           0.000014
int_rate              0.000014
grade                 0.000014
emp_title             0.063630
emp_length            0.059444
annual_inc            0.000016
title                 0.013226
dti                   0.000342
home_ownership        0.000014
zip_code              0.000015
addr_state            0.000014
last_pymnt_amnt       0.000014
total_pymnt_inv       0.000014
total_rec_late_fee    0.000014
application_type      0.000014
total_acc             0.000031
loan_status           0.000014
dtype: float64
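
Fractions this small are hard to compare at a glance. A minimal sketch of the same computation expressed as sorted percentages follows, using a toy dataframe with made-up missing values rather than the real `loans` data.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately placed missing values
toy = pd.DataFrame(
    {
        "emp_title": ["Coiler", np.nan, "Technician", np.nan],
        "grade": ["E", "B", "B", "D"],
        "dti": [24.98, 30.06, np.nan, 15.0],
    }
)

# isnull().mean() is the share of missing values per column
na_percent = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(na_percent.round(1))
```

Taking the mean of the boolean mask is equivalent to dividing the NA count by the number of rows, as done in the cell above.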

There are not too many NAs: the highest shares are about 6.4% in `emp_title` and 5.9% in `emp_length`, followed by about 1.3% in `title`, which does not seem like much. However, every column has at least a tiny fraction of NAs (0.000014, roughly 25 of the 1,765,451 rows), which suggests that some rows are entirely empty. Let's check whether this is true:

In [14]:
loans[loans["loan_status"].isnull()].shape[0]

25
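
Strictly speaking, a missing `loan_status` shows that a row is incomplete, not that it is entirely empty. A more direct check, sketched here on a toy dataframe with made-up values, is to ask whether every column in a row is missing using `isnull().all(axis=1)`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "grade": ["D", np.nan, "B", np.nan],
        "loan_status": ["Fully Paid", np.nan, np.nan, np.nan],
    }
)

# Rows without a status, regardless of the other columns
missing_status = int(toy["loan_status"].isnull().sum())
# Rows in which every single column is missing
fully_empty = int(toy.isnull().all(axis=1).sum())

print(missing_status, fully_empty)
```

In the toy data the two counts differ because one row has a grade but no status. When the two counts agree on the real data, the rows with a missing status are exactly the empty ones.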

Let's remove these rows, along with any duplicate rows that may exist:

In [15]:
# NA's rows:
loans = loans.dropna(axis = 0, how = "all")

In [16]:
# Remove duplicates:
loans = loans.drop_duplicates()

The final shape:

In [17]:
loans.shape

(1765426, 20)
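
The final shape shows how many rows were removed in total, but not how many each step removed. A small sketch that keeps track of both, run on a toy dataframe with made-up values (all variable names are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "grade": ["D", "D", np.nan, "B"],
        "loan_status": ["Current", "Current", np.nan, "Late"],
    }
)

rows_start = len(toy)

# Step 1: drop rows in which every column is missing
no_empty = toy.dropna(axis=0, how="all")
empty_removed = rows_start - len(no_empty)

# Step 2: drop exact duplicates of an earlier row
deduplicated = no_empty.drop_duplicates()
duplicates_removed = len(no_empty) - len(deduplicated)

print("empty rows removed:", empty_removed)
print("duplicate rows removed:", duplicates_removed)
```

Note that `drop_duplicates` compares only the columns still present, so two distinct loans that happen to share all selected values would be treated as duplicates.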

Finally, we have to write the clean dataframe for further analysis:

In [18]:
loans.to_csv("../data/clean/loans.csv", sep = "^", index=False)

We have reduced the disk size of the data from 1.4 GB to 274 MB!

In [19]:
!ls -lh ../data/clean

total 274M
-rw-rw-r-- 1 juanan juanan 274M may 13 14:41 loans.csv
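
Because the file is written with `^` as the separator, any code that reads it back has to pass the same `sep`. A minimal round-trip sketch, using a temporary folder and a toy dataframe instead of the real `../data/clean/loans.csv`:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {
        "emp_title": ["lpn/charge nurse", "Reliability Engineer"],
        "zip_code": ["317xx", "606xx"],
    }
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "loans.csv")
    toy.to_csv(path, sep="^", index=False)
    # The reader has to be told about the non-default separator
    restored = pd.read_csv(path, sep="^")

print(restored)
```

Reading the file with the default comma separator would instead collapse each line into a single column.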


In the next notebook, we are going to perform an __Exploratory Data Analysis__ in order to understand the relationships between features and to process the variables that we will use for our machine learning models.