Getting and Cleaning Data
====

### 01 - Introduction

Lending Club is a peer-to-peer lending company based in the United States. Investors provide funds to potential borrowers and earn a return that depends on the risk they take (the borrower's credit score). Lending Club provides the "bridge" between investors and borrowers.

For more information about the company, please check out the official [website](https://www.lendingclub.com).

__About the data__

Lending Club provides several [CSV files](https://www.lendingclub.com/info/download-data.action) that contain complete loan data for all loans issued from 2007 to the last quarter of 2017, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. Additional features such as credit scores, number of finance inquiries, zip codes, states, and collections are also included.

Lending Club also provides a Data Dictionary with definitions for all the data attributes included in the Historical data file and the In Funding data file.

In this notebook we are going to get familiar with the data, following this plan of action:
* Explore the files using the shell.
* Concatenate all files into a single dataset.
* Check for NAs in the dataset and decide what to do with them.
* Select the columns of interest using the Lending Club Data Dictionary in XLSX format.

Let's get started!!!

### 02 - Explore Files

__Libraries__

In [1]:
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

import os

import numpy as np
import pandas as pd

We are going to use the shell for a first exploration of the data:

In [2]:
raw_data_path = "../data/raw"

In [3]:
!ls -lh {raw_data_path}

total 1,4G
-rw-rw-r-- 1 dsc dsc 104M may  2 12:24 LoanStats_2016Q1.csv
-rw-rw-r-- 1 dsc dsc  76M may  2 12:42 LoanStats_2016Q2.csv
-rw-rw-r-- 1 dsc dsc  77M may  2 12:56 LoanStats_2016Q3.csv
-rw-rw-r-- 1 dsc dsc  81M may  2 13:13 LoanStats_2016Q4.csv
-rw-rw-r-- 1 dsc dsc  76M may  2 13:29 LoanStats_2017Q1.csv
-rw-rw-r-- 1 dsc dsc  83M may  2 13:47 LoanStats_2017Q2.csv
-rw-rw-r-- 1 dsc dsc  96M may  2 14:07 LoanStats_2017Q3.csv
-rw-rw-r-- 1 dsc dsc  93M may  2 14:27 LoanStats_2017Q4.csv
-rw-rw-r-- 1 dsc dsc  41M may  2 14:34 LoanStats3a.csv
-rw-rw-r-- 1 dsc dsc 155M may  2 15:04 LoanStats3b.csv
-rw-rw-r-- 1 dsc dsc 180M may  2 15:38 LoanStats3c.csv
-rw-rw-r-- 1 dsc dsc 319M may  2 16:37 LoanStats3d.csv


There are 12 files with a total disk size of 1.4 GB, all downloaded from the [Lending Club](https://www.lendingclub.com/info/download-data.action) website. The `LoanStats3a` file contains the loans issued from 2007 to 2011, `LoanStats3b` those from 2012 to 2013, and `LoanStats3c` and `LoanStats3d` those from 2014 and 2015, respectively. From 2016 onwards, the loans are split by quarter (files named `LoanStats_YEARQUARTER.csv`).
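
The naming scheme can be turned into a small lookup. This is a minimal sketch and not part of the original notebook: the helper name `period_for_file` is hypothetical, and the year ranges for the `LoanStats3*` files are simply the ones stated in the paragraph above.

```python
import re

# Year ranges for the pre-2016 files, as stated in the text above.
LEGACY_FILES = {
    "LoanStats3a.csv": "2007-2011",
    "LoanStats3b.csv": "2012-2013",
    "LoanStats3c.csv": "2014",
    "LoanStats3d.csv": "2015",
}

# Quarterly files follow the pattern LoanStats_<YEAR>Q<QUARTER>.csv
QUARTERLY_PATTERN = re.compile(r"^LoanStats_(\d{4})Q([1-4])\.csv$")


def period_for_file(file_name):
    """Return a readable period for a file name (hypothetical helper)."""
    if file_name in LEGACY_FILES:
        return LEGACY_FILES[file_name]
    match = QUARTERLY_PATTERN.match(file_name)
    if match:
        year, quarter = match.groups()
        return "{} Q{}".format(year, quarter)
    raise ValueError("Unrecognised file name: {}".format(file_name))


print(period_for_file("LoanStats_2017Q4.csv"))
print(period_for_file("LoanStats3b.csv"))
```

Raising an error on unknown names is a deliberate choice here: a stray file in the raw data folder is reported instead of being silently mislabelled.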

Let us take a look at the first five lines of one of the CSV files, for example the loans issued in the last quarter of 2017:

In [4]:
!head -n 5 {raw_data_path}/LoanStats_2017Q4.csv

﻿"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","open_il_12m","open_il_24m","mths_since_rcnt_il","total_bal_il","il_util","open_rv_
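
For environments without a Unix shell, the same peek can be done in plain Python. This is a sketch under assumptions: the helper name `head` is hypothetical, and the files are assumed to be UTF-8 encoded. The `utf-8-sig` codec is used so that a byte order mark at the start of the file, if present, is dropped.

```python
from itertools import islice


def head(path, n=5, encoding="utf-8-sig"):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding=encoding) as handle:
        # islice stops after n lines, so large files are not loaded in full
        return [line.rstrip("\n") for line in islice(handle, n)]
```

With the variables defined earlier, a call could look like `head(os.path.join(raw_data_path, "LoanStats_2017Q4.csv"))`.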

Let us count the number of lines for each file using the shell:

In [5]:
list_dir = !ls {raw_data_path}

for file in list_dir:
    observations = !cat {raw_data_path}/{file} | wc -l
    print("{} has {} lines".format(file, observations[0]))

LoanStats_2016Q1.csv has 133892 lines
LoanStats_2016Q2.csv has 97859 lines
LoanStats_2016Q3.csv has 99125 lines
LoanStats_2016Q4.csv has 103551 lines
LoanStats_2017Q1.csv has 96784 lines
LoanStats_2017Q2.csv has 105456 lines
LoanStats_2017Q3.csv has 122706 lines
LoanStats_2017Q4.csv has 118653 lines
LoanStats3a.csv has 42543 lines
LoanStats3b.csv has 188186 lines
LoanStats3c.csv has 235634 lines
LoanStats3d.csv has 421100 lines
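
The same count can be obtained without the shell, which may help on systems where `cat` and `wc` are unavailable. This is a minimal sketch with a hypothetical helper name, `count_lines`, that is not part of the original notebook. Like `wc -l`, it counts newline characters, so the header line and any line breaks embedded in quoted fields are included in the total.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters in a file, reading it in binary chunks."""
    total = 0
    with open(path, "rb") as handle:
        # iter(callable, sentinel) keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total
```

Reading in fixed-size chunks keeps memory use flat, which matters for files of several hundred megabytes such as `LoanStats3d.csv`.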


### 03 - Concatenate all files

In [20]:
# Read each raw file into a DataFrame and collect the results in a list
list_ = []
for file in list_dir:
    full_path = os.path.join(raw_data_path, file)
    df = pd.read_csv(full_path, sep=",")
    list_.append(df)

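
One possible way to turn a folder of CSV files into a single dataset is `pd.concat`. The sketch below is an illustration, not necessarily the approach taken in this notebook: the helper name `load_and_concat` is hypothetical, and it assumes that every `.csv` file in the directory should be included.

```python
import os

import pandas as pd


def load_and_concat(directory):
    """Read every .csv file in `directory` and stack the rows into one DataFrame."""
    frames = []
    for file_name in sorted(os.listdir(directory)):
        if not file_name.endswith(".csv"):
            continue  # skip anything that is not a CSV file
        frames.append(pd.read_csv(os.path.join(directory, file_name)))
    # ignore_index=True builds a fresh 0..n-1 index instead of
    # repeating each file's own row numbers
    return pd.concat(frames, ignore_index=True)
```

Columns that are missing from some files are filled with `NaN` by `pd.concat`, so it is worth checking the NA counts afterwards.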
