# Part I - Prosper Loan Data Exploration 
## by Arthur Ezenwanne 
## Preliminary Wrangling


This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from pandas.api.types import CategoricalDtype

%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)

In [2]:
# load the dataset into a pandas DataFrame
df = pd.read_csv('prosperLoanData.csv')

In [3]:
# high-level overview of data shape and composition
display(df.shape)
display(df.sample(5))

(113937, 81)

| | ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26745 | 34E335311421355064BD4B1 | 538439 | 2011-11-10 12:29:46.320000000 | | 36 | Current | | 0.20564 | 0.1764 | 0.1664 | ... | -123.75 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 160 |
| 17249 | 591734200353985138D7A33 | 321075 | 2008-04-29 09:34:51.247000000 | C | 36 | Completed | 2011-05-07 00:00:00 | 0.12562 | 0.1045 | 0.0945 | ... | -32.52 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 36 |
| 90984 | 8709357422545975709B8E3 | 741380 | 2013-04-01 21:58:16.160000000 | | 60 | Current | | 0.21566 | 0.1914 | 0.1814 | ... | -76.12 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 49624 | 1CFC3551987220161DA3121 | 613750 | 2012-07-18 10:05:15.840000000 | | 60 | Completed | 2012-11-13 00:00:00 | 0.27462 | 0.2489 | 0.2389 | ... | -9.36 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 75 |
| 18275 | 4AC23432580801124E72602 | 403341 | 2008-09-24 06:39:17.357000000 | E | 36 | Completed | 2011-10-06 00:00:00 | 0.41355 | 0.35 | 0.34 | ... | -17.96 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 14 |


In [4]:
# view info about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss

The dataset has many features, which makes it hard to show the relationships among them clearly. I will therefore focus my exploration on about 15-20 features of interest: those that help answer key questions about the relationships between loan outcomes and the borrower's interest rate, and about the effect of the borrower's economic status (homeownership, employment status, etc.) on the loan amount.

In [5]:
# select the desired columns subset
cols = ['Term', 'LoanStatus', 'BorrowerAPR', 'BorrowerRate', 'EstimatedReturn',  
        'ProsperRating (Alpha)', 'ListingCategory (numeric)', 'EmploymentStatus', 'EmploymentStatusDuration', 
        'IsBorrowerHomeowner', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'DebtToIncomeRatio', 'IncomeRange', 
        'LoanOriginalAmount', 'LoanOriginationQuarter', 'PercentFunded', 'InvestmentFromFriendsAmount', 'BorrowerState']

In [6]:
# new high-level overview of data shape and composition
df = df[cols]
display(df.shape)
display(df.sample(5))

(113937, 19)

| | Term | LoanStatus | BorrowerAPR | BorrowerRate | EstimatedReturn | ProsperRating (Alpha) | ListingCategory (numeric) | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CreditScoreRangeLower | CreditScoreRangeUpper | DebtToIncomeRatio | IncomeRange | LoanOriginalAmount | LoanOriginationQuarter | PercentFunded | InvestmentFromFriendsAmount | BorrowerState |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12396 | 36 | Current | 0.22147 | 0.184 | 0.08172 | C | 1 | Employed | 30.0 | False | 640.0 | 659.0 | 0.09 | $25,000-49,999 | 5000 | Q1 2014 | 1.0 | 0.0 | MO |
| 17635 | 60 | Completed | 0.28324 | 0.2573 | 0.1487 | D | 1 | Employed | 27.0 | False | 740.0 | 759.0 | 0.38 | $25,000-49,999 | 8000 | Q3 2012 | 1.0 | 0.0 | WA |
| 1183 | 60 | Current | 0.27257 | 0.2469 | 0.099 | D | 15 | Employed | 138.0 | False | 700.0 | 719.0 | 0.29 | $25,000-49,999 | 4000 | Q4 2013 | 1.0 | 0.0 | MO |
| 102760 | 60 | Current | 0.25789 | 0.2326 | 0.1449 | C | 1 | Employed | 76.0 | True | 700.0 | 719.0 | 0.5 | $50,000-74,999 | 13000 | Q3 2012 | 1.0 | 0.0 | IL |
| 43146 | 36 | Current | 0.17969 | 0.1435 | 0.074 | B | 1 | Employed | 1.0 | True | 680.0 | 699.0 | 0.3 | $75,000-99,999 | 25000 | Q4 2013 | 1.0 | 0.0 | NY |


In [7]:
# view data info()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 19 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Term                         113937 non-null  int64  
 1   LoanStatus                   113937 non-null  object 
 2   BorrowerAPR                  113912 non-null  float64
 3   BorrowerRate                 113937 non-null  float64
 4   EstimatedReturn              84853 non-null   float64
 5   ProsperRating (Alpha)        84853 non-null   object 
 6   ListingCategory (numeric)    113937 non-null  int64  
 7   EmploymentStatus             111682 non-null  object 
 8   EmploymentStatusDuration     106312 non-null  float64
 9   IsBorrowerHomeowner          113937 non-null  bool   
 10  CreditScoreRangeLower        113346 non-null  float64
 11  CreditScoreRangeUpper        113346 non-null  float64
 12  DebtToIncomeRatio            105383 non-null  float64
 13 

A little wrangling of this dataset is required. Observed issues include:
1. Feature names do not follow a uniform naming convention.
2. Some features contain null values.
3. Some features should be combined into a single feature column.
4. Some features are better represented as ordinal categorical data types, and the data types of some other features should be changed.
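
As a rough illustration of issues 3 and 4, the sketch below works on a small made-up frame rather than the full dataset. It assumes that the two credit score bounds are the features to be combined (taking their midpoint is only one option) and that Prosper ratings run from `HR` (riskiest) to `AA` (safest). The `CreditScoreMid` column name and the rating order are assumptions for illustration, not something the outputs above confirm.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy rows standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    'ProsperRating (Alpha)': ['C', 'D', 'B', None],
    'CreditScoreRangeLower': [640.0, 740.0, 680.0, 700.0],
    'CreditScoreRangeUpper': [659.0, 759.0, 699.0, 719.0],
})

# Issue 3: collapse the lower/upper bounds into one column (assumed: midpoint)
bounds = ['CreditScoreRangeLower', 'CreditScoreRangeUpper']
toy['CreditScoreMid'] = toy[bounds].mean(axis=1)
toy = toy.drop(columns=bounds)

# Issue 4: ordered categorical, with an assumed order from riskiest to safest
rating_order = ['HR', 'E', 'D', 'C', 'B', 'A', 'AA']
rating_dtype = CategoricalDtype(categories=rating_order, ordered=True)
toy['ProsperRating (Alpha)'] = toy['ProsperRating (Alpha)'].astype(rating_dtype)
```

With an ordered dtype, sorting and plotting follow the rating scale rather than alphabetical order, and missing ratings stay as nulls instead of becoming a category.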

I will use the `Define - Code - Test` approach to clean the dataset.
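
One Define - Code - Test cycle might look like the sketch below, shown on a toy frame for the column-naming issue. The `to_snake_case` helper is hypothetical, and snake_case is only an assumed naming target; the actual cleaning may choose a different convention.

```python
import re
import pandas as pd

# Toy frame with a few of the selected column names (values are illustrative)
toy = pd.DataFrame({
    'ProsperRating (Alpha)': ['C', 'D'],
    'ListingCategory (numeric)': [1, 15],
    'LoanOriginalAmount': [5000, 8000],
})

# Define: column names mix CamelCase with spaces and parenthesised suffixes;
# rename them to snake_case and drop the suffixes.

# Code
def to_snake_case(name):
    """Hypothetical helper: 'ProsperRating (Alpha)' -> 'prosper_rating'."""
    name = re.sub(r'\s*\(.*?\)', '', name)  # strip ' (Alpha)', ' (numeric)'
    # insert '_' wherever a lowercase letter or digit is followed by a capital
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

toy = toy.rename(columns=to_snake_case)

# Test
assert list(toy.columns) == ['prosper_rating', 'listing_category',
                             'loan_original_amount']
```

Keeping the test as a plain `assert` right after the code step makes each cleaning operation self-checking when the notebook is re-run.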