# Loan Data from Prosper Exploration
## by Jeff Mitchell

## Preliminary Wrangling

This dataset contains nearly 114,000 loans from Prosper.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline


In [3]:
loans = pd.read_csv('prosperLoanData.csv')

In [19]:
# high-level overview of data shape and composition
print(loans.shape)

(113937, 81)


In [6]:
# descriptive statistics for numeric variables
print(loans.describe())

       ListingNumber           Term    BorrowerAPR   BorrowerRate  \
count   1.139370e+05  113937.000000  113912.000000  113937.000000   
mean    6.278857e+05      40.830248       0.218828       0.192764   
std     3.280762e+05      10.436212       0.080364       0.074818   
min     4.000000e+00      12.000000       0.006530       0.000000   
25%     4.009190e+05      36.000000       0.156290       0.134000   
50%     6.005540e+05      36.000000       0.209760       0.184000   
75%     8.926340e+05      36.000000       0.283810       0.250000   
max     1.255725e+06      60.000000       0.512290       0.497500   

         LenderYield  EstimatedEffectiveYield  EstimatedLoss  EstimatedReturn  \
count  113937.000000             84853.000000   84853.000000     84853.000000   
mean        0.182701                 0.168661       0.080306         0.096068   
std         0.074516                 0.068467       0.046764         0.030403   
min        -0.010000                -0.182700       0.

In [18]:
loans['LoanStatus'].value_counts()

Current                   56576
Completed                 38074
Chargedoff                11992
Defaulted                  5018
Past Due (1-15 days)        806
Past Due (31-60 days)       363
Past Due (61-90 days)       313
Past Due (91-120 days)      304
Past Due (16-30 days)       265
FinalPaymentInProgress      205
Past Due (>120 days)         16
Cancelled                     5
Name: LoanStatus, dtype: int64
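Raw counts are easier to compare as shares of the total. The sketch below rebuilds the counts printed above as a Series and divides by their sum. On the full dataframe, `loans['LoanStatus'].value_counts(normalize=True)` should give the same proportions directly. The names `status_counts` and `status_share` are only illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above
status_counts = pd.Series({
    'Current': 56576,
    'Completed': 38074,
    'Chargedoff': 11992,
    'Defaulted': 5018,
    'Past Due (1-15 days)': 806,
    'Past Due (31-60 days)': 363,
    'Past Due (61-90 days)': 313,
    'Past Due (91-120 days)': 304,
    'Past Due (16-30 days)': 265,
    'FinalPaymentInProgress': 205,
    'Past Due (>120 days)': 16,
    'Cancelled': 5,
}, name='LoanStatus')

# Share of all loans in each status
status_share = status_counts / status_counts.sum()
print(status_share.round(4))
```

The counts sum to 113,937, matching the row count reported by `loans.shape`, so every loan has a status recorded.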

In [32]:
# Convert the LoanStatus column to categorical types
loans['LoanStatus'] = loans['LoanStatus'].astype('category')
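As a rough illustration of what the conversion does, the sketch below applies `astype('category')` to a small toy Series standing in for the `LoanStatus` column (the values are made up for the example). The result stores each distinct label once and, by default, is unordered with categories sorted alphabetically.

```python
import pandas as pd

# Toy data standing in for the LoanStatus column
status = pd.Series(['Current', 'Completed', 'Current', 'Chargedoff', 'Completed'])

# Convert from object (string) dtype to categorical dtype
status_cat = status.astype('category')

print(status_cat.dtype)                  # dtype of the converted Series
print(list(status_cat.cat.categories))   # distinct labels stored once
print(status_cat.cat.ordered)            # whether the categories have an order
```

If a meaningful order is needed later (for example, ranking the Past Due buckets by days late), it would have to be specified explicitly, since the default conversion does not assume one.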

In [17]:
# rows whose LoanNumber repeats an earlier row
dupes = loans[loans['LoanNumber'].duplicated()]
dupes.sort_values('LoanNumber')

Unnamed: 0,ListingKey,ListingNumber,ListingCreationDate,CreditGrade,Term,LoanStatus,ClosedDate,BorrowerAPR,BorrowerRate,LenderYield,...,LP_ServiceFees,LP_CollectionFees,LP_GrossPrincipalLoss,LP_NetPrincipalLoss,LP_NonPrincipalRecoverypayments,PercentFunded,Recommendations,InvestmentFromFriendsCount,InvestmentFromFriendsAmount,Investors
101337,F7313585219017186FA3C46,859848,2013-08-04 19:00:15.727000000,,36,Current,,0.10038,0.0869,0.0769,...,-119.53,0.0,0.0,0.0,0.0,1.0,1,0,0.0,607
26357,2F1D3586569876887E1B09D,875651,2013-08-21 05:36:58.137000000,,60,Current,,0.26877,0.2432,0.2332,...,-37.06,0.0,0.0,0.0,0.0,1.0,0,0,0.0,2
50761,CDFB3588980067860F9C7F3,893227,2013-09-12 16:56:42.890000000,,36,Current,,0.20462,0.1679,0.1579,...,-60.10,0.0,0.0,0.0,0.0,1.0,0,0,0.0,1
66322,C17535874571157867FA321,885990,2013-09-03 08:52:26.490000000,,36,Current,,0.26528,0.2272,0.2172,...,-16.09,0.0,0.0,0.0,0.0,1.0,0,0,0.0,1
64633,5CEB35892993375750F2B06,894308,2013-09-11 15:57:35.400000000,,60,Current,,0.23052,0.2059,0.1959,...,-61.63,0.0,0.0,0.0,0.0,1.0,0,0,0.0,1
102234,64E33587560804418EAAE9F,887683,2013-09-04 09:01:14.540000000,,60,Current,,0.17522,0.1519,0.1419,...,-40.96,0.0,0.0,0.0,0.0,1.0,0,0,0.0,133
10401,68153589168873924D3A78D,898052,2013-09-12 12:53:22.680000000,,36,Current,,0.29537,0.2566,0.2466,...,-16.12,0.0,0.0,0.0,0.0,1.0,0,0,0.0,1
58686,4BC835901005727268E6551,905592,2013-09-17 17:21:30.453000000,,36,Current,,0.22108,0.1840,0.1740,...,-100.29,0.0,0.0,0.0,0.0,1.0,0,0,0.0,1
86717,958D358643285462008F535,864112,2013-08-07 20:23:57.063000000,,60,Current,,0.18197,0.1585,0.1485,...,-61.47,0.0,0.0,0.0,0.0,1.0,0,0,0.0,1
96475,B6A835897654847592040D8,900306,2013-09-11 16:04:18.790000000,,60,Completed,2013-12-06 00:00:00,0.22601,0.2015,0.1915,...,-29.86,0.0,0.0,0.0,0.0,1.0,0,0,0.0,19
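By default, `duplicated()` marks only the second and later occurrences of a value, so the rows shown above do not include the first copy of each loan. One possible way to inspect whole duplicate groups and then keep a single row per loan is sketched below on a toy frame with hypothetical values. Whether dropping the repeats is appropriate for this dataset is an assumption, not something established above.

```python
import pandas as pd

# Toy frame: LoanNumber 102 appears twice (hypothetical values)
toy = pd.DataFrame({
    'LoanNumber': [101, 102, 102, 103],
    'LoanStatus': ['Current', 'Current', 'Current', 'Completed'],
})

# keep=False flags every row in a duplicated group, not just the repeats
all_dupes = toy[toy['LoanNumber'].duplicated(keep=False)]

# Keep only the first row for each LoanNumber
deduped = toy.drop_duplicates(subset='LoanNumber', keep='first')

print(all_dupes)
print(deduped)
```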


### What is the structure of your dataset?

There are 113,937 loans with 81 features each. That is a large number of features, and some are only relevant to certain periods (e.g. pre-2009 or post-July 2009). The features cover loan terms such as Term, LoanStatus and BorrowerRate, various scores and ratings, borrower details such as Occupation, EmploymentStatus, home ownership and income range, credit scores and credit-history totals, and other details specific to the loan.
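Features that only apply to certain periods show up as missing values for the remaining loans. For instance, the `describe()` output above reports 84,853 non-null values for `EstimatedLoss` against 113,937 rows. A minimal sketch of counting missing values per column, using a toy frame with made-up values in place of the real data:

```python
import numpy as np
import pandas as pd

# Toy frame: two period-specific columns and one that is always populated
toy = pd.DataFrame({
    'CreditGrade': ['A', 'B', np.nan, np.nan],
    'EstimatedLoss': [np.nan, np.nan, 0.08, 0.05],
    'Term': [36, 36, 60, 36],
})

# Number of missing values per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)

# Fraction of rows missing per column
missing_share = toy.isna().mean()

print(missing)
print(missing_share)
```

Run against the full dataframe, the same two lines would show which of the 81 columns are sparsely populated.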

### What is/are the main feature(s) of interest in your dataset?

I am most interested in finding out which features have the greatest impact on loan outcome status (the LoanStatus column).

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

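There are a number of features that I expect to be useful in investigating loan outcome status. These include the loan's term, rate and amount, the borrower's employment status, home ownership and income, and credit-history measures such as credit score range, credit lines, delinquencies and debt-to-income ratio.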

The data has a large number of columns, which makes focused analysis difficult. I will start by reducing the dataframe to only the columns that are of interest for investigating loan outcome status.

In [21]:
# Reduce the number of columns to just those that may be of interest
columns = ['Term', 'LoanStatus', 'BorrowerAPR', 'BorrowerRate', 'ListingCategory (numeric)', 'EmploymentStatus',
           'IsBorrowerHomeowner', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'CurrentCreditLines',
           'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment', 'CurrentDelinquencies',
           'AmountDelinquent', 'DelinquenciesLast7Years', 'RevolvingCreditBalance', 'BankcardUtilization',
           'DebtToIncomeRatio', 'StatedMonthlyIncome', 'LoanNumber', 'LoanOriginalAmount', 'LoanOriginationQuarter',
           'MemberKey', 'MonthlyLoanPayment']
loans = loans[columns]
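Selecting with a list of names raises a `KeyError` if any name is misspelled, which is easy to do with headings like `TotalCreditLinespast7years`. One optional safeguard, sketched here on a toy frame with a deliberately missing column, is to check the list against the dataframe before subsetting. The names `wanted`, `missing_cols` and `subset` are illustrative only.

```python
import pandas as pd

# Toy frame with a few of the original column names
toy = pd.DataFrame({'Term': [36], 'LoanStatus': ['Current'], 'Investors': [1]})

# 'BorrowerAPR' is deliberately absent from the toy frame
wanted = ['Term', 'LoanStatus', 'BorrowerAPR']

# Names requested but not present in the dataframe
missing_cols = [c for c in wanted if c not in toy.columns]
print('Missing columns:', missing_cols)

# Subset using only the names that exist; .copy() gives an independent frame
present = [c for c in wanted if c in toy.columns]
subset = toy[present].copy()
print(subset.columns.tolist())
```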

Some of the column headings are confusing or awkward to work with, so I will rename them.

In [22]:
# Rename awkward column headings, then preview the first rows
loans = loans.rename(columns={'ListingCategory (numeric)':'ListingCategory','IsBorrowerHomeowner':'HomeOwner',
                              'TotalCreditLinespast7years':'TotalCreditLines',
                              'DelinquenciesLast7Years':'TotalDelinquencies'})
loans.head()

Unnamed: 0,Term,LoanStatus,BorrowerAPR,BorrowerRate,ListingCategory,EmploymentStatus,HomeOwner,CreditScoreRangeLower,CreditScoreRangeUpper,CurrentCreditLines,...,TotalDelinquencies,RevolvingCreditBalance,BankcardUtilization,DebtToIncomeRatio,StatedMonthlyIncome,LoanNumber,LoanOriginalAmount,LoanOriginationQuarter,MemberKey,MonthlyLoanPayment
0,36,Completed,0.16516,0.158,0,Self-employed,True,640.0,659.0,5.0,...,4.0,0.0,0.0,0.17,3083.333333,19141,9425,Q3 2007,1F3E3376408759268057EDA,330.43
1,36,Current,0.12016,0.092,2,Employed,False,680.0,699.0,14.0,...,0.0,3989.0,0.21,0.18,6125.0,134815,10000,Q1 2014,1D13370546739025387B2F4,318.93
2,36,Completed,0.28269,0.275,0,Not available,False,480.0,499.0,,...,0.0,,,0.06,2083.333333,6466,3001,Q1 2007,5F7033715035555618FA612,123.32
3,36,Current,0.12528,0.0974,16,Employed,True,800.0,819.0,5.0,...,14.0,1444.0,0.04,0.15,2875.0,77296,10000,Q4 2012,9ADE356069835475068C6D2,321.45
4,36,Current,0.24614,0.2085,2,Employed,True,680.0,699.0,19.0,...,0.0,6193.0,0.81,0.26,9583.333333,102670,15000,Q3 2013,36CE356043264555721F06C,563.97
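By default, `rename` silently ignores keys that do not match an existing column, so a typo in the mapping leaves the old heading in place without any warning. A small sketch of the stricter `errors='raise'` option, using a toy frame and a deliberately misspelled key:

```python
import pandas as pd

# Toy frame with one of the original headings
toy = pd.DataFrame({'IsBorrowerHomeowner': [True, False]})

# Correct key: the column is renamed
renamed = toy.rename(columns={'IsBorrowerHomeowner': 'HomeOwner'}, errors='raise')
print(renamed.columns.tolist())

# Misspelled key ('HomeOwner' instead of 'Homeowner'): raises KeyError
try:
    toy.rename(columns={'IsBorrowerHomeOwner': 'HomeOwner'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True
print('Typo caught:', typo_caught)
```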


## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> Make sure that, after every plot or related series of plots, you
include a Markdown cell with comments about what you observed, and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
