# Data Wrangling
## Imports

In [1]:
import pandas as pd
import numpy as np

## Gather

In [2]:
df_loan_data = pd.read_csv('prosperLoanData.csv')
df_loan_data.head()

|   | ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.158 | 0.138 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 |  | 36 | Current |  | 0.12016 | 0.092 | 0.082 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.275 | 0.24 | ... | -24.2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 |  | 36 | Current |  | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 |  | 36 | Current |  | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |


In [16]:
df_loan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss

## Assess
In this first assessment step, I define which variables to keep for the data analysis and visualisation parts of this project.

In [8]:
pd.set_option('display.max_rows', 85)
df_loan_data_variable_def = pd.read_csv('ProsperLoanData_VariableDefiefinition.csv', index_col='Variable')
df_loan_data_variable_def

| Variable | Description |
|---|---|
| ListingKey | Unique key for each listing, same value as the... |
| ListingNumber | The number that uniquely identifies the listin... |
| ListingCreationDate | The date the listing was created. |
| CreditGrade | The Credit rating that was assigned at the tim... |
| Term | The length of the loan expressed in months. |
| LoanStatus | The current status of the loan: Cancelled, Ch... |
| ClosedDate | Closed date is applicable for Cancelled, Compl... |
| BorrowerAPR | The Borrower's Annual Percentage Rate (APR) fo... |
| BorrowerRate | The Borrower's interest rate for this loan. |
| LenderYield | The Lender yield on the loan. Lender yield is ... |


Here is the list of variables I wish to keep for my analysis:
- Term: The length of the loan expressed in months.
- LoanStatus: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
- BorrowerRate: The Borrower's interest rate for this loan. 
- ListingCategory: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7 - Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans.
- BorrowerState: The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
- Occupation: The Occupation selected by the Borrower at the time they created the listing.
- EmploymentStatus: The employment status of the borrower at the time they posted the listing.
- EmploymentStatusDuration: The length in months of the employment status at the time the listing was created.
- IsBorrowerHomeowner: A Borrower will be classified as a homeowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
- CreditScoreRangeLower: The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
- CreditScoreRangeUpper: The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency. 
- DelinquenciesLast7Years: Number of delinquencies in the past 7 years at the time the credit profile was pulled.
- DebtToIncomeRatio: The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
- StatedMonthlyIncome: The monthly income the borrower stated at the time the listing was created.
- LoanOriginalAmount: The origination amount of the loan.
- LoanOriginationDate: The date the loan was originated.
- Recommendations: Number of recommendations the borrower had at the time the listing was created.
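
The ListingCategory definition above stores each category as an integer code, which is hard to read in tables and plots. A minimal sketch of how those codes could be turned into labels with `Series.map` follows. The toy DataFrame and the new column name `ListingCategory` are assumptions made for illustration; only the code-to-label pairs come from the variable definition.

```python
import pandas as pd

# Code-to-label pairs copied from the ListingCategory variable definition.
listing_category_labels = {
    0: 'Not Available', 1: 'Debt Consolidation', 2: 'Home Improvement',
    3: 'Business', 4: 'Personal Loan', 5: 'Student Use', 6: 'Auto',
    7: 'Other', 8: 'Baby&Adoption', 9: 'Boat', 10: 'Cosmetic Procedure',
    11: 'Engagement Ring', 12: 'Green Loans', 13: 'Household Expenses',
    14: 'Large Purchases', 15: 'Medical/Dental', 16: 'Motorcycle',
    17: 'RV', 18: 'Taxes', 19: 'Vacation', 20: 'Wedding Loans',
}

# Toy frame standing in for the real data (values are made up).
df_example = pd.DataFrame({'ListingCategory (numeric)': [1, 7, 0, 20]})

# 'ListingCategory' is a hypothetical name for the readable column.
# Codes missing from the dictionary would become NaN.
df_example['ListingCategory'] = (
    df_example['ListingCategory (numeric)'].map(listing_category_labels)
)
print(df_example)
```

Keeping the numeric column alongside the label column leaves the original data untouched, so the mapping can be revised later without reloading the file.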

In [30]:
variables_to_analyze = ['Term',
                        'LoanStatus',
                        'BorrowerRate',
                        'ListingCategory (numeric)',
                        'BorrowerState',
                        'Occupation',
                        'EmploymentStatus',
                        'IsBorrowerHomeowner',
                        'CreditScoreRangeLower',
                        'CreditScoreRangeUpper',
                        'DelinquenciesLast7Years',
                        'StatedMonthlyIncome',
                        'LoanOriginalAmount',
                        'LoanOriginationDate',
                        'Recommendations']

In [31]:
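# Take an explicit copy so later cleaning steps modify an independent DataFrame rather than a slice of df_loan_data
df_loan_data_cleaned = df_loan_data[variables_to_analyze].copy()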
df_loan_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 15 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Term                       113937 non-null  int64  
 1   LoanStatus                 113937 non-null  object 
 2   BorrowerRate               113937 non-null  float64
 3   ListingCategory (numeric)  113937 non-null  int64  
 4   BorrowerState              108422 non-null  object 
 5   Occupation                 110349 non-null  object 
 6   EmploymentStatus           111682 non-null  object 
 7   IsBorrowerHomeowner        113937 non-null  bool   
 8   CreditScoreRangeLower      113346 non-null  float64
 9   CreditScoreRangeUpper      113346 non-null  float64
 10  DelinquenciesLast7Years    112947 non-null  float64
 11  StatedMonthlyIncome        113937 non-null  float64
 12  LoanOriginalAmount         113937 non-null  int64  
 13  LoanOriginationDate        11
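
The non-null counts in the `info()` output above show that some of the kept columns, such as BorrowerState and Occupation, contain missing values. One possible way to summarise this per column is sketched below. The toy DataFrame and the names `missing_summary`, `missing_count` and `missing_share` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for df_loan_data_cleaned (values are made up).
df_example = pd.DataFrame({
    'BorrowerState': ['CA', None, 'NY', 'TX'],
    'Occupation': ['Other', 'Professional', None, None],
    'Term': [36, 36, 60, 12],
})

# Count missing values per column and express them as a share of all rows.
missing_summary = pd.DataFrame({
    'missing_count': df_example.isna().sum(),
    'missing_share': df_example.isna().mean(),
}).sort_values('missing_count', ascending=False)

print(missing_summary)
```

Sorting by `missing_count` puts the columns that need the most attention at the top.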

In [32]:
df_loan_data_cleaned.describe()

|   | Term | BorrowerRate | ListingCategory (numeric) | CreditScoreRangeLower | CreditScoreRangeUpper | DelinquenciesLast7Years | StatedMonthlyIncome | LoanOriginalAmount | Recommendations |
|---|---|---|---|---|---|---|---|---|---|
| count | 113937.0 | 113937.0 | 113937.0 | 113346.0 | 113346.0 | 112947.0 | 113937.0 | 113937.0 | 113937.0 |
| mean | 40.830248 | 0.192764 | 2.774209 | 685.567731 | 704.567731 | 4.154984 | 5608.026 | 8337.01385 | 0.048027 |
| std | 10.436212 | 0.074818 | 3.996797 | 66.458275 | 66.458275 | 10.160216 | 7478.497 | 6245.80058 | 0.332353 |
| min | 12.0 | 0.0 | 0.0 | 0.0 | 19.0 | 0.0 | 0.0 | 1000.0 | 0.0 |
| 25% | 36.0 | 0.134 | 1.0 | 660.0 | 679.0 | 0.0 | 3200.333 | 4000.0 | 0.0 |
| 50% | 36.0 | 0.184 | 1.0 | 680.0 | 699.0 | 0.0 | 4666.667 | 6500.0 | 0.0 |
| 75% | 36.0 | 0.25 | 3.0 | 720.0 | 739.0 | 3.0 | 6825.0 | 12000.0 | 0.0 |
| max | 60.0 | 0.4975 | 20.0 | 880.0 | 899.0 | 99.0 | 1750003.0 | 35000.0 | 39.0 |
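
The summary above shows a minimum CreditScoreRangeLower of 0 and a maximum StatedMonthlyIncome of 1750003, both far from the quartile values. A rough sketch of how such rows could be counted before deciding how to handle them is given below. The two thresholds and the toy DataFrame are hypothetical, chosen only for illustration, and would need to be justified against the real data.

```python
import pandas as pd

# Toy frame standing in for df_loan_data_cleaned (values are made up,
# apart from the two extremes echoed from the describe() output).
df_example = pd.DataFrame({
    'CreditScoreRangeLower': [0.0, 660.0, 720.0, None],
    'StatedMonthlyIncome': [3200.0, 4666.0, 1750003.0, 6825.0],
})

# Hypothetical thresholds, not taken from the dataset documentation.
MIN_PLAUSIBLE_SCORE = 300
MAX_PLAUSIBLE_MONTHLY_INCOME = 100_000

# Comparisons against NaN evaluate to False, so missing scores are not flagged here.
low_score = df_example['CreditScoreRangeLower'] < MIN_PLAUSIBLE_SCORE
high_income = df_example['StatedMonthlyIncome'] > MAX_PLAUSIBLE_MONTHLY_INCOME

suspicious_counts = pd.Series({
    'low_score': int(low_score.sum()),
    'high_income': int(high_income.sum()),
})
print(suspicious_counts)
```

Counting flagged rows first, rather than dropping them immediately, makes it easier to judge whether they are data-entry errors or genuine extreme values.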
