In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Loading the Prosper loan data**

The data is a typical comma-delimited CSV file, so we can load it into memory directly with pandas.

In [2]:
# Filtering out the warnings
import warnings

warnings.filterwarnings('ignore')

In [3]:
# Importing the required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
# Load the dataset (the .csv file is comma-delimited)
loan = pd.read_csv('../input/prosper-loan/prosperLoanData.csv')

In [5]:
loan.head()

In [6]:
loan.shape

In [7]:
#We want a summary of the dataframe 
loan.info()

We can see that columns containing non-numeric symbols are stored with the object dtype, including some columns that should be integers. Next, let us find out which columns contain such symbols and which symbols they are.
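A minimal sketch of how such columns could be detected. The toy frame and its `AmountAsText` column are hypothetical stand-ins for `loan`, not columns from the Prosper data; the idea is to list the non-numeric columns and then check which of them would parse cleanly as numbers.

```python
import pandas as pd

# Hypothetical toy frame standing in for `loan`; column names are illustrative only
toy = pd.DataFrame({
    "Term": [36, 60, 36],
    "IncomeRange": ["$25,000-49,999", "$0", "$50,000-74,999"],
    "AmountAsText": ["1000", "2500", "4000"],  # numbers stored as strings
})

# Columns that pandas does not treat as numeric
non_numeric = [col for col in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[col])]

# A non-numeric column is a conversion candidate if every value parses as a number
convertible = [col for col in non_numeric
               if pd.to_numeric(toy[col], errors="coerce").notna().all()]

print(non_numeric)
print(convertible)
```

On the real data the same two comprehensions could be run against `loan` instead of `toy`.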

In [8]:
# Summary statistics for each numeric column
loan.describe()

# **What is the structure of your dataset?**

There are 113,937 loans in the dataset with 81 features. Most variables are either numeric or categorical.

The dataset features can be split into two main categories:

1) Borrower information

2) Loan performance information

# **What features in the dataset do you think will help support your investigation into your feature(s) of interest?**

The dataset contains 81 variables. Since it is tedious to explore all 81, I have picked 20 important variables whose exploration is essential. I then divided these 20 variables into 3 groups to make the analysis easier:

Loan variables: This group contains the variables Term, LoanOriginalAmount, BorrowerAPR, BorrowerRate, LenderYield, LoanStatus, ListingCategory and ListingCreationDate.

Background borrower: This group contains variables that help us analyse the economic state of borrowers, like IncomeRange, StatedMonthlyIncome, EmploymentStatus, DebtToIncomeRatio, BorrowerState and Occupation.

Other variables: This group contains variables like CreditGrade, ProsperRating and ProsperScore, which help us analyse which risk category the borrower belongs to.


# **Univariate Exploration**

## **Goal**

Analyse individual variables and see their distribution. See if any unusual points or outliers are present and fix them accordingly.

In [9]:
loans = loan.copy()

In [10]:
loans.rename(columns={'ListingCategory (numeric)': 'ListingCategory'}, inplace=True)


We can see that there are several categorical variables (object datatype) among the variables we've chosen. Let's analyse them one by one first.

## **1. Income Range**

The income range of the borrower at the time the listing was created.

In [11]:
loans.IncomeRange.value_counts()

There are two categories, ‘Not employed’ and ‘$0’, that both describe borrowers without income, so I'll merge ‘Not employed’ into ‘$0’.

In [12]:
loans.IncomeRange = loans.IncomeRange.str.replace('Not employed','$0')


In [13]:
loans.IncomeRange.value_counts()

In [14]:
plt.figure(figsize=(10,6))
# Pass the column as x= explicitly; newer seaborn releases expect it as a keyword argument
sns.countplot(x=loans.IncomeRange, color=sns.color_palette()[0]);
plt.xticks(rotation=90);
plt.title("Income Range count");

The plot shows that loans are taken mostly by employed people. There are 1427 unemployed borrowers and 7741 borrowers who have not provided their income.
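The bars are easier to compare when the income ranges appear in their natural order rather than in the order pandas happens to return them. Here is a small sketch of that idea on a hypothetical sample of labels; `income_order` is a list I define for illustration, assuming the labels match those printed by `value_counts()` above.

```python
import pandas as pd

# Hypothetical sample of IncomeRange labels (after merging 'Not employed' into '$0')
income = pd.Series(["$50,000-74,999", "$0", "$100,000+", "$1-24,999",
                    "Not displayed", "$25,000-49,999", "$75,000-99,999", "$0"])

# Natural ordering of the categories, lowest income first
income_order = ["Not displayed", "$0", "$1-24,999", "$25,000-49,999",
                "$50,000-74,999", "$75,000-99,999", "$100,000+"]

# Count each label, then arrange the counts in the natural order
counts = income.value_counts().reindex(income_order, fill_value=0)
print(counts)
```

The same list could be passed to seaborn through the `order=` parameter of `countplot` so that the chart follows this ordering.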



## **2. Loan Status**
The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.

In [15]:
loans.LoanStatus.value_counts()

In [16]:
plt.figure(figsize=(10,6))
sns.countplot(x=loans.LoanStatus, color=sns.color_palette()[0]);
plt.xticks(rotation=90);
plt.title("Loan Status count");

Most loans are either completed or currently ongoing. The large number of ongoing loans suggests that lending activity is growing.

## **3. Occupation**
The Occupation selected by the Borrower at the time they created the listing.

In [17]:
plt.figure(figsize=(18,6))
sns.countplot(x=loans.Occupation, color=sns.color_palette()[0], order=loans.Occupation.value_counts().index);
plt.xticks(rotation=90);

Most people were not comfortable sharing their occupation. Other popular occupations are Professional, Computer Programmer, Executive, Teacher, etc.
At the end of the graph there are several student categories; let's have a look at them separately.

In [18]:
stu = loans[loans.Occupation.str.contains("Student", na=False)]

In [19]:
stu.Occupation.value_counts()

In [20]:
len(stu.Occupation)/len(loans.Occupation)

We can see that about 0.6% of the borrowers are students.
Students are potential borrowers, but Prosper is not yet popular among them. The company could introduce policies that encourage them to take loans.
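The share can also be computed in one step, because the mean of a boolean mask is the fraction of `True` values. This is a sketch on a hypothetical occupation sample, not on the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical occupation sample, including one missing value
occupation = pd.Series(["Professional", "Student - College Senior", np.nan,
                        "Teacher", "Student - Community College", "Other",
                        "Executive", "Other", "Professional", "Other"])

# na=False treats missing occupations as "not a student"
is_student = occupation.str.contains("Student", na=False)

# The mean of a boolean Series is the share of True values
student_share = is_student.mean()
print(f"{student_share:.1%}")
```

Note that missing occupations stay in the denominator here, which matches dividing by `len(loans.Occupation)` above.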

## **4. Borrower State**
The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.

In [21]:
plt.figure(figsize=(18,6))

sns.countplot(x=loans.BorrowerState, color=sns.color_palette()[0], order=loans.BorrowerState.value_counts().index);

The most popular state is California, most likely because Prosper was founded there. Other popular states include Florida, New York, Texas, etc.

## **5. Listing Category**
The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans

In [23]:
#plt.figure(figsize=(18,6))

#label = [["Not Available","Debt Consolidation", "Home Improvement", "Business","Personal Loan", "Student Use", "Auto","Other", "Baby & Adoption","Boat", "Cosmetic Procedure", "Engagement Ring", "Green Loans","Household Expenses", "Large Purchases", "Medical/Dental", "Motorcycle", "RV", "Taxes", "Vacation", "Wedding Loans", "Other", "Not Applicable"]]

#ax = sns.countplot(loans.ListingCategory,color = sns.color_palette()[0]);
#ax.set_xticklabels(label, rotation='vertical', fontsize=10)
#plt.show()

The most popular reasons to take a loan are Debt Consolidation, Home Improvement, Business and Personal Loan, among several others. However, many borrowers are not comfortable sharing their reasons, which fall under 'Not Available' and 'Other'.

## **6. Employment Status**
The employment status of the borrower at the time they posted the listing.

In [24]:
plt.figure(figsize = (10,6))
sns.countplot(x=loans.EmploymentStatus, color=sns.color_palette()[0]);
plt.show()

People who are not employed or do not have a stable job rarely use Prosper, which is to be expected. Self-employed borrowers also make up a small proportion.

## **7. Credit Grade and Prosper Rating**
The Prosper Rating assigned at the time the listing was created, ranging from AA to HR. Before 2009 it was called Credit Grade; after 2009 it was called Prosper Rating.

In [25]:
fig, ax = plt.subplots(1,2 ,figsize=(25,8),sharey='row')
#  plt.subplots(2,2,)
fig.subplots_adjust(wspace=0.1)

x = sns.countplot(x=loans.CreditGrade, ax=ax[0], color=sns.color_palette()[0], order=["NA", "HR", "E", "D", "C", "B", "A", "AA"])
x.title.set_text("Credit Grade - Grade before 2009")

y = sns.countplot(x=loans['ProsperRating (Alpha)'], ax=ax[1], color=sns.color_palette()[0], order=["HR", "E", "D", "C", "B", "A", "AA"])
y.title.set_text("Prosper Rating - Grade after 2009")

plt.show()

These are the ratings provided by Prosper to its borrowers. Prosper has seven loan grades called Prosper Ratings: AA, A, B, C, D, E and HR where AA is the lowest risk down to HR which actually stands for high risk. Rates start at 5.99% for a 3-year AA loan up to 31.72% for an HR loan.

After 2009, we can see that the majority of borrowers fall in the HR to B range, i.e. the higher-risk ratings. The graphs above also show that the count of the lowest-risk category, AA, has increased under Prosper Rating.

## Now, let's analyse the numerical variables one by one -

## **1. Credit Score**
The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency.

In [26]:
loans.CreditScoreRangeLower.describe()

In [27]:
loans.CreditScoreRangeUpper.describe()

In [28]:
loans['CreditScore'] = (loans.CreditScoreRangeLower + loans.CreditScoreRangeUpper)/2

In [29]:
loans.CreditScore.describe()

The dataset contains two variables, CreditScoreRangeLower and CreditScoreRangeUpper, which give the lower and upper bounds of the borrower's credit score range as provided by a consumer credit rating agency. I combined them into a single variable, CreditScore (the midpoint of the range), for my analysis.

In [30]:
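# Assign the result back; an inplace replace on a selected column may not update `loans` in newer pandas
loans['CreditScore'] = loans['CreditScore'].fillna(loans['CreditScore'].mean())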

In [31]:
plt.figure(figsize = (10,6))
plt.hist(loans.CreditScore,45,edgecolor='black', linewidth=0.5);
plt.xlim(400,1000);

A majority of the users lie between the 600 and 800 mark, which are decent credit scores.

## **2. Stated Monthly Income**

In [32]:
# plt.hist(loans.StatedMonthlyIncome,45,edgecolor='black', linewidth=0.5);
plt.figure(figsize = (10,6))
plt.hist(loans.StatedMonthlyIncome,3000,edgecolor='black');
plt.xlim(0,40000);

In [33]:
loans.StatedMonthlyIncome.describe()

The maximum monthly income is 1750003 and the minimum is 0. From the graph it can be observed that a majority of the users lie in the 2500 to 7500 range, which makes sense: people with a very high monthly income rarely need a loan, and those with an income close to 0 are less likely to take one because they may fall into debt.
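Instead of hand-picking an axis limit, the cut-off could be derived from a quantile. The sketch below uses made-up incomes and a hypothetical helper, `trim_above_quantile`, which is my own illustration and not part of pandas. It also shows how far a single extreme value pulls the mean away from the median.

```python
import pandas as pd

def trim_above_quantile(values: pd.Series, q: float) -> pd.Series:
    """Keep only the values at or below the q-th quantile (hypothetical helper)."""
    return values[values <= values.quantile(q)]

# Made-up monthly incomes: moderate values plus one extreme entry
income = pd.Series([0, 2500, 3200, 4100, 4800, 5500, 6300, 7500, 9000, 1_750_003],
                   dtype=float)

trimmed = trim_above_quantile(income, 0.90)

# The extreme entry inflates the mean but barely affects the median
print("mean:", income.mean(), "median:", income.median())
print("largest value kept:", trimmed.max(), "rows kept:", len(trimmed))
```

On the real column a higher quantile such as 0.99 would probably be more appropriate, since only a small number of incomes are extreme.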

## **3. BorrowerAPR, BorrowerRate, LenderYield**
**BorrowerAPR** - The Borrower's Annual Percentage Rate (APR) for the loan.

**BorrowerRate** - The Borrower's interest rate for this loan.

**LenderYield** - The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee.

In [34]:
# Fill missing values with the column mean, assigning the result back to the column
for col in ['BorrowerAPR', 'BorrowerRate', 'LenderYield']:
    loans[col] = loans[col].fillna(loans[col].mean())

NaN values in these variables, if present, were replaced with the column mean.

In [35]:
fig, ax = plt.subplots(3,1 ,figsize=(10,8),sharey='row')
#  plt.subplots(2,2,)
fig.subplots_adjust(hspace= 0.5)

ax[0].hist(loans.BorrowerAPR,30,edgecolor='black', linewidth=0.5)
ax[0].set_title("Borrower APR")


ax[1].hist(loans.BorrowerRate,30,edgecolor='black', linewidth=0.5)
ax[1].set_title("Borrower Rate")

ax[2].hist(loans.LenderYield,30,edgecolor='black', linewidth=0.5)
ax[2].set_title("Lender Yield")

fig.show()

The bulk of the loans seem to have rates between 0.08 and 0.25, which is consistent with the credit rating histograms showing that the majority of users are in the middle of the risk ratings. The LenderYield and BorrowerRate plots are similar to the BorrowerAPR plot because they all represent interest rates.
The peak count is slightly lower than the one in the BorrowerAPR plot, and I think this is because of the losses incurred when borrowers default or their loans are charged off.

## **4. Loan Original Amount**

In [36]:
plt.figure(figsize = (7,5))
plt.hist(loans.LoanOriginalAmount,35,edgecolor='black');
plt.title("Loan OriginalAmount")

Most loan amounts are in the range 0 - 10000. One thing to note is that multiples of 5000 occur more often than other values, likely because people tend to choose round numbers that are easy to remember.

## **5. Listing Creation Date**


In [37]:
loans['ListingCreationDate'] = pd.to_datetime(loans['ListingCreationDate'])
loans['year'] = loans['ListingCreationDate'].dt.year

In [38]:
loans.year.value_counts()

In [39]:
plt.figure(figsize = (10,6))
sns.countplot(x=loans.year, color=sns.color_palette()[0])
plt.title("Loan creation year count");

Loan creation dropped significantly in 2009; this may be due to the financial crisis at that time.

## Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

* The CreditScore plot had an outlier at zero and some values beyond 900, so I set limits on the x-axis (plt.xlim(400, 1000)) to get a readable plot.

* The StatedMonthlyIncome plot had outliers beyond 40000, so I had to set limits on the x-axis (plt.xlim(0, 40000)) to get a readable plot.

* Also, to plot histograms of the numerical variables, I had to remove NaN values from the columns BorrowerAPR, BorrowerRate, LenderYield and CreditScore, so I replaced NaN with the respective column mean.
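One side effect of mean imputation is worth keeping in mind: it leaves the column mean unchanged but shrinks the spread, because every filled value sits exactly at the centre. A small sketch with made-up rates:

```python
import numpy as np
import pandas as pd

# Made-up interest rates with two missing entries
rate = pd.Series([0.10, 0.15, np.nan, 0.20, 0.35, np.nan])

# Replace NaN with the mean of the observed values
filled = rate.fillna(rate.mean())

print("mean before/after:", rate.mean(), filled.mean())
print("std  before/after:", rate.std(), filled.std())
```

If that shrinkage matters for a later analysis, dropping the missing rows with `dropna()` for the affected plot would be an alternative.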

# **Bivariate Exploration**

In this section I would like to see how loan amount varies with variables like Income Range, Prosper Rating and year. This would help me identify the relationship between borrower classes and their contribution to the total loan sum on Prosper.

Also, I am keen to know about the delinquent borrowers, how their number varies by year, and their reasons for getting into debt.

I'll also analyse some other relationships, like BorrowerAPR and CreditScore against Prosper Rating, as this would give me insight into borrowers' behaviour according to their rating on Prosper.com.

In [40]:
loann = loans[[ 'Term', 'LoanOriginalAmount','BorrowerAPR', 'BorrowerRate', 'LenderYield', 'LoanStatus' ,'ListingCategory', 'year' ,'IncomeRange', 'StatedMonthlyIncome', 'EmploymentStatus', 'BorrowerState','DebtToIncomeRatio' ,'Occupation', 'CreditGrade', 'ProsperRating (Alpha)', 'ProsperScore', 'CreditScore']]


In [41]:
loann.shape

## **1. Loan Amount variation with different Income Range**

In [42]:
loann.groupby('IncomeRange').LoanOriginalAmount.size().plot(kind='barh',color = sns.color_palette()[0]);
plt.xlabel("Loan count");
plt.ylabel("Income Range");
plt.title("Loan Original Amount count of each Income Range");

People in the middle income ranges, 25,000 to 100,000, take more loans, while those who really need them, people in the 0 and 1-24,999 ranges, take fewer. Let's look at one more plot, showing the sum of loan amounts for each range.

In [43]:
loann.groupby('IncomeRange').LoanOriginalAmount.sum().plot(kind='barh',color = sns.color_palette()[0]);
plt.xlabel("Loan Original Amount sum");
plt.ylabel("Income Range");
plt.title("Loan Original Amount sum of each Income Range");

This graph also shows that people in the 0 - 25,000 range are either not taking loans or are unable to get them. This may be due to the organization's minimum income requirements for granting a loan, which the low-income ranges do not easily meet.

## **2. Loan Original Amount variation with Prosper Rating**

In [44]:
base_color = sns.color_palette()[0]
plt.figure(figsize=(10,6))
sns.boxplot(data = loann, x = 'ProsperRating (Alpha)', y = 'LoanOriginalAmount', color = base_color,order =["HR", "E", "D", "C", "B","A", "AA"]);
plt.title("Loan Original Amount of each Prosper Rating Type");

The higher-risk groups took lower loan amounts, and groups C, B and A appear to have the same median loan amount. The lowest-risk group, AA, shows the highest median loan amount.

The result is as expected, as lower-risk borrowers tend to take larger amounts and vice versa.

## **3. Borrower APR variation with Prosper Rating**

In [45]:
plt.figure(figsize=(10,6))
sns.boxplot(data = loann, x = 'ProsperRating (Alpha)', y = 'BorrowerAPR', color = base_color,order =["HR", "E", "D", "C", "B","A", "AA"]);
plt.title("Borrower APR variation of each Prosper Rating Type");

BorrowerAPR is the Borrower's Annual Percentage Rate (APR) for the loan. As we move into the low-risk range, the APR drops drastically. The number of outliers also decreases along the way.

## **4. Credit Score variation with Prosper Rating**

In [46]:
plt.figure(figsize=(10,6))
sns.boxplot(data = loann, x = 'ProsperRating (Alpha)', y = 'CreditScore', color = base_color,order =["HR", "E", "D", "C", "B","A", "AA"]);


The plot suggests that Prosper Rating is directly related to Credit Score. As borrowers move into lower-risk ratings, their credit scores increase.

One thing to note here is that ratings HR and D seem to be identical in terms of IQR and median. Let's have a look at the statistics:

In [47]:
loann.groupby('ProsperRating (Alpha)').CreditScore.describe()

The median, min, max, 25% and 75% values of D and HR are the same; only the means differ slightly.

## **5. Loan Amount variation with Year**

In [48]:
plt.figure(figsize=(10,6))
sns.barplot(data = loann, x = 'year', y = 'LoanOriginalAmount', color = base_color,ci = None);

There was a decrease in 2008, after which the loan amount increased.

In [49]:
plt.figure(figsize=(10,6))
sns.violinplot(data = loann, x = 'year', y = 'LoanOriginalAmount', color = base_color,inner='quartile');

In [50]:
loann.groupby('year').LoanOriginalAmount.describe()

One thing to note in the violin plot is that until 2012 each year's plot is widest between 0 and 5000, which suggests that most loan amounts fall in this range; the quartile lines inside the plot and the actual values confirm this.

## **6. Delinquent Borrowers**

In [51]:
loann.LoanStatus.value_counts()

In [52]:
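# Statuses treated as delinquent (note that 'Cancelled' is included as well)
DELINQUENT_STATUSES = {'Chargedoff', 'Defaulted', 'Past Due (61-90 days)',
                       'Past Due (91-120 days)', 'Past Due (>120 days)', 'Cancelled'}

def Delq(status):
    """Label a loan status as 'Delinquent' or 'Good'."""
    return 'Delinquent' if status in DELINQUENT_STATUSES else 'Good'

loann = loann.copy()  # explicit copy, so adding a column does not trigger SettingWithCopyWarning
loann['BorrowerType'] = loann.LoanStatus.apply(Delq)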

In [53]:
loann.BorrowerType.value_counts()

In [54]:
plt.figure(figsize=(13,8))
sns.countplot(data = loann, x = 'year', hue = 'BorrowerType', palette = 'Blues')

One thing to see here is that the numbers of delinquent and good borrowers follow the same trend, except in 2013.
In 2013 the number of good borrowers increased drastically, while the number of delinquent borrowers decreased.
In 2014 both numbers decreased significantly.
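Raw counts are hard to compare across years with very different loan volumes, so it may help to look at the share of delinquent borrowers per year instead. A sketch with a made-up frame, assuming the same `year` and `BorrowerType` columns as `loann`:

```python
import pandas as pd

# Made-up rows with the same column names as `loann`
toy = pd.DataFrame({
    "year": [2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2013],
    "BorrowerType": ["Good", "Good", "Delinquent", "Delinquent",
                     "Good", "Good", "Good", "Good", "Delinquent"],
})

# normalize="index" turns each year's row of counts into proportions that sum to 1
share = pd.crosstab(toy["year"], toy["BorrowerType"], normalize="index")
print(share)
```

Keep in mind that loans from the most recent years have had less time to become delinquent, so their shares are not directly comparable with older years.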

## **Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?**

I have mentioned my analysis of each graph right after it.

## **Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?**

No.

## **What was the strongest relationship you found?**

The strongest relationship shown by the plots was the one between loan amount and year. This plot showed that the quantity of loans granted increased strongly between 2009 and 2014.

# **Multivariate Exploration**

In this section my main focus is to analyse the behaviour of delinquent borrowers as factors like credit score, loan amount, year, debt-to-income ratio, etc. vary.

One thing I would like to analyse is how different employment types fall into the delinquent borrower category and how their loan amounts differ from one another.

In [55]:
# Correlate the numeric columns only (numeric_only requires pandas >= 1.5)
corr = loann.corr(numeric_only=True)
# Plot figsize
fig, ax = plt.subplots(figsize=(10, 10))
# Generate color map, red & blue
colormap = sns.diverging_palette(220, 10, as_cmap=True)
# Generate heat map with annotated coefficients; seaborn takes the tick labels from corr itself
sns.heatmap(corr, cmap=colormap, annot=True, fmt=".2f", ax=ax)
# Show plot
plt.show()

From the above correlation plot, it can be concluded that no two variables are strongly correlated, other than 'BorrowerAPR', 'BorrowerRate' and 'LenderYield', which are essentially the same quantity.
These three are negatively correlated with ProsperScore, but again the correlation is not strong.
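Reading the strongest pairs off a heatmap by eye is error-prone, so here is a sketch that ranks the pairs by absolute correlation. The numbers are made up; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values; column names borrowed from the dataset for readability
toy = pd.DataFrame({
    "BorrowerRate": [0.10, 0.15, 0.20, 0.25, 0.30],
    "LenderYield":  [0.09, 0.14, 0.19, 0.24, 0.29],
    "ProsperScore": [9, 8, 6, 5, 2],
    "Term":         [36, 60, 36, 12, 36],
})

corr = toy.corr()

# Keep the upper triangle only, so each pair appears once and the diagonal is excluded
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank pairs by the absolute size of the correlation
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative relationships, such as rate against score in this toy data, near the top of the list.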

# **Prosper Data EDA Questions**

**Introduction**

1. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.

2. The most critical tool in a P2P lending organization is its ability to
assess a borrower’s creditworthiness as accurately as possible. Here, I
am going to assess the tools used, mainly Credit Grade and Prosper
Score, and see whether they are accurate in determining a person’s creditworthiness.

Action: **The presentation's three main focuses are the most critical parts of Peer-to-Peer lending, namely CreditGrade, BorrowerRate and LenderYield, along with the custom-built risk assessment tool called ProsperScore that is used to assess the creditworthiness of the borrower.**

Research Question 1 : **Which Credit Grade has the largest number of
borrowers?** 

In [56]:
sns.countplot(x='CreditGrade', data=loann)
plt.show()

In [57]:
fig = plt.figure(figsize=(12,6))
sns.countplot(data = loann,y = 'CreditGrade', order = loann["CreditGrade"].value_counts().index[:10],palette = 'magma_r')
plt.show()


As the countplots of the Credit Grades show, borrowers with a Credit Grade of C are the most numerous, followed by D.

Research Question 2 : **Since there are so many low Credit Grades such
as C and D, does this lead to a higher amount of delinquency?** 

In [58]:
loann.LoanStatus.value_counts()

I converted the loan status of each borrower into either good or delinquent.

In [59]:
loann.BorrowerType.value_counts()

This shows that borrowers who had delays or problems with their loans were labelled as delinquent, so we would expect their credit grade to be affected, showing as C or D rather than AA, A or B.
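The overall counts alone do not show how delinquency is distributed across grades. One way to check it, sketched here on made-up rows with the same column names as `loann`, is to compute the delinquent share within each Credit Grade:

```python
import pandas as pd

# Made-up rows; only the column names match `loann`
toy = pd.DataFrame({
    "CreditGrade": ["AA", "AA", "AA", "AA", "C", "C", "C", "C", "D", "D"],
    "BorrowerType": ["Good", "Good", "Good", "Delinquent",
                     "Good", "Good", "Delinquent", "Delinquent",
                     "Delinquent", "Good"],
})

# Boolean flag per loan, averaged within each grade, gives the delinquent share
delinquent_rate = (
    toy["BorrowerType"].eq("Delinquent")
    .groupby(toy["CreditGrade"])
    .mean()
)
print(delinquent_rate)
```

If the lower grades show clearly higher shares on the real data, that would support the claim that grade and delinquency go together.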

Research Question 3 : **What is the highest Borrower Rate, and where do most Borrower Rates lie?** 

In [60]:
sns.histplot(loann.BorrowerRate,edgecolor='black', linewidth=0.5, bins =15)

In [61]:
loann['BorrowerRate'].max()

In [62]:
loann['BorrowerRate'].value_counts()

From the univariate analysis of the BorrowerRate feature, we can conclude that the maximum borrower rate is 0.4975, but most borrower rates lie in the 0.13-0.35 range, with the most common values around 0.15.

Research Question 4 : **Since most Borrower Rates lie
between 0.1 and 0.2, do most Lender Yields also lie
between 0.1 and 0.2?** 

In [63]:
fig, ax = plt.subplots(3,1 ,figsize=(10,8),sharey='row')
#  plt.subplots(2,2,)
fig.subplots_adjust(hspace= 0.5)

ax[0].hist(loans.BorrowerAPR,30,edgecolor='black', linewidth=0.5)
ax[0].set_title("Borrower APR")


ax[1].hist(loans.BorrowerRate,30,edgecolor='black', linewidth=0.5)
ax[1].set_title("Borrower Rate")

ax[2].hist(loans.LenderYield,30,edgecolor='black', linewidth=0.5)
ax[2].set_title("Lender Yield")

fig.show()

Most Lender Yields seem to lie between 0.1 and 0.2, just like the Borrower Rate. The bulk of the loans have rates between 0.08 and 0.25, which is consistent with the credit rating histograms showing that the majority of users are in the middle of the risk ratings. The LenderYield and BorrowerRate plots are similar to the BorrowerAPR plot because they all represent interest rates.
The peak count is slightly lower than the one in the BorrowerAPR plot, and I think this is because of the losses incurred when borrowers default or their loans are charged off.

Research Question 5 : **Is the Credit Grade really accurate? Does a
higher Credit Grade lead to a higher Monthly Loan Payment? By
higher Credit Grade we mean Grades AA to B.** 

In [64]:
loan.MonthlyLoanPayment

In [65]:
fig = plt.figure(figsize=(12,6))
sns.barplot(x='CreditGrade',y='MonthlyLoanPayment',data=loan) 

As per the barplot, we can conclude that higher Credit Grades (AA, A and B) come with higher monthly loan payments, roughly 270-320, whereas the lower grades have much lower monthly payments.

Research Question 6 : **Here we look at the Completed Loan Status
and Defaulted Rate to determine the accuracy of Credit Grade.**

In [66]:
# Correlate the numeric columns only (numeric_only requires pandas >= 1.5)
corr = loann.corr(numeric_only=True)
# Plot figsize
fig, ax = plt.subplots(figsize=(10, 10))
# Generate color map, red & blue
colormap = sns.diverging_palette(220, 10, as_cmap=True)
# Generate heat map with annotated coefficients; seaborn takes the tick labels from corr itself
sns.heatmap(corr, cmap=colormap, annot=True, fmt=".2f", ax=ax)
# Show plot
plt.show()

Research Question 7 : **Now we know the Credit Grade is accurate
and is a tool used by the organization in determining a
person’s creditworthiness. Next we need to understand whether the
ProsperScore, the custom-built risk assessment system, is being used
in determining the borrower’s rate.**

From a theoretical standpoint, if a higher ProsperScore leads to a lower Borrower Rate and Borrower Annual Percentage Rate, that means the ProsperScore is being used alongside the Credit Grade in determining a person’s creditworthiness.
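This expectation could be checked with a simple group-wise mean and a rank correlation. The sketch below uses made-up scores and rates; the Spearman coefficient is computed as the Pearson correlation of the ranks, which needs only pandas.

```python
import pandas as pd

# Made-up scores and rates; only the column names come from the dataset
toy = pd.DataFrame({
    "ProsperScore": [2, 4, 4, 6, 8, 10],
    "BorrowerRate": [0.31, 0.27, 0.25, 0.19, 0.13, 0.08],
})

# Average rate at each score level
mean_rate_by_score = toy.groupby("ProsperScore")["BorrowerRate"].mean()

# Spearman rank correlation, computed as the Pearson correlation of the ranks
rho = toy["ProsperScore"].rank().corr(toy["BorrowerRate"].rank())

print(mean_rate_by_score)
print("Spearman rho:", round(rho, 3))
```

A clearly negative coefficient on the real data would be consistent with the score feeding into the rate, although correlation alone cannot prove that it is used in pricing.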

**Business Insight**

The most important asset of a P2P lending organization is its ability to use its tools to determine a borrower’s creditworthiness as accurately as possible. With accurate tools, the organization can more confidently market itself as a great investment for investors, leading to more borrowers, a higher market capitalization and stronger revenue growth.

# **Feature Engineering for the selected features**

In [67]:
from sklearn.preprocessing import OneHotEncoder

#creating instance of one-hot-encoder
encoder = OneHotEncoder(handle_unknown='ignore')

#perform one-hot encoding on 'LoanStatus' column 
encoder_loann = pd.DataFrame(encoder.fit_transform(loann[['LoanStatus']]).toarray())

#merge one-hot encoded columns back with original DataFrame
final_loann = loann.join(encoder_loann)

#view final df
print(final_loann)

First, we will check the original values of the LoanStatus column to get a better idea of what to name the new columns after one-hot encoding.

In [68]:
final_loann.LoanStatus.value_counts()

Checking the dataframe to see what the one-hot encoded columns look like

In [69]:
final_loann.head()

Notice that 12 new columns were added to the DataFrame since the original ‘LoanStatus’ column contained 12 unique values.

## **Drop the Original Categorical Variable**

Lastly, we can drop the original ‘LoanStatus’ variable from the DataFrame since we no longer need it:

In [70]:
final_loann.drop('LoanStatus', axis=1, inplace=True)

In [71]:
final_loann.head()

We can see that the LoanStatus column has been dropped from the dataframe.

### We could also rename the columns of the final DataFrame to make them easier to read

In [72]:
print(final_loann)

In [73]:
# Name the one-hot columns from encoder.categories_: OneHotEncoder orders its output
# columns by sorted category value, not by frequency, so hand-typed names can mislabel them
status_names = {i: 'LoanStatus-' + status for i, status in enumerate(encoder.categories_[0])}
final_loann = final_loann.rename(columns=status_names)

In [74]:
final_loann.columns

In [75]:
final_loann.head()

The one-hot encoding is complete and we can now feed this pandas DataFrame into any machine learning algorithm that we’d like.
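As an alternative sketch, `pd.get_dummies` performs the encoding and the column naming in one step, which avoids matching column names to categories by hand. The frame below is a made-up stand-in for `loann` with just two columns:

```python
import pandas as pd

# Made-up stand-in for `loann`
toy = pd.DataFrame({
    "LoanOriginalAmount": [4000, 10000, 2500, 7000],
    "LoanStatus": ["Current", "Completed", "Chargedoff", "Current"],
})

# Each category becomes its own 0/1 column, named '<prefix>-<category>';
# the original LoanStatus column is dropped automatically
encoded = pd.get_dummies(toy, columns=["LoanStatus"],
                         prefix="LoanStatus", prefix_sep="-", dtype=int)
print(encoded)
```

The scikit-learn encoder used above remains the better choice when the same encoding has to be re-applied to new data later, since it remembers the categories it was fitted on.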