# Part I - Prosper Loan Data
## by Samuel Aderemi

## Introduction
> The dataset I used for this visualization project is the ProsperLoans dataset. It comprises 81 columns and over 113,000 records.
It contains information about each loan listed with Prosper from 2005 to 2015. Some of the information it holds includes _borrower's rate, listing category, term of loan, loan status, and many more_. I will focus on exploring and visualizing the attributes of the different loan statuses listed.


## Preliminary Wrangling


In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
loans = pd.read_csv('prosperLoanData.csv')

In [None]:
loans.shape


In [None]:
loans.info()

In [None]:
loans.describe()

### What is the structure of your dataset?

> 113,937 rows and 81 columns

### What is/are the main feature(s) of interest in your dataset?

> `LoanStatus` & `ProsperRating (Alpha)`

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

<li>BorrowerRate</li>
<li>StatedMonthlyIncome</li>
<li>ListingCategory</li>
<li>EmploymentDuration</li>
<li>IncomeRange</li>
<li>TotalProsperLoans</li>


##### Some Data Exploration & Cleaning

Creating a copy of the loans dataset, restricted to the columns of interest

In [None]:
loans_sub = loans.copy()
loans_sub = loans_sub[["ListingKey", "LenderYield", "EmploymentStatus", "ListingCategory (numeric)", "StatedMonthlyIncome", "AmountDelinquent", "IncomeRange", "TotalProsperLoans", "EmploymentStatusDuration", "LoanStatus", "CreditGrade", "Term", "BorrowerRate", 'ProsperRating (Alpha)']]

In [None]:
loans_sub[loans_sub.BorrowerRate == 0]

For all entries with a zero borrower rate, the lender yield is negative
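
The observation above can be confirmed programmatically rather than by reading the table. The following is a minimal sketch on made-up values; `toy` is a hypothetical stand-in for `loans_sub`, and the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for loans_sub; the values are made up for illustration
toy = pd.DataFrame({
    "BorrowerRate": [0.00, 0.00, 0.15, 0.29],
    "LenderYield": [-0.01, -0.01, 0.14, 0.28],
})

# Keep only the zero-rate rows, then test whether every lender yield among them is negative
zero_rate = toy[toy["BorrowerRate"] == 0]
all_negative = bool((zero_rate["LenderYield"] < 0).all())

print(len(zero_rate), all_negative)
```

Running the same two lines against `loans_sub` would verify the claim on the real data.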

In [None]:
loans_sub.isna().sum()

In [None]:
loans_sub.ListingKey.duplicated().sum()

Removing Duplicated records from loans dataset

In [None]:
loans_sub = loans_sub[~(loans_sub.ListingKey.duplicated())]
loans_sub.ListingKey.duplicated().sum()
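
The boolean-mask approach above has a built-in equivalent in pandas, `DataFrame.drop_duplicates`. The sketch below, on a small made-up frame (`toy` is hypothetical), suggests that both forms keep the first occurrence of each key.

```python
import pandas as pd

# Hypothetical frame with one repeated ListingKey ("a1")
toy = pd.DataFrame({
    "ListingKey": ["a1", "b2", "a1", "c3"],
    "BorrowerRate": [0.10, 0.20, 0.10, 0.30],
})

# Mask-based removal, as used in the notebook
mask_version = toy[~toy["ListingKey"].duplicated()]

# Built-in equivalent: keep the first row seen for each ListingKey
drop_version = toy.drop_duplicates(subset="ListingKey", keep="first")

print(mask_version.equals(drop_version))
```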

Creating employment duration column in years

In [None]:
loans_sub['EmploymentStatusDuration(years)'] = loans_sub.EmploymentStatusDuration / 12

Creating Listing Category column encoded as the reason for loan

In [None]:
def rename_listing_cat_values(loans_df):
    """Map the numeric listing category codes to their descriptive labels."""

    values_dict = {0:'Not Available', 1:'Debt Consolidation', 2:'Home Improvement', 3:'Business',
                    4:'Personal Loan', 5:'Student Use', 6:'Auto', 7:'Other',
                    8:'Baby&Adoption', 9:'Boat', 10:'Cosmetic Procedure', 11:'Engagement Ring',
                    12:'Green Loans', 13:'Household Expenses', 14:'Large Purchases', 15:'Medical/Dental',
                    16:'Motorcycle', 17:'RV', 18:'Taxes', 19:'Vacation', 20:'Wedding Loans'}

    # Series.map performs a vectorised lookup, which is much faster than a
    # row-wise apply; codes that are not in the dictionary become NaN
    return loans_df['ListingCategory (numeric)'].map(values_dict)


loans_sub['ListingCategory'] = rename_listing_cat_values(loans_sub)

loans_sub.info()

Making of Ordinal Categories

In [None]:
ordinal_var_dict = {'IncomeRange': ['Not employed', 'Not displayed', '$0', '$1-24,999', '$25,000-49,999', '$50,000-74,999', '$75,000-99,999', '$100,000+'],
                    'ProsperRating (Alpha)': ['AA', 'A', 'B', 'C', 'D', 'E', 'HR'],
                    'CreditGrade': ['AA', 'A', 'B', 'C', 'D', 'E', 'HR', 'NC']
                        }

for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered=True, categories=ordinal_var_dict[var])

    loans_sub[var] = loans_sub[var].astype(ordered_var)
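
To illustrate what the ordered dtype buys, here is a minimal sketch on a made-up series (the values, including the invalid label `'ZZ'`, are hypothetical). Sorting and comparisons follow the declared risk order rather than alphabetical order, and labels outside the declared categories become missing values.

```python
import pandas as pd

rating_order = ['AA', 'A', 'B', 'C', 'D', 'E', 'HR']
rating_dtype = pd.api.types.CategoricalDtype(ordered=True, categories=rating_order)

# 'ZZ' is not a declared category, so it is converted to NaN
ratings = pd.Series(['C', 'AA', 'HR', 'B', 'ZZ']).astype(rating_dtype)

# Sorting uses the category order (AA first, HR last); NaN goes to the end
sorted_labels = ratings.sort_values().tolist()

# Ordered categoricals support comparisons against a category label
safer_than_c = (ratings < 'C').tolist()

print(sorted_labels)
print(safer_than_c)
print(ratings.isna().sum())
```

One consequence worth keeping in mind is that any value missing from `ordinal_var_dict` is silently turned into NaN by `astype`, so it can help to compare `isna().sum()` before and after the conversion.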

Making of more categorical columns

In [None]:
loans_sub.EmploymentStatus = loans_sub.EmploymentStatus.astype('category')
loans_sub.ListingCategory = loans_sub.ListingCategory.astype('category')

## Univariate Exploration

##### Who generally applies for Prosper loans, and what are their reasons?

In [None]:
order_type = loans_sub.EmploymentStatus.value_counts().index
base_color = sns.color_palette()[0]

figure = plt.figure(figsize=[12,8]);

sns.countplot(data=loans_sub, x='EmploymentStatus', color=base_color, order=order_type);

This shows that the majority of people who apply for Prosper loans are employed, i.e. the working class

In [None]:
order_type = loans_sub.ListingCategory.value_counts().index

figure = plt.figure(figsize=[12,8])

sns.countplot(data=loans_sub, y='ListingCategory', color=base_color, order=order_type)

###### More than half of all listings seek to pay off existing debts, incurring new debt in the process.

In [None]:
order_type = loans_sub.IncomeRange.cat.categories  # keep the ordinal order of the income brackets instead of sorting by count

figure = plt.figure(figsize=[12,8]);

sns.countplot(data=loans_sub, x='IncomeRange', color=base_color, order=order_type);

###### It is mostly the middle class who apply for loans

In [None]:
sns.displot(data=loans_sub, x='EmploymentStatusDuration(years)');

##### General Structure of Prosper Loans

In [None]:
sorted_ratings = loans_sub['ProsperRating (Alpha)'].value_counts()

figure = plt.figure(figsize=[10,6])

plt.pie(sorted_ratings, labels=sorted_ratings.index, startangle=90, counterclock=False)

plt.title('Prosper Rating')
plt.axis('square');
plt.legend(loc=4);

This shows that most of the loans listed with Prosper (from 2009 onwards) are of intermediate risk, i.e. between low risk (AA) and high risk (HR)

In [None]:
sorted_ratings = loans_sub['CreditGrade'].value_counts()

figure = plt.figure(figsize=[10,6])

plt.pie(sorted_ratings, labels=sorted_ratings.index, startangle=90, counterclock=False)

plt.axis('square');
plt.legend(loc=4);

In [None]:
order_type = loans_sub.LoanStatus.value_counts().index

figure = plt.figure(figsize=[12,8]);

sns.countplot(data=loans_sub, y='LoanStatus', color=base_color, order=order_type);

The good news is that more than 80% of Prosper loans are being repaid or have already been repaid
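
A share like the one quoted above can be read off with `value_counts(normalize=True)`. The sketch below uses a made-up status column (the counts are hypothetical, chosen only to make the arithmetic easy to follow), and treating `Current` plus `Completed` as "being paid" is an assumption about how the figure was derived.

```python
import pandas as pd

# Hypothetical statuses: 5 Current, 3 Completed, 1 Chargedoff, 1 Defaulted
status = pd.Series(["Current"] * 5 + ["Completed"] * 3 + ["Chargedoff", "Defaulted"])

# Fraction of loans in each status
shares = status.value_counts(normalize=True)

# Combined share of loans that are being repaid or are fully repaid
paid_or_paying = shares[["Current", "Completed"]].sum()

print(shares)
print(paid_or_paying)
```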

In [None]:
log_binsize = 0.05
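# bin edges are powers of 10, so the upper limit must use the base-10 log (np.log is the natural log)
bins = 10 ** np.arange(0, np.log10(loans_sub['StatedMonthlyIncome'].max())+log_binsize, log_binsize)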

plt.figure(figsize=[8,5])
plt.hist(data=loans_sub, x='StatedMonthlyIncome', bins=bins)
plt.xscale('log')
plt.xlim([80, loans_sub['StatedMonthlyIncome'].max()+log_binsize])
plt.xticks([100, 200, 500, 1e3, 2e3, 5e3, 1e4, 2e4, 5e4, 1e5, 2e5, 5e5, 1e6, 2e6], [100, 200, 500, '1K', '2K', '5K', '10K', '20K', '50K', '100K', '200K', '500K', '1M', '2M'])
plt.xlabel('Monthly Income ($)')
plt.show()

On the log scale, monthly income is awkwardly skewed on both sides, with a modal income near $5,000
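
The log-scaled histogram relies on bin edges that are evenly spaced in base-10 log units. The following is a minimal sketch of that construction; `max_income` is a hypothetical maximum used in place of `loans_sub['StatedMonthlyIncome'].max()`.

```python
import numpy as np

log_binsize = 0.05
max_income = 1_750_000.0  # hypothetical maximum, for illustration only

# Exponents run from 0 up to just past log10(max), in steps of log_binsize
bins = 10 ** np.arange(0, np.log10(max_income) + log_binsize, log_binsize)

# Consecutive edges differ by a constant factor, which is what makes the
# bars appear equally wide once the x-axis is log-scaled
ratios = bins[1:] / bins[:-1]

print(len(bins), bins[0], bins[-1])
```

Using the base-10 logarithm keeps the last edge just above the maximum; with the natural logarithm the exponent would be roughly 2.3 times larger, producing many empty bins far beyond the data.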

In [None]:
sns.countplot(data=loans_sub, x='TotalProsperLoans', color=base_color);
# plt.xlabel('No. of Prosper Loans')




### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Exploring and visualizing the group of people who apply for the loans, the working class are the most frequent applicants. Those who earn in the range of 25K to 75K account for the highest number of applications received. 
> <p>Prosper also receives most applications from working-class people who have spent between 0 and 10 years in employment; those with 20 years and above make far fewer applications by comparison. The unemployed and those with zero income also do not make a significant number of applications.</p>
> <p>Debt consolidation is the major driver for borrowers of Prosper loans; business intriguingly comes in 5th position.</p>
> <p>From 2009, when Prosper started using the Prosper Rating system, it assessed most of its investments as intermediate risk. The Prosper Rating pie chart shows that the lowest risk, AA, and the highest risk, HR, have the smallest chunks, while ratings E, D, A, B, C have the biggest portions in that order. The previously used grading system, Credit Grade, however, did not follow this pattern, with the lowest risk, AA, and the highest risk, HR, coming in before A.</p>
> <p>The number of previously taken loans dwindles from 1 to 8. The stated monthly income shows a degree of skewness, with the 75th percentile below 10K and the maximum value above a million. A log transformation was applied in order to make a readable plot. Earners in the range of 4K to 6K make up the largest group of applicants.</p>

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> From the 0th to the 75th percentile, the monthly income values are below 10K, with a few outliers above 100K
<li>The employment duration had to be re-encoded into years instead of months for proper scaling</li>
<li>The numeric listing category had to be converted into a categorical column holding the actual reason for applying instead of its numeric code</li>
<li>IncomeRange, ProsperRating and CreditGrade were encoded as ordered categories for clearer visualization</li>
<li>Employment status was also transformed into a categorical column</li>
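
The percentile statements about monthly income can be checked with `Series.quantile`. This is a minimal sketch on a made-up income series (the values are hypothetical and are not taken from the Prosper data).

```python
import pandas as pd

# Hypothetical monthly incomes with one extreme outlier
income = pd.Series([0, 2500, 3200, 4700, 5600, 6800, 9000, 150000])

# Quartiles, using pandas' default linear interpolation
quartiles = income.quantile([0.25, 0.5, 0.75])

# Fraction of records above 100K
share_above_100k = (income > 100_000).mean()

print(quartiles)
print(share_above_100k)
```

On the real data, `loans_sub['StatedMonthlyIncome'].quantile(0.75)` would give the value that the text describes as being below 10K.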

## Bivariate Exploration


In [None]:
loans_sub_samp = loans_sub.sample(n=2000, replace = False)


def boxgrid(x, y, **kwargs):
    """ Utilizes Seaborn's boxplot to make ploting easier"""
    default_color = sns.color_palette()[0]
    sns.boxplot(x=x, y=y, color=default_color)

plt.figure(figsize = [10, 20])
g = sns.PairGrid(data = loans_sub_samp, x_vars = ['BorrowerRate', 'EmploymentStatusDuration(years)'], y_vars = 'LoanStatus', height=6, aspect=1)
g.map(boxgrid)
plt.show();

People who have completed their payments and those currently making payments have many outliers above 20 years of employment

In [None]:
plt.figure(figsize = [20, 15])
g = sns.PairGrid(data = loans_sub, y_vars = ['StatedMonthlyIncome'], x_vars = 'LoanStatus', height=6, aspect=3)
g.map(boxgrid)
plt.yscale('log')
plt.yticks([100, 200, 500, 1e3, 2e3, 5e3, 1e4, 2e4, 5e4, 1e5, 2e5, 5e5, 1e6], [100, 200, 500, '1K', '2K', '5K', '10K', '20K', '50K', '100K',
                  '200K', '500K', '1M'])

plt.xticks(rotation=15);
plt.ylim(80, 1e6);
plt.ylabel('StatedMonthlyIncome ($)');
plt.show();

In [None]:
g = sns.PairGrid(data = loans_sub, y_vars = ['StatedMonthlyIncome'], x_vars = 'LoanStatus', height=6, aspect=3)
g.map(boxgrid)
plt.yscale('log')
plt.yticks([2e3, 3e3, 4e3, 5e3, 6e3, 1e4], ['2K', '3K', '4K', '5K', '6K', '10K'])

plt.xticks(rotation=15)
plt.ylim(2e3, 1e4)
plt.ylabel('StatedMonthlyIncome ($)')
plt.show();

Current and final payments have a median monthly income slightly above 5K. Completed loans have a median income between 4K and 5K but have outliers approaching 1M in monthly income. Cancelled loans sit just below 3K and Defaulted just below 4K. Past-due loans have a stated income that holds steady and then declines, within the range of slightly above 4K down to 3.5K

In [None]:
g = sns.PairGrid(data = loans_sub, y_vars = ['StatedMonthlyIncome'], x_vars = 'CreditGrade', height=6, aspect=3)
g.map(boxgrid)
plt.yscale('log')
plt.yticks([100, 200, 500, 1e3, 2e3, 5e3, 1e4, 2e4, 5e4, 1e5, 2e5, 5e5, 1e6], [100, 200, 500, '1K', '2K', '5K', '10K', '20K', '50K', '100K',
                  '200K', '500K', '1M'])

plt.ylim(80, 1e6)
plt.ylabel('StatedMonthlyIncome ($)')
plt.show();

The median monthly income of High Risk (HR) and No Credit (NC) investments is the lowest of the distribution, in that order, and these grades also have the lowest outliers on monthly income

In [None]:
plt.figure(figsize=[20,10])
sns.countplot(data=loans_sub, x='LoanStatus', hue='CreditGrade', palette='Blues')
plt.xticks(rotation=15);

As clearly shown, the C and HR grades show a tendency towards not paying up the loan. The Defaulted status peaks at HR loans, while the Charged-off status has HR as its penultimate peak. For the Completed loan status, however, HR has a low count, showing once again that these borrowers are more likely not to pay up than to pay. The low-risk to average-risk grades AA, A, B, C and D complete their loans more often than not. The C grade, which is of average risk, shows a neutral loan status

In [None]:
plt.figure(figsize=[20,10])
sns.countplot(data=loans_sub, x='LoanStatus', hue='ProsperRating (Alpha)', palette='Blues')
plt.xticks(rotation=15);

In [None]:
plt.figure(figsize=[20,10])
sns.countplot(data=loans_sub, x='LoanStatus', hue='ProsperRating (Alpha)', palette='Blues')
plt.xticks(rotation=15);
plt.ylim(1,400)

The loan statuses Completed, Current and Final Payment in Progress are all dominated by intermediate risks. The lowest risk, AA, has a low count in every loan status, while the riskier ratings D, E and HR are the most frequent in all other loan statuses, i.e. those that are behind on or defaulting in payments
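
Raw counts are affected by how many loans each rating has in total, so a row-normalised cross-tabulation can complement the count plots. The sketch below is an assumption about a useful follow-up rather than part of the original analysis, and `toy` is a made-up frame.

```python
import pandas as pd

# Hypothetical loans: 2 rated AA, 4 rated C, 2 rated HR
toy = pd.DataFrame({
    "ProsperRating (Alpha)": ["AA", "AA", "C", "C", "C", "C", "HR", "HR"],
    "LoanStatus": ["Completed", "Completed", "Completed", "Completed",
                   "Completed", "Defaulted", "Completed", "Defaulted"],
})

# normalize="index" turns each row into proportions, so every rating sums to 1
share = pd.crosstab(toy["ProsperRating (Alpha)"], toy["LoanStatus"], normalize="index")

print(share)
```

Read this way, each row answers "of the loans with this rating, what fraction ended in each status", which makes ratings with very different volumes directly comparable.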

##### Relationship of Loan Status, Credit Grade and Prosper Rating with Borrower Rate

In [None]:
plt.figure(figsize=[12,8])
sns.barplot(data=loans_sub, x='BorrowerRate', y='LoanStatus', color=sns.color_palette()[5])

The chart above shows that loans with a lower borrower rate have a higher chance of being paid back, and on time.
Loans with borrower rates above 0.20 are more likely to default on their payments and to run past the due payment period.

In [None]:
plt.figure(figsize=[12,8])
sns.barplot(data=loans_sub, x='CreditGrade', y='BorrowerRate', palette=sns.color_palette('viridis', 9))

The Chart shows that borrower rate is a major determinant of credit grade

In [None]:
plt.figure(figsize=[12,8])
sns.barplot(data=loans_sub, x='ProsperRating (Alpha)', y='BorrowerRate', palette=sns.color_palette('viridis', 9))

This again suggests that borrower rate is a major determinant of Prosper's rating

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> `Current` and `final payments` have a median monthly income slightly above 5K. `Completed` loans have a median income between 4K and 5K but have outliers approaching 1M in monthly income. `Cancelled` loans sit just below 3K and `Defaulted` just below 4K. `Past due` loans have a stated income that holds steady and then declines, within the range of slightly above 4K down to 3.5K.
>
> People who have completed their payments and those currently making payments have many outliers above 20 years of employment.
>
> Loans with a lower borrower rate have a higher chance of being paid back, and on time. Loans with borrower rates above 0.20 are more likely to default on their payments and to run past the due payment period.
>
> The loan statuses `Completed`, `Current` and `Final Payment in Progress` are all dominated by intermediate risks. The lowest risk, AA, has a low count in every loan status, while the riskier ratings D, E and HR are the most frequent in all other loan statuses, i.e. those that are behind on or defaulting in payments.
>
> The C and HR grades show a tendency towards not paying up the loan. The Defaulted status peaks at HR loans, while the Charged-off status has HR as its penultimate peak. For the Completed loan status, however, HR has a low count, showing once again that these borrowers are more likely not to pay up than to pay. The low-risk to average-risk grades AA, A, B, C and D complete their loans more often than not. The C grade, which is of average risk, shows a neutral loan status.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> The median monthly income of High Risk (HR) and No Credit (NC) investments is the lowest of the distribution, in that order, and these grades also have the lowest outliers of monthly income.
>
> `BorrowerRate` shows a consistent relationship with both `CreditGrade` and `ProsperRating`, with minimum to maximum borrower rates climbing from `AA` to `HR`.
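
One way to put a number on the relationship between an ordinal rating and the borrower rate is a rank correlation on the category codes. The sketch below is an assumption about how this could be quantified, not a step from the original analysis; `toy` holds made-up, perfectly monotone values, and the rank correlation is computed by ranking both columns and then taking the Pearson correlation.

```python
import pandas as pd

rating_order = ['AA', 'A', 'B', 'C', 'D', 'E', 'HR']

# Hypothetical data in which the rate rises with every step in risk
toy = pd.DataFrame({
    "ProsperRating (Alpha)": pd.Categorical(
        ['AA', 'A', 'B', 'C', 'D', 'E', 'HR'], categories=rating_order, ordered=True),
    "BorrowerRate": [0.08, 0.11, 0.15, 0.19, 0.24, 0.29, 0.32],
})

# Integer position of each rating in the declared order (AA -> 0, ..., HR -> 6)
rating_codes = toy["ProsperRating (Alpha)"].cat.codes

# Spearman-style rank correlation: Pearson correlation of the ranks
rank_corr = rating_codes.rank().corr(toy["BorrowerRate"].rank())

print(rank_corr)
```

On real data, missing ratings have code `-1` and would need to be excluded before ranking.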

## Multivariate Exploration


In [None]:
credit_grade_df = loans_sub.loc[loans_sub['LoanStatus'].isin(['Cancelled', 'Chargedoff', 'Completed', 'Defaulted'])]
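# select the numeric column before aggregating; calling .mean() on the whole frame fails on text columns in recent pandas
borrow_rate_means = credit_grade_df.groupby(['LoanStatus', 'CreditGrade'])['BorrowerRate'].mean()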
borrow_rate_means = borrow_rate_means.reset_index(name = 'BorrowRate_avg')
borrow_rate_means = borrow_rate_means.pivot(index = 'LoanStatus', columns = 'CreditGrade',
                            values = 'BorrowRate_avg')
sns.heatmap(borrow_rate_means, annot = True, fmt = '.3f',
           cbar_kws = {'label' : 'mean(borrower_rate)'})

In [None]:
plt.figure(figsize=[8,8])

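# select the numeric column before aggregating; calling .mean() on the whole frame fails on text columns in recent pandas
borrow_rate_means = loans_sub.groupby(['LoanStatus', 'ProsperRating (Alpha)'])['BorrowerRate'].mean()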
borrow_rate_means = borrow_rate_means.reset_index(name = 'BorrowRate_avg')
borrow_rate_means = borrow_rate_means.pivot(index = 'LoanStatus', columns = 'ProsperRating (Alpha)',
                            values = 'BorrowRate_avg')
sns.heatmap(borrow_rate_means, annot = True, fmt = '.3f',
           cbar_kws = {'label' : 'mean(borrower_rate)'})

The two multivariate heatmaps above show that loans predicted to be highly risky, with a high enough borrower rate, have a greater tendency of not being paid back,
while those with a low borrower rate and little to medium risk proved to be faithful loan payers
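
The group, reset and pivot steps used to build the heatmap input can also be written as a single `pivot_table` call. The following is a minimal sketch on made-up data (`toy` is hypothetical) showing the table of mean borrower rates that a heatmap would receive.

```python
import pandas as pd

# Hypothetical loans across two statuses and two grades
toy = pd.DataFrame({
    "LoanStatus": ["Completed", "Completed", "Defaulted", "Defaulted", "Completed"],
    "CreditGrade": ["AA", "HR", "AA", "HR", "AA"],
    "BorrowerRate": [0.08, 0.25, 0.10, 0.30, 0.10],
})

# One row per status, one column per grade, each cell the mean borrower rate
rate_means = toy.pivot_table(index="LoanStatus", columns="CreditGrade",
                             values="BorrowerRate", aggfunc="mean")

print(rate_means)
```

The resulting frame has the same layout as `borrow_rate_means` after the pivot, so it could be passed to `sns.heatmap` in the same way.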

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Low borrower rates account for loan statuses with minimum risk, while high borrower rates account for investments with higher risk involved

### Were there any interesting or surprising interactions between features?

> Higher mean borrower rates accounted for loan statuses which are either defaulting or defaulted, with values increasing as the risk involved increases

## Conclusions
> In exploring and visualizing this data, it was discovered that the working middle class are the major applicants for Prosper loans, usually with working experience of not more than 15 years. The major drivers for applying for these loans are debt consolidation, unstated reasons, home improvement and, finally, business. Prosper has a system of rating each loan application it receives, and its portfolio generally consists of intermediate-risk investments. Loan applications with about $5K in monthly income have a higher tendency of paying back their loans.
Further insights reveal that borrower rate is a major determinant of both loan status and Prosper rating. The lower the rate, the more likely the loan will be paid, and on time; likewise, the lower the risk score that will be assigned. The converse is also very much true.
