## Investigating Lender Effective Yield
### by Joel Hayhow

## Effective Yield

> Effective yield measures the lender's return on a loan. It is a reliable metric because it takes the compounding of interest on the loan into account. The goal of this project was therefore to take several loan attributes and work out what relationship, if any, they have to effective yield. Effective yield is calculated using the following formula:
![image.png](attachment:image.png)

Attribution: https://www.wallstreetmojo.com/effective-yield/ (Accessed 02/11/2021)
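
For readers who cannot see the image, here is a minimal sketch of the calculation. It assumes the formula pictured is the standard effective annual yield, `(1 + r/n)^n - 1`, where `r` is the nominal annual rate and `n` the number of compounding periods per year. The function name and arguments are illustrative and do not come from the original notebook.

```python
def effective_yield(nominal_rate: float, periods_per_year: int) -> float:
    """Effective annual yield for a nominal rate compounded n times a year.

    Assumed formula: (1 + r / n) ** n - 1
    """
    if periods_per_year <= 0:
        raise ValueError("periods_per_year must be positive")
    return (1 + nominal_rate / periods_per_year) ** periods_per_year - 1


# A 12% nominal rate compounded monthly returns slightly more than 12% a year.
monthly = effective_yield(0.12, 12)
print(round(monthly, 4))  # 0.1268
```

With a single compounding period per year the effective yield equals the nominal rate, which is a quick sanity check on the function.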

## Outcomes

> Key Insight 1: Effective yield per loan shows very little correlation with Prosper Loan Rating, credit score, or the total fee per loan paid by investors. This is surprising, as Prosper Rating is the rating the company itself gives each loan.

> Key Insight 2: The standard deviation of effective yield correlates negatively and strongly with both Prosper Rating and credit score.

> Key Insight 3: There is a strong positive correlation between the original amount borrowed and the monthly loan repayment. This correlation was investigated for different employment statuses. It was strongest for borrowers who were part-time or retired. There was also a strong positive correlation, but a greater spread, for full-time borrowers. The conclusion is that part-time or retired borrowers accept loans under a smaller range of conditions than those in full-time work. 


## Dataset Overview

> The dataset chosen is ProsperLoanData. It contains nearly 114,000 loans, each described by 81 variables. The variables investigated were:
1. Effective Yield (described above).
2. Prosper Rating - a numeric rating of a loan's reliability, calculated by Prosper, a company in the peer-to-peer lending industry.
3. Credit Scores - both the upper and lower credit scores for borrowers.
4. Original Loan Amount.
5. Monthly loan repayments.
6. Employment status.
7. Total investor fee - the sum of the collection fees and service fees paid by investors on a given loan.

A full description of each variable can be found here: https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit#gid=0

During the cleaning process, columns with more than half their values missing were removed. A new variable (total investor fee) was created by summing the collection and service fees. These fees were negative in the dataset but were converted to positive values for the visualisations.

This description was provided by Udacity.

The dataset was also sourced from Udacity.
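
The cleaning steps described above can be sketched as follows. This is an illustration rather than the notebook's actual cleaning code: the column names `LP_ServiceFees` and `LP_CollectionFees`, the new column name `TotalInvestorFee`, and the function name are assumptions, and the small frame at the end is synthetic.

```python
import pandas as pd


def clean_loans(df: pd.DataFrame) -> pd.DataFrame:
    """Drop sparse columns and add a positive total investor fee."""
    # Keep a column only if at most half of its values are missing.
    cleaned = df.loc[:, df.isna().mean() <= 0.5].copy()
    # Fees are stored as negative numbers: sum them, then take the magnitude.
    # (Assumed column names; adjust to the actual dataset.)
    cleaned["TotalInvestorFee"] = (
        cleaned["LP_ServiceFees"] + cleaned["LP_CollectionFees"]
    ).abs()
    return cleaned


# Synthetic example, not real loan data.
toy = pd.DataFrame({
    "LP_ServiceFees": [-1.0, -2.0, -3.0, -4.0],
    "LP_CollectionFees": [0.0, -0.5, 0.0, -1.0],
    "MostlyMissing": [None, None, None, 1.0],
    "HalfMissing": [1.0, None, None, 2.0],
})
cleaned_toy = clean_loans(toy)
```

A column with exactly half of its values missing is kept here, since the text only removes columns with *more* than half missing.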

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [None]:
# load the dataset into a pandas dataframe
dataset = pd.read_csv('prosperloanData.csv')

> Note that the above cells have been set as "Skip"-type slides. That means
that when the notebook is rendered as HTML slides, those cells won't show up.

## Distribution of Lender Yield

> The lender yield distribution was explored first. Removing outliers (any yield > 0.40%) and plotting with a relatively high resolution of 50 bins showed a bimodal distribution, with peaks around 0.15% and 0.30%. The peak at 0.30% is especially high in the histogram, where each data point is simply counted in its own bin, independently of the other data points.

![image-2.png](attachment:image-2.png)
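
The binning behind a histogram like this one can be sketched with NumPy alone. The function below is a hypothetical helper, not the notebook's plotting code: it drops missing values, discards anything above the cutoff (0.40 on the same scale as the data), and counts the remaining values in 50 equal-width bins. The sample values are synthetic.

```python
import numpy as np


def yield_histogram(yields, cutoff=0.40, bins=50):
    """Return (counts, bin_edges) for yields at or below the cutoff."""
    values = np.asarray(yields, dtype=float)
    values = values[~np.isnan(values)]      # ignore missing yields
    kept = values[values <= cutoff]         # remove outliers above the cutoff
    counts, edges = np.histogram(kept, bins=bins)
    return counts, edges


# Synthetic values chosen only to show the mechanics.
sample = [0.10, 0.15, 0.15, 0.30, 0.30, 0.30, 0.45, np.nan]
counts, edges = yield_histogram(sample)
```

Each value lands in exactly one bin, so a cluster of identical yields produces a tall, narrow bar regardless of what the rest of the data looks like.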


## Lender Yield Distribution (part 2)

> The lender yield was also plotted as a kernel density estimate (KDE). A KDE builds a smooth curve by summing a contribution from each individual data point, so every point on the curve takes the rest of the dataset into account. In the KDE, the second peak (around 0.30%) is much smaller than in the histogram. However, the distribution is still bimodal, indicating that 0.30% is nevertheless a significant peak.

![image-2.png](attachment:image-2.png)
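
To make the "sum of contributions" idea concrete, here is a hand-rolled Gaussian KDE. It is a sketch under stated assumptions: the Gaussian kernel and the fixed bandwidth are choices made for illustration (plotting libraries usually pick a bandwidth automatically), the function name is hypothetical, and the samples are synthetic.

```python
import numpy as np


def gaussian_kde_manual(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One row per grid point, one column per sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # Averaging the per-sample kernels gives a curve that integrates to 1.
    return kernels.mean(axis=1)


# Synthetic yields: a cluster near 0.15 and a single value at 0.30.
samples = [0.15, 0.15, 0.16, 0.30]
grid = np.linspace(-0.2, 0.7, 2001)
density = gaussian_kde_manual(samples, grid, bandwidth=0.02)
```

Because nearby points reinforce each other, the cluster near 0.15 produces a higher density than the isolated value at 0.30, even though each sample contributes a kernel of the same size.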

## Distribution of Prosper Rating

> The Prosper Rating (calculated loan rating) was plotted as a bar chart and roughly follows a bell curve. It ranges from 1.00 to 7.00: 1.00 is a low rating, meaning the loan is rated as high-risk, while 7.00 indicates a very low-risk loan. Most loans fall in the middle of the range, around the 4.00 rating. The distribution leans slightly towards the low-risk ratings, showing that more loans lie in the less risky half of the distribution.

![image-2.png](attachment:image-2.png)

## Effective Yield and Credit Score

> There is a clear negative correlation between the *mean* effective yield of a group and its credit score. Taken at face value, this suggests that a higher credit score leads to a lower return for the lender. However, the plot below shows that the mean is not a representative measure of return: it is the interquartile range and the overall length of the boxplots that decrease with increasing credit score. In other words, the uncertainty of the yield decreases as credit score increases. Furthermore, outliers tend to be positive for higher credit scores, indicating that these outlying loans have a very high return for the lender.

![image-2.png](attachment:image-2.png)

## The spread of the loan rating and the credit scores
> I investigated the rating distribution further by splitting loans into 7 groups, one for each Prosper Rating (1.00 - 7.00). The standard deviation (spread) of the effective yield was then calculated for each group and plotted against Prosper Rating, and a strong **negative** correlation was observed. The uncertainty of a loan's yield therefore decreases with increasing loan rating, and so its reliability increases.


![image.png](attachment:image.png)
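
The per-rating spread can be computed with a pandas `groupby`. The sketch below is not the notebook's code: the column names `ProsperRating (numeric)` and `EstimatedEffectiveYield` are assumptions about the dataset, both function names are hypothetical, and the frame is synthetic.

```python
import pandas as pd


def yield_spread_by_group(df, group_col, yield_col="EstimatedEffectiveYield"):
    """Sample standard deviation of the yield within each group."""
    return df.groupby(group_col)[yield_col].std()


def spread_correlation(spread: pd.Series) -> float:
    """Pearson correlation between the group value and its yield spread."""
    group_values = pd.Series(
        spread.index.to_numpy(dtype=float), index=spread.index
    )
    return group_values.corr(spread)


# Synthetic loans: the yield spread shrinks as the rating rises.
toy = pd.DataFrame({
    "ProsperRating (numeric)": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "EstimatedEffectiveYield": [0.0, 0.2, 0.4, 0.1, 0.2, 0.3, 0.15, 0.2, 0.25],
})
spread = yield_spread_by_group(toy, "ProsperRating (numeric)")
r = spread_correlation(spread)
```

A value of `r` close to -1 corresponds to the strong negative relationship described above.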

## And with credit score:

The same was done with credit score. The upper bound of the credit score range was used (the upper and lower bounds follow the same distribution) and the loans were divided into groups by credit score, in steps of 50. The spread of the effective yield against credit score is shown below. It also shows a strong negative correlation, but with an interesting peak in the middle. The relationship is less consistent than for Prosper Rating, suggesting that credit score by itself is a less reliable predictor.

![image.png](attachment:image.png)
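
Grouping a continuous score into bands of 50 can be done with `pd.cut`. As before, this is a sketch under assumptions: the column names `CreditScoreRangeUpper` and `EstimatedEffectiveYield` and the function name are illustrative, the bands are taken as left-closed intervals such as [600, 650), and the data is synthetic.

```python
import pandas as pd


def spread_by_score_band(df, score_col="CreditScoreRangeUpper",
                         yield_col="EstimatedEffectiveYield", width=50):
    """Standard deviation of the yield within fixed-width score bands."""
    scores = df[score_col]
    # Round the observed range outwards to multiples of `width`.
    start = int(scores.min() // width * width)
    stop = int(scores.max() // width * width + width)
    edges = list(range(start, stop + 1, width))
    bands = pd.cut(scores, bins=edges, right=False)
    # observed=True drops bands that contain no loans.
    return df.groupby(bands, observed=True)[yield_col].std()


# Synthetic loans in two score bands.
toy = pd.DataFrame({
    "CreditScoreRangeUpper": [610, 620, 630, 760, 770, 790],
    "EstimatedEffectiveYield": [0.0, 0.2, 0.4, 0.1, 0.2, 0.3],
})
band_spread = spread_by_score_band(toy)
```

Bands with very few loans give unstable standard deviations, which is one possible reason for irregular features in a plot of this kind.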

## Effective yield does not correlate - but other variables do . . . 

> Plotting effective yield against the other variables mentioned (total investor fee, original loan amount and monthly loan repayment) also showed very little correlation. However, monthly loan repayment and original loan amount turn out to be strongly correlated with each other. This correlation was investigated for each employment status. The relationship is strongest for part-time and retired borrowers. There is also a strong correlation for full-time borrowers, but a range of gradients is present. This suggests that part-time and retired borrowers accept only a narrow range of loan conditions, whereas full-time borrowers as a group accept a much wider range.

![image-2.png](attachment:image-2.png)
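
One way to put numbers on the pattern in the plot is to compute the correlation between loan amount and monthly repayment separately for each employment status. The sketch below assumes the column names `EmploymentStatus`, `LoanOriginalAmount` and `MonthlyLoanPayment`; the function name is hypothetical and the data is synthetic.

```python
import pandas as pd


def amount_payment_corr_by_status(df, status_col="EmploymentStatus",
                                  amount_col="LoanOriginalAmount",
                                  payment_col="MonthlyLoanPayment"):
    """Pearson correlation of amount vs. repayment within each status."""
    result = {
        status: group[amount_col].corr(group[payment_col])
        for status, group in df.groupby(status_col)
    }
    return pd.Series(result, name="pearson_r")


# Synthetic loans: one fixed repayment ratio for "Retired",
# several different ratios for "Full-time".
toy = pd.DataFrame({
    "EmploymentStatus": ["Retired"] * 3 + ["Full-time"] * 4,
    "LoanOriginalAmount": [1000, 2000, 3000, 1000, 2000, 3000, 4000],
    "MonthlyLoanPayment": [50, 100, 150, 100, 80, 300, 150],
})
corr_by_status = amount_payment_corr_by_status(toy)
```

A group whose loans all share one repayment gradient scores close to 1, while a group mixing several gradients scores lower even if each individual gradient is perfectly linear.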