# Prosper Loans Exploratory Analysis

## Introduction

The company Prosper acts as a middleman between investors and people who need money. It allows non-expert investors to fund loans of various sizes at various rates. Prosper offers API access to its data so users can make investments. Udacity.com, a data analytics education platform, supplied the Prosper loan data via API. The dataset contains 113937 rows and 81 columns.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

import random
random.seed(42)

In [None]:
prosper_loans_df = pd.read_csv("prosper_loan_data.csv")
pd.set_option('display.max_columns', None)  # full option name; the short form can match several options
colors = sb.color_palette("Greens_r")
prosper_loans_df.sample(4)

## Exploratory Analysis

### What does a Prosper loan look like?

We will begin by examining the features of a Prosper loan. A Prosper loan has an amount, a request type, a term in months, an interest rate, a current status, and a risk score. A loan also comes from a location and has a particular number of investors financing it.

### Number of Investors

In [None]:
prosper_loans_df.Investors.describe()

In [None]:
plt.figure(figsize=(14,12)).subplots_adjust( hspace=0.5)

plt.subplot(3,1,1)

bins = np.arange(0, 1200+1, 50)
xticks = bins
xlabels = [str(b) for b in bins]

plt.hist(data=prosper_loans_df, x="Investors", color=colors[2], bins=bins);
plt.xticks(ticks=xticks, labels=xlabels, rotation=90)

plt.title("Frequency of Investors Per Loan")
plt.xlabel("Number Of Investors")
plt.ylabel("Frequency");

plt.subplot(3,1,2)

bins = np.arange(0, 50, 1)
xticks = bins
xlabels = [str(b) for b in bins]

plt.hist(data=prosper_loans_df, x="Investors", color=colors[2], bins=bins);
plt.xticks(ticks=xticks, labels=xlabels, rotation=90)

plt.title("Frequency of Investors Per Loan")
plt.xlabel("Number Of Investors")
plt.ylabel("Frequency");
plt.xlim(0, 50)

plt.subplot(3,1,3)
bins = np.arange(2, 50, 1)
xticks = bins
xlabels = [str(b) for b in bins]

plt.hist(data=prosper_loans_df, x="Investors", color=colors[2], bins=bins);
plt.xticks(ticks=xticks, labels=xlabels, rotation=90)

plt.title("Frequency of Investors Per Loan")
plt.xlabel("Number Of Investors")
plt.ylabel("Frequency");
plt.xlim(0, 50)

The histograms struggle to convey the shape of the distribution because it is so heavily right skewed. What they do show is that the largest bucket, by far, is loans with 1 investor.
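One way to make a distribution like this easier to read is to bin on a logarithmic scale. The sketch below is not part of the original analysis: it uses synthetic, right-skewed counts rather than the Prosper data, and the helper name `log_spaced_bins` is hypothetical.

```python
import numpy as np

def log_spaced_bins(max_value, n_bins=12):
    """Return n_bins + 1 bin edges from 1 to max_value, evenly spaced on a log10 scale."""
    edges = np.logspace(0, np.log10(max_value), n_bins + 1)
    edges[-1] = max_value  # guard against floating-point undershoot at the top edge
    return edges

# Synthetic investor counts (illustrative only, not the Prosper data).
rng = np.random.default_rng(42)
investors = np.clip(np.round(rng.lognormal(mean=3.0, sigma=1.2, size=1000)), 1, 1200)

edges = log_spaced_bins(1200)
counts, _ = np.histogram(investors, bins=edges)
```

The same `edges` array could be passed as `bins=edges` to `plt.hist`, together with `plt.xscale("log")`, so that the low counts are not squeezed into a single bar.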

### Amount

A loan has an amount requested. The box plot below and the descriptive statistics show the amount for a typical loan.

In [None]:
prosper_loans_df.LoanOriginalAmount.describe()

To view outliers we can use a boxplot.

In [None]:
plt.figure(figsize=(8,8))
boxplot = sb.boxplot(data=prosper_loans_df, y="LoanOriginalAmount", color=colors[2])
plt.xlabel("Requested Loan Amount")
plt.ylabel("Amount ($)");
plt.title("Requested Loan Amount");

We can see from the descriptive statistics and the boxplot that many high-value outliers exist, giving the data a right skew.
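The points a boxplot draws as outliers are those beyond the whiskers, which by default in matplotlib and seaborn extend 1.5 times the interquartile range past the quartiles. A minimal sketch of that rule on toy amounts (the values and the helper name `iqr_fences` are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

def iqr_fences(values, whis=1.5):
    """Return the (lower, upper) limits beyond which a boxplot flags points as outliers."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# Toy loan amounts with one very large loan (illustrative only).
amounts = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 35000])
lower, upper = iqr_fences(amounts)
outliers = amounts[(amounts < lower) | (amounts > upper)]
```

Applied to `prosper_loans_df.LoanOriginalAmount`, the same function would give the count of flagged loans instead of reading it off the plot.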

A histogram below shows the frequencies of each loan amount.

In [None]:


# plt.subplot(2,2,1)
plt.figure(figsize=(10,8))
bins = np.arange(0, 35000 + 1, 2500)
xlabels = ["{:.2f}".format(b) for b in bins]

plt.hist(data=prosper_loans_df, x="LoanOriginalAmount", color=colors[2], bins=bins);
plt.xticks(ticks=bins, labels=xlabels, rotation=90)

# yticks = np.arange(0, 25000 + 1, 5000)
# ylabels = ["{:.2f}".format(tick/n_rows) for tick in yticks]
# plt.yticks(ticks=yticks, labels=ylabels)
plt.title("Frequency of Loan Amounts")
plt.xlabel("Loan Amount Requested ($)")
plt.ylabel("Frequency");

This chart shows that roughly 30,000 loans were given out for between 2,500 and 5,000 dollars. That is by far the most frequently occurring range of loan amounts.

The typical loan is between 1,000 and 7,500 dollars. There are also plenty of larger loans from 7,500 to 17,500 dollars. Loans in the right tail above that are quite rare.

The descriptive statistics are less useful here because of the heavy right skew in the data.

### Request Type

Loans can be classified in a few different ways:

In [None]:
prosper_loans_df["ListingCategory (numeric)"]
listing_categories = ["Not Available", "Debt Consolidation", "Home Improvement", "Business",
  "Personal Loan", "Student Use", "Auto", "Other", "Baby&Adoption", "Boat",
  "Cosmetic Procedure", "Engagement Ring", "Green Loans", "Household Expenses",
  "Large Purchases", "Medical/Dental", "Motorcycle",
  "RV", "Taxes", "Vacation", "Wedding Loans"]
categories_df = prosper_loans_df["ListingCategory (numeric)"].apply(lambda i: listing_categories[i]).copy()


In [None]:
plt.figure(figsize=(16,10))
sb.countplot(x=categories_df, color=colors[2], order=categories_df.value_counts().index)  # categories on the x axis, matching the labels below
plt.xticks(rotation=90, fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel("Reasons", fontsize=18)
plt.ylabel("Frequency", fontsize=18)
plt.title("Reasons for Requested Loans", fontsize=20);

In [None]:
categories_df = pd.DataFrame(categories_df).rename(columns={"ListingCategory (numeric)": "Reason"})
reason_df = categories_df.query("Reason != 'Not Available' and Reason != 'Other'")
reason_df = reason_df.Reason.value_counts() / reason_df.Reason.shape[0]
reason_df = reason_df.iloc[0:3]
reason_df["Other"] = 1 - reason_df.sum()
plt.figure(figsize=(8, 8), facecolor="#EEEEEE")
plt.pie(reason_df, labels=reason_df.index, startangle=90,
        counterclock=False, wedgeprops = {'width': 0.4},
        colors=colors,
        textprops={'fontsize': 14}
        );
plt.axis("square");
plt.title("Reasons For Loans", fontsize=16)
plt.text(x=-0.56, y = 0, s="Debt Consolidation: 68%", fontdict={"fontsize":16,
                                                               "weight": "bold"});

The majority of loans requested appear to be for debt consolidation, followed by home improvement and business.
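The donut chart's center label is typed in by hand. As a hedged sketch, the share could be computed from the data instead, so the label stays correct if the data changes. The helper `top_n_shares` and the toy labels are hypothetical.

```python
import pandas as pd

def top_n_shares(labels, n=3, exclude=("Not Available", "Other")):
    """Share of the n most common labels, with everything else grouped as 'Other'."""
    kept = labels[~labels.isin(exclude)]
    shares = kept.value_counts(normalize=True)
    top = shares.iloc[:n].copy()
    top["Other"] = 1 - top.sum()
    return top

# Toy reasons (illustrative only, not the Prosper data).
reasons = pd.Series(["Debt Consolidation"] * 6 + ["Home Improvement"] * 2
                    + ["Business", "Auto"] + ["Other"] * 3)
shares = top_n_shares(reasons, n=2)
center_label = "{}: {:.0%}".format(shares.index[0], shares.iloc[0])
```

Here `center_label` could replace the literal string passed to `plt.text`.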

### Terms

A loan's term is the number of months given to pay back the loan. Loans are offered for 1-year, 3-year, and 5-year periods.

In [None]:
from pywaffle import Waffle
from collections import OrderedDict

data = round(100 * prosper_loans_df.Term.value_counts() /  prosper_loans_df.Term.shape[0], 0)
sorted_data = OrderedDict(sorted(data.items(), key=lambda x: x[0]))
fig = plt.figure(
    FigureClass=Waffle,
    rows=10,
    columns=20,
    values=sorted_data,
    title={'label': 'Proportion of Loans By Term Length', 'loc': 'center', 'fontdict': {"fontsize": 18}},
    colors=[colors[4], colors[2], colors[0]],
    # values are already percentages, so they are printed as-is rather than divided by 100
    labels=["{} Month Term: {:.0f}% ".format(k, v) for k, v in sorted_data.items()],
    legend={'loc': 'lower left',
            'ncol': len(data),
            'bbox_to_anchor': (0, -0.15),
            'framealpha': 0,
            'fontsize':  12},
    figsize=(10,10)
)
fig.set_facecolor('#EEEEEE')
plt.show()

The waffle chart above shows the proportion of loans by term length. Color encodes the term: the lighter the color, the shorter the term.
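Percentages that are rounded independently do not always add up to exactly 100, which matters when each percentage point becomes a block. One common fix is the largest remainder method, sketched below on toy counts. This is an illustration under stated assumptions, not something the original notebook does, and `largest_remainder` is a hypothetical helper.

```python
import math

def largest_remainder(counts, total=100):
    """Allocate `total` whole blocks to categories in proportion to `counts`."""
    weight = sum(counts.values())
    exact = {k: total * v / weight for k, v in counts.items()}
    blocks = {k: math.floor(x) for k, x in exact.items()}
    leftover = total - sum(blocks.values())
    # Hand the remaining blocks to the categories with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - blocks[k], reverse=True)
    for k in by_remainder[:leftover]:
        blocks[k] += 1
    return blocks

# Toy term counts (illustrative only): exact shares are 1.6%, 87.7% and 10.7%.
term_counts = {12: 16, 36: 877, 60: 107}
blocks = largest_remainder(term_counts)
naive_total = sum(round(100 * v / 1000) for v in term_counts.values())
```

With these toy counts, plain rounding gives 101 blocks, while the allocation above always sums to 100.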

### Current Status

We can break loans down by status. Current status can be seen as the outcome of the loan. This paints a picture of the total history of loans from Prosper.

In [None]:
# rename_axis / reset_index(name=...) avoids relying on the column name that value_counts() produces
count_loan_df = (prosper_loans_df.LoanStatus.value_counts()
                 .rename_axis("Status")
                 .reset_index(name="Count"))

In [None]:
sb.countplot(data=prosper_loans_df, y="LoanStatus", color=colors[2],
            order = prosper_loans_df.LoanStatus.value_counts().index);
plt.title("Loan History")
xticks = np.arange(0, 70000 + 1, 10000)
plt.xticks(ticks=xticks)
type_counts = prosper_loans_df.LoanStatus.value_counts()
n_rows = prosper_loans_df.LoanStatus.shape[0]
for i in range(type_counts.shape[0]):
    # .iloc selects by position; the index of type_counts holds status labels, not integers
    count = type_counts.iloc[i]
    pct_str = '{:0.1f}%'.format(100*count / n_rows)
    # arguments: x position, y position, text to print
    # va="center" centers the text vertically on its bar
    plt.text(count+1, i, pct_str, va="center")

This shows that most of Prosper's loans are current, meaning it has about as many loans open right now as it has ever had. Most loans are completed without defaulting. The labels other than the top 4 seem like mistakes.
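Before discarding the less frequent labels, it may be worth checking whether some of them are finer-grained versions of one status. The sketch below assumes labels shaped like `Past Due (1-15 days)`. Both that label format and the helper `collapse_past_due` are assumptions for illustration and should be checked against `prosper_loans_df.LoanStatus.unique()`.

```python
import pandas as pd

def collapse_past_due(status):
    """Merge every label that starts with 'Past Due' into a single 'Past Due' label."""
    return status.str.replace(r"^Past Due.*$", "Past Due", regex=True)

# Toy statuses (illustrative only; the label format is an assumption).
statuses = pd.Series(["Current", "Completed", "Past Due (1-15 days)",
                      "Past Due (31-60 days)", "Chargedoff", "Current"])
grouped = collapse_past_due(statuses).value_counts()
```

Plotting the collapsed labels would leave fewer, larger bars than the original count plot.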

### Interest Rates

Put simply, the interest rate ties the investor to the borrower. The borrower pays the interest rate on the loan and the investor receives it as payment.

In [None]:
prosper_loans_df["BorrowerRate"].describe()

To view outliers we can use a boxplot.

In [None]:
plt.figure(figsize=(8,8))
boxplot = sb.boxplot(data=prosper_loans_df, y="BorrowerRate", color=colors[2])
plt.xlabel("Interest Rate")
plt.ylabel("Rate (%)");
plt.title("Interest Rate on Prosper Loans");

We can see a few outliers above 0.4, but nothing to seriously skew our results.

In [None]:
plt.figure(figsize=(15,12))
bins = np.arange(0, 0.5, 0.04)
n_rows = prosper_loans_df.BorrowerRate.shape[0]

plt.subplot(2,2,1)

plt.hist(data=prosper_loans_df, x="BorrowerRate", bins=bins, color=colors[2]);

xlabels = ["{:.2f}".format(b) for b in bins]
plt.xticks(ticks=bins, labels=xlabels, rotation=90)

yticks = np.arange(0, 25000 + 1, 5000)
ylabels = ["{:.2f}".format(tick/n_rows) for tick in yticks]
plt.yticks(ticks=yticks, labels=ylabels)
plt.title("PMF: Interest Rates")
plt.xlabel("Interest Rates")
plt.ylabel("Probability of Occurrence")

plt.subplot(2,2,2)
plt.hist(data=prosper_loans_df, x="BorrowerRate", cumulative=True,
         bins=bins, color=colors[2]);
plt.xticks(ticks=bins, labels=xlabels, rotation=90);

yticks = np.arange(0, 125000, 20000)
ylabels = ["{:.2f}".format(tick/n_rows) for tick in yticks]
plt.yticks(ticks=yticks, labels=ylabels)
plt.title("CDF: Interest Rates")
plt.xlabel("Interest Rates")
plt.ylabel("Cumulative Probability of Occurrence");

We can also see that the distribution of interest rates is close to normal, as suggested by the steep, roughly S-shaped rise in the CDF.

The most frequently occurring bucket is between 0.12 and 0.16, with a 20% probability that an interest rate falls into this bucket. This is below the 0.16 to 0.2 bucket, which contains the median of 0.18 and the mean of 0.19. This indicates a slight right skew caused by a few outliers. These outliers are better viewed in the boxplot.

This shows that the most frequently occurring cost of a loan is between 0.08 and 0.24 (8% to 24%), which is both the typical cost to the borrower and the typical return on investment.
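The bucket probabilities quoted above can also be computed directly rather than read off relabeled axes. A minimal sketch with `np.histogram` on toy rates (the values and the helper name `binned_pmf_cdf` are illustrative assumptions):

```python
import numpy as np

def binned_pmf_cdf(values, bins):
    """Return per-bin probabilities, their running total, and the bin edges."""
    counts, edges = np.histogram(values, bins=bins)
    pmf = counts / len(values)
    cdf = np.cumsum(pmf)
    return pmf, cdf, edges

# Toy interest rates as fractions (illustrative only, not the Prosper data).
rates = np.array([0.05, 0.10, 0.13, 0.14, 0.15, 0.18, 0.19, 0.22, 0.30, 0.35])
bins = np.arange(0, 0.5, 0.04)
pmf, cdf, edges = binned_pmf_cdf(rates, bins)

top = pmf.argmax()
modal_bin = (edges[top], edges[top + 1])
```

Note that `np.histogram` drops values above the last edge, so with bins that stop at 0.48 the final `cdf` value falls short of 1 whenever some rates exceed that edge.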

### Risk Scores

Risk scores are a derived measure of estimated risk, with 10 being the lowest risk and 1 being the highest. The feature is titled ProsperScore.


In [None]:
prosper_loans_df["ProsperScore"].describe()

We can view outliers with boxplots.

In [None]:
plt.figure(figsize=(8,8))
boxplot = sb.boxplot(data=prosper_loans_df, y="ProsperScore", color=colors[2])
plt.xlabel("Risk")
plt.ylabel("Risk Score");
plt.title("Estimated Risk Score on Prosper Loans");

For some reason some risk scores were above 10, which must be a mistake. Other than that, the distribution looks relatively straightforward.

Below we can see the percentage associated with each Prosper score.
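A quick way to size the problem is to count the scores that fall outside the documented 1 to 10 range and, if desired, blank them out. This is a sketch on toy values, not the original notebook's handling of these rows.

```python
import numpy as np
import pandas as pd

# Toy scores including missing and out-of-range values (illustrative only).
scores = pd.Series([1, 4, 6, 8, 10, 11, np.nan, 11])

in_range = scores.between(1, 10)            # NaN compares as False
n_out_of_range = int((scores.notna() & ~in_range).sum())
cleaned = scores.where(in_range)            # out-of-range values become NaN
```

Whether such rows should be dropped, capped at 10, or kept is a judgment call that depends on how the score is defined upstream.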

In [None]:
plt.figure(figsize=(8,8))
base_color = sb.color_palette()[0]
sb.countplot(data=prosper_loans_df, x="ProsperScore", color=colors[2]);
plt.title("Prosper Loan Estimated Risk Scores");
plt.xlabel("Risk Score")
plt.ylabel("Frequency");

n_rows = prosper_loans_df.shape[0]
# Count loans per score, sorted by score so positions match the bars from left to right
type_counts = prosper_loans_df.ProsperScore.value_counts().sort_index()

for i, count in enumerate(type_counts):
    # Share of all loans, including loans without a score
    pct_str = '{:.2f}%'.format(100 * count / n_rows)
    # arguments: x position, y position (just above the bar), text to print
    plt.text(x=i-0.35, y=count + 100, s=pct_str)

The bulk of the data seems to be between 4 and 8, and 4 is the most frequently occurring value. Even so, the risk scores seem to closely follow a normal distribution around the value 6, which is both the median and the mean.

## Who Is Borrowing Money

What is the typical profile of a person borrowing money? These people come from particular states, have a particular income, and have a particular credit score.

### Income Level

Loans are given to people who have a particular reported monthly income.

In [None]:
prosper_loans_df.StatedMonthlyIncome.describe()

In [None]:
plt.figure(figsize=(16,10))
bins = np.arange(0, 20000+1, 1000) 
xlabels = [str(b) for b in bins]
prosper_loans_df.StatedMonthlyIncome.hist(color=colors[2], bins=bins)
plt.xticks(ticks=bins, labels=xlabels, rotation=90)
plt.xlim(0, 20000)
plt.title("Monthly Incomes")
plt.xlabel("Monthly Income ($)")
plt.ylabel("Frequency");

Income levels have a right skew, with the majority between 2,000 and 6,000 dollars a month. That corresponds to yearly incomes between 24,000 and 72,000 dollars, which is to be expected. The right edge of the chart is 20,000 dollars a month, which comes out to 240,000 dollars a year. The median and mean are roughly 4,600 and 5,600 dollars per month. The mean is greater than the median, indicating a right skew.
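The mean-versus-median comparison can be checked numerically. In the sketch below, a single large income pulls the mean above the median and produces a positive sample skewness. The incomes are toy values, not the Prosper data.

```python
import pandas as pd

# Toy monthly incomes with one high earner (illustrative only).
incomes = pd.Series([2000, 3000, 3500, 4000, 4500, 5000, 6000, 20000])

summary = {
    "mean": incomes.mean(),
    "median": incomes.median(),
    "skew": incomes.skew(),   # positive for a right-skewed sample
}
annual = incomes * 12         # monthly income converted to yearly income
```

The same three statistics on `prosper_loans_df.StatedMonthlyIncome` would put a number on the skew described above.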

### Credit Score

What are the credit scores of the people who are given loans?

In [None]:
plt.figure(figsize=(16,10))
bins = np.arange(400, 1000, 25)
labels = [str(b) for b in bins]
avg_credit_score = round((prosper_loans_df.CreditScoreRangeLower + prosper_loans_df.CreditScoreRangeUpper)/ 2, 0)
avg_credit_score.hist(color=colors[2], bins=bins)
plt.xticks(ticks=bins, labels=labels, rotation=90)
plt.title("Credit Scores")
plt.xlabel("Credit Score")
plt.ylabel("Frequency")
plt.xlim(400, 1000)

650 to 675 is the most common credit score range, with scores from 650 to 775 making up the majority of the distribution.
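The most common range can be read from the data rather than the chart by binning the midpoints with `pd.cut`. The sketch below uses toy score ranges and assumes each range spans 20 points (upper = lower + 19), which should be verified against the dataset.

```python
import numpy as np
import pandas as pd

# Toy lower bounds of credit score ranges (illustrative only).
lower = pd.Series([640, 660, 660, 680, 700, 720, 660])
upper = lower + 19                      # assumption: each range spans 20 points
midpoint = (lower + upper) / 2

bins = np.arange(400, 1000, 25)
binned = pd.cut(midpoint, bins=bins, right=False)   # intervals like [650, 675)
modal_bin = binned.value_counts().idxmax()
```

Under that assumption, rounding a midpoint such as 649.5 to a whole number before binning moves it up to 650 and into the next bin, so this sketch bins the unrounded midpoints.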

In [None]:
avg_credit_score.describe()

### Loan Locations

Loans are given to people in various states.

California has the highest number of loans, roughly double that of the next most frequent states: New York, Texas, and Florida. California receives 14,171 while the other three receive 6,700 to 6,900 each. This shows where most of these loans are being borrowed from.

The distribution is also right skewed, with most states in the range of 0 to 500 loans. That is why there is a lot of white on the state map.

In [None]:
import plotly.express as px

states = prosper_loans_df[prosper_loans_df["BorrowerState"].notna()]["BorrowerState"].unique()

# Build a State / "Amount Of Loans" table without relying on the column name value_counts() produces
data = (prosper_loans_df.BorrowerState.value_counts()
        .rename_axis("State")
        .reset_index(name="Amount Of Loans"))
fig = px.choropleth(data, locations='State',
                    locationmode="USA-states",
                    color="Amount Of Loans",
                    color_continuous_scale="Greens",
                    scope="usa")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

data.describe()

bins = np.arange(0, 8000, 500)
xlabels = [str(b) for b in bins]
plt.figure(figsize=(16, 8))
plt.hist(data["Amount Of Loans"], color=colors[2], bins=bins);
plt.xlim(0, 8000)
plt.xticks(ticks=bins, labels=xlabels);
plt.xlabel("Loans Given In A State")
plt.ylabel("Frequency of States")
plt.title("Prosper Loans Given Per State");

### Estimated Return Distribution

In [None]:
plt.figure(figsize=(16, 10))
bins = np.arange(-0.1, 0.3, 0.01)
labels = ["{:.2f}".format(b) for b in bins]
prosper_loans_df.EstimatedReturn.hist(color=colors[2], bins=bins)
plt.xticks(ticks=bins, labels=labels, rotation=90);

The distribution of returns is about normal, with the most frequently occurring return around 0.07 or 0.08.

## Bivariate/Multivariate Plots

### Question 1: What is the relationship of risk score and state?
There does not seem to be much of a relationship. The state-level scores seem to follow the overall mean and median.

In [None]:
states = prosper_loans_df[prosper_loans_df["BorrowerState"].notna()]["BorrowerState"].unique()

cols_to_examine = ["ProsperScore", "BorrowerState"]
data = prosper_loans_df[cols_to_examine].groupby("BorrowerState").mean()
data.describe()
# data.index.name = "State"
# data = data.rename(columns={"BorrowerState": "Amount Of Loans"}).reset_index()
# data
# fig = px.choropleth(data, locations='State',
#                     locationmode="USA-states",
#                     color="Amount Of Loans",
#                     color_continuous_scale="Greens",
#                     scope="usa")
# fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# fig.show()

### Question 2: What is the interest rate by state?

In [None]:
states = prosper_loans_df[prosper_loans_df["BorrowerState"].notna()]["BorrowerState"].unique()

cols_to_examine = ["BorrowerRate", "BorrowerState"]
data = prosper_loans_df[cols_to_examine].groupby("BorrowerState").mean()
data.describe()

Conclusion: Relationships by state aren't very interesting. The state averages have a very narrow distribution, which means the state offers little useful predictive power.
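One way to quantify "narrow" is to compare the spread of the state averages with the spread of individual loans. The sketch below does this on toy data. The numbers are made up, and the ratio is only a rough indicator, not a formal test.

```python
import pandas as pd

# Toy loans (illustrative only, not the Prosper data).
loans = pd.DataFrame({
    "BorrowerState": ["CA", "CA", "NY", "NY", "TX", "TX"],
    "BorrowerRate": [0.10, 0.30, 0.12, 0.30, 0.15, 0.25],
})

state_means = loans.groupby("BorrowerState")["BorrowerRate"].mean()
between_state_std = state_means.std()        # spread of the state averages
overall_std = loans["BorrowerRate"].std()    # spread of individual loans
spread_ratio = between_state_std / overall_std
```

A ratio close to zero means that knowing the state says little about an individual loan's rate.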

### Question 3: Is there a relationship between time and interest rate, APR, and estimated return?

In [None]:
month_year_df = prosper_loans_df.copy()
from datetime import datetime
format_str = '%Y-%m-%d %H:%M:%S'

month_year_df["month"] = month_year_df.ListingCreationDate.apply(lambda x: datetime.strptime(x.split('.')[0], format_str).month).astype(int)
month_year_df["year"] = month_year_df.ListingCreationDate.apply(lambda x: datetime.strptime(x.split('.')[0], format_str).year).astype(int)

In [None]:
cols_of_interest = ["BorrowerRate", "BorrowerAPR", "EstimatedReturn", "month", "year"]
costs_by_month_year = month_year_df[cols_of_interest].groupby(["year", "month"]).mean().dropna()

In [None]:
costs_by_month_year.plot(kind="line", figsize=(16,10), color=[colors[0], colors[2], colors[4]])  # one distinct shade per line
plt.ylabel("Rates (% of loan)")
plt.title("Returns By Rates");

This is a much more useful direction, because it lets us benchmark the returns of this kind of investment over time.
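An alternative to separate `year` and `month` columns is to parse the timestamps once and resample by month, which gives a real date axis for benchmarking. This is a sketch on toy rows. The timestamp strings are simplified, and the real `ListingCreationDate` values may need their format checked before parsing.

```python
import pandas as pd

# Toy listings (illustrative only, not the Prosper data).
listings = pd.DataFrame({
    "ListingCreationDate": ["2013-01-05 10:00:00", "2013-01-20 12:30:00",
                            "2013-02-03 09:15:00"],
    "BorrowerRate": [0.20, 0.10, 0.18],
})
listings["ListingCreationDate"] = pd.to_datetime(listings["ListingCreationDate"])

# "MS" groups rows by calendar month and labels each group with the month start.
monthly = (listings.set_index("ListingCreationDate")["BorrowerRate"]
           .resample("MS")
           .mean())
```

The resulting series is indexed by month start dates, so `monthly.plot()` would label the x axis with dates rather than `(year, month)` tuples.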

### Question 4: What are the risk scores associated with the probability of default and the associated estimated return?

First let's find the risk score associated with an estimated return from a correlation plot.

We can plot the correlation between the estimated return and the risk score, then color the points by whether the loan defaulted.

In [None]:
def build_query(arr):
    outcome_arr = []
    for elem in arr:
        outcome_arr.append("LoanStatus == '{}'".format(elem))
    return ' or '.join(outcome_arr)

assert build_query(['Completed', 'Current']) == "LoanStatus == 'Completed' or LoanStatus == 'Current'"

basic_outcomes = prosper_loans_df[["ProsperScore", "EstimatedReturn", "LoanStatus", "Term"]].copy()
 
query_str = build_query(["Completed", "Defaulted", "Chargedoff", "FinalPaymentInProgress"])
basic_outcomes = basic_outcomes.query(query_str).dropna().reset_index(drop=True)
basic_outcomes.LoanStatus = basic_outcomes.LoanStatus.replace({"Completed": 1,
                                                               "FinalPaymentInProgress": 1,
                                                               "Defaulted": 0, "Chargedoff": 0})
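# random.seed() does not affect pandas sampling, so fix the seed here for a reproducible sample
data_sample = basic_outcomes.sample(500, random_state=42)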

In [None]:
data_sample.LoanStatus.mean()

In [None]:
check_term_default_rate = data_sample[["Term", "LoanStatus", "EstimatedReturn"]].groupby(["Term"]).mean()
check_term_default_rate

In [None]:
check_risk_default_rate = data_sample[["ProsperScore", "LoanStatus", "EstimatedReturn"]].groupby("ProsperScore").mean()
check_risk_default_rate

In [None]:
check_term_by_risk = data_sample[["Term", "ProsperScore", "LoanStatus"]].groupby(["Term"]).mean()
check_term_by_risk

Even though Term is not a big determinant of the risk score, it is a huge factor in the default rate.

In [None]:
success_color, failure_color = sb.color_palette("colorblind")[2], sb.color_palette("colorblind")[3]

plt.figure(figsize=(16, 12))
sb.regplot(data=data_sample.query("LoanStatus == 1"),
            x="ProsperScore", y="EstimatedReturn",
           x_jitter=0.4,
          scatter_kws={'alpha': 0.45},
           marker='o',
          ci=None,
          color=success_color)
sb.regplot(data=data_sample.query("LoanStatus == 0"),
            x="ProsperScore", y="EstimatedReturn",
           x_jitter=0.4,
          scatter_kws={'alpha': 0.45},
          ci=None,
           color=failure_color,
           marker='x')
plt.title("Prosper Score By Estimated Return")
corr = data_sample[["EstimatedReturn", "ProsperScore"]].corr().iloc[0].ProsperScore
corr_label = 'Correlation: {:.4f}'.format(corr)
plt.text(y=-0.12, x=7.5, s=corr_label, fontsize=14);

In [None]:
import statsmodels.api as sm

data_sample['intercept'] = 1
# First is y then a list of x variables
lm = sm.OLS(data_sample['EstimatedReturn'], data_sample[['intercept', 'ProsperScore']])
results = lm.fit()
results.summary()


In [None]:
lm = sm.OLS(data_sample['EstimatedReturn'], data_sample[['intercept', 'Term']])
results = lm.fit()
results.summary()

Results: With statistical significance, a 1 point increase in Prosper score is associated with a 0.0034 decrease in estimated return. A limitation is that the model explains only about 4% of the variation in estimated return, but the relationship is still statistically significant and practically important.

This shows us that even though there is a clear relationship between the two, defaults occur at every score level and are associated with only a slightly higher return.
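To make the two reported quantities concrete, the sketch below fits a one-variable least-squares line with `np.polyfit` and computes R² by hand. The data are synthetic: the slope of -0.0034 is built in, and the noise pattern is chosen so that it leaves the fitted slope unchanged. The helper `simple_ols` is hypothetical and is not the statsmodels fit used above.

```python
import numpy as np

def simple_ols(x, y):
    """Fit y = intercept + slope * x and return (slope, intercept, r_squared)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    ss_res = np.sum((y - fitted) ** 2)       # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y
    return slope, intercept, 1 - ss_res / ss_tot

# Synthetic scores and returns (illustrative only, not the Prosper data).
scores = np.array([1, 2, 3, 4, 5, 6, 7, 8])
noise = 0.04 * np.array([1, -1, -1, 1, 1, -1, -1, 1])
returns = 0.12 - 0.0034 * scores + noise

slope, intercept, r2 = simple_ols(scores, returns)
```

In this toy setup the slope is recovered while R² stays below 0.04, which illustrates how a coefficient can be well determined even when most of the variation in returns is left unexplained.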

In [None]:
data_sample.sample(2)

In [None]:
success_color, failure_color = sb.color_palette("colorblind")[2], sb.color_palette("colorblind")[3]

plt.figure(figsize=(16, 12))
sb.regplot(data=data_sample.query("Term == 12"),
            x="ProsperScore", y="EstimatedReturn",
           x_jitter=0.4,
          scatter_kws={'alpha': 0.45},
          ci=None,
          color="green")
sb.regplot(data=data_sample.query("Term == 36"),
            x="ProsperScore", y="EstimatedReturn",
           x_jitter=0.4,
          scatter_kws={'alpha': 0.45},
          ci=None,
          color="blue")
sb.regplot(data=data_sample.query("Term == 60"),
            x="ProsperScore", y="EstimatedReturn",
           x_jitter=0.4,
          scatter_kws={'alpha': 0.45},
          ci=None,
          color="red")

plt.title("Prosper Score By Estimated Return")
corr = data_sample[["EstimatedReturn", "ProsperScore"]].corr().iloc[0].ProsperScore
corr_label = 'Correlation: {:.4f}'.format(corr)
plt.text(y=-0.12, x=7.5, s=corr_label, fontsize=14);

In [None]:

data_sample_2 = basic_outcomes[["ProsperScore", "LoanStatus", "EstimatedReturn"]].copy()
data_sample_2.LoanStatus = 1 - data_sample_2.LoanStatus
data_sample_2 = data_sample_2.groupby("ProsperScore").mean()
data_sample_2.plot(kind="line", figsize=(16,10))
plt.title("Probability of Default V. Estimated Return");
plt.xlabel("Prosper Score")
plt.ylabel("Percentage (%)")

### Default Risk By Terms

In [None]:
data_sample_2 = data_sample[["Term", "LoanStatus", "EstimatedReturn"]].copy()
data_sample_2.LoanStatus = 1 - data_sample_2.LoanStatus
data_sample_2 = data_sample_2.groupby("Term").mean()
data_sample_2.plot(kind="line", figsize=(16,10))
plt.title("Probability of Default V. Estimated Return");
plt.xlabel("Term")
plt.ylabel("Percentage (%)")

The chart above shows that longer-term loans are riskier but lead to a higher return.

### Across different risk scores, what percentage of the loans have defaulted?

In [None]:
data_sample_2

In [None]:
# Start again from data_sample: the grouped-by-Term frame above no longer has a ProsperScore column
data_sample_2 = data_sample[["ProsperScore", "LoanStatus", "EstimatedReturn"]]
data_sample_2 = data_sample_2.groupby("ProsperScore").mean()
data_sample_2.plot(kind="line", figsize=(16,10))
plt.title("Probability of Not Default V. Estimated Return");
plt.xlabel("Risk Score")
plt.ylabel("Percentage (%)")

The chart above shows that the estimated return on a loan and the probability of default don't have the clearest relationship. Therefore, by investing in loans with high Prosper scores you can get a significantly higher probability of being paid back without much difference in estimated return.

## Data Story: Direction


Below is roughly the data story that would be told given the above exploratory analysis:

- Key takeaways
    - Prosper loans are an investment option not that many people are aware of
    - I can describe the investment option
        - Key factors are:
            - Describe what the loans are:
                - Show the states they are coming from, the percentage of what they are funding, the typical amounts of the loans, the number of investors for each loan
                    - Where:
                        - States: Show states map
                    - What:
                        - What they are funding: Donut graph by the top 3 types and then all others
                    - Who:
                        - Income distribution: histogram (PMF by CDF)
                        - investors: bucketed in a few categories, waffle chart
                    - How long:
                        - terms of loan in waffle chart
                    - Risk Level:
                        - Show distribution of risk
                        - Show a probability of default 
            - APR, Interest Rate, and Estimated Return
                - Show distributions of estimated return, APR and interest rates (all have a similar distribution)
                    - Show a faceted histogram to show all three 

                - Investment strategy:
                    - Over time:
                        - Show that APR and estimated return fluctuate with the interest rate, which is predictable (determined at time of investment): show time series graph of APR, interest, and return
                    - Focus on Term and Risk Level as key determinants in what to expect as an investor