# Analysis of loan outcome

- [Failure rate](#Failure-rate)

    - The dependence of failure rate on loan grade and sub-grade is qualitatively
    similar to the dependence of interest rate on these features.

        - The failure rate increases steadily with poorer loan grade.

        - For high-grade loans, the failure rate increases steadily with poorer loan
        sub-grade.

        - For low-grade loans, the dependence of failure rate on sub-grade is complex.

    - The variation in interest rate as a function of loan grade and sub-grade is much
    smaller than the variation in failure rate.

    - For high-grade loans, loans with a shorter term are slightly more likely to fail,
    while for low-grade loans, the opposite is true.

- [Duration](#Duration)

    - Most loans have a duration much shorter than the loan term.  Mean loan durations:

        - 20.6 months for 36-month loans.

        - 20.7 months for 60-month loans.

    - Including only loans that are fully paid, and excluding loans that are in default
    or have been charged off, does not significantly change the mean loan durations.

    - Among loans that are fully paid, the loan duration tends to decrease for poorer
    loan grades.  Mean loan durations:

        - 22.5 months for loans of grade 'A'.

        - 17.5 months for loans of grade 'G'.

- [Profit](#Profit)

- [Estimate of annualized return](#Estimate-of-annualized-return)

In [None]:
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from IPython.display import display
from matplotlib.ticker import PercentFormatter

import notebook_tools.database as db
from notebook_tools.derived_features import (
    get_annualized_return,
    get_duration,
    get_profit,
)
from notebook_tools.feature_exploration import (
    get_group_sizes,
    get_value_counts,
    style_value_counts,
)

In [None]:
loan_data = db.get_loan_data()
loan_metadata = db.get_loan_metadata()

In [None]:
loan_data["term"] = loan_data["term"].map(lambda n: str(n) + " months")

In [None]:
loan_status_counts = get_value_counts(loan_data["loan_status"])
display(style_value_counts(loan_status_counts))

## Failure rate

As explained on LendingClub's [site](https://www.lendingclub.com/help/investing-faq/what-do-the-different-note-statuses-mean),
the first two tranches of loan delinquency are:

- Late (16-30 days)
- Late (31-120 days)

The next two stages are:

- Default
- Charged off

The loan status becomes "Default" when the loan has not been current for more than 120
days. A loan status of "Charged off" indicates that LendingClub "no longer reasonably
expect\[s\] further payments." Normally this occurs within 30 days of the time the loan
enters "Default" status.

What percentage of loans go beyond delinquency?

In [None]:
# A loan counts as failed if it is in default or has been charged off.
loan_data["loan_failed"] = loan_data["loan_status"].isin(["Default", "Charged Off"])

In [None]:
to_plot = loan_data[["grade", "loan_failed"]].groupby("grade").mean().reset_index()
fig = px.bar(
    to_plot,
    x="grade",
    y="loan_failed",
    labels={"grade": "Loan grade", "loan_failed": "Percent failed"},
    title="Percentage of failed loans (default or charged off) by loan grade",
    hover_data={"loan_failed": ":.3p"},
)
fig.update_yaxes(tickformat=".0%")
fig.show()

In [None]:
to_plot = (
    loan_data[["grade", "sub_grade", "loan_failed"]]
    .groupby(["grade", "sub_grade"])
    .mean()
    .reset_index()
)
fig = px.bar(
    to_plot,
    x="sub_grade",
    y="loan_failed",
    color="grade",
    labels={
        "grade": "Loan grade",
        "sub_grade": "Loan sub-grade",
        "loan_failed": "Percent failed",
    },
    title=(
        "Percentage of failed loans (default or charged off) "
        "by loan grade and sub-grade"
    ),
    hover_data={"loan_failed": ":.3p"},
)
fig.update_yaxes(tickformat=".0%")
fig.show()

The [notebook that analyzes interest rate](./analysis-01.html) shows the following:

- Interest rate increases systematically with poorer loan grade.
- For loans with a high grade, the interest rate varies systematically with loan
sub-grade.
- For loans with a low grade, the dependence of interest rate on sub-grade is complex.

The two charts above show that similar conclusions hold if "interest rate" is replaced
by "percentage of loan failures."

Note, however, that the variation in interest rates is distinctly smaller than the
variation in the percentage of loan failures. The next two cells compare these two
ranges.

In [None]:
# Use descriptive names to avoid shadowing the built-in min() and max().
min_rate = loan_data["int_rate"].min()
max_rate = loan_data["int_rate"].max()
print(
    'The minimum and maximum values of "int_rate" are '
    f"{min_rate/100:.2%} and {max_rate/100:.2%}, respectively."
)
print(f"The ratio of maximum interest rate to minimum is {max_rate/min_rate:.1f}.")

In [None]:
# Use descriptive names to avoid shadowing the built-in min() and max().
min_failed = to_plot["loan_failed"].min()
max_failed = to_plot["loan_failed"].max()
print(
    "The minimum and maximum percentage of failed loans are "
    f"{min_failed:.2%} and {max_failed:.2%}, respectively."
)
print(f"The ratio of maximum failure rate to minimum is {max_failed/min_failed:.1f}.")

In [None]:
to_plot = loan_data[["term", "loan_failed"]].groupby("term").mean().reset_index()
fig = px.bar(
    to_plot,
    x="term",
    y="loan_failed",
    labels={"term": "Loan term", "loan_failed": "Percent failed"},
    title="Percentage of failed loans by loan term",
    hover_data={"loan_failed": ":.3p"},
)
fig.update_yaxes(tickformat=".0%")
fig.show()

The [notebook that explores correlations](./correlations-01.html) shows that the loan
term is strongly correlated with loan grade, tending to increase with poorer loan grade.
This correlation is the underlying cause of the dependence of failure rate on loan term
shown in the previous plot.
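
The mechanism can be illustrated with a toy example. The counts below are invented for illustration and are not taken from the loan dataset: within each grade the failure rate is the same for both terms, yet the aggregate failure rate by term differs sharply because most 60-month loans are low-grade.

```python
import pandas as pd

# Toy counts, invented for illustration only (not from the loan dataset).
# Within each grade, both terms have the same failure rate: 5% for "A", 40% for "G".
toy = pd.DataFrame(
    {
        "grade": ["A", "A", "G", "G"],
        "term": ["36 months", "60 months", "36 months", "60 months"],
        "n_loans": [900, 100, 100, 900],
        "n_failed": [45, 5, 40, 360],
    }
)

# Failure rate within each (grade, term) group.
by_grade_term = toy.assign(rate=toy["n_failed"] / toy["n_loans"])

# Aggregate failure rate by term, ignoring grade.
by_term = toy.groupby("term")[["n_loans", "n_failed"]].sum()
by_term["rate"] = by_term["n_failed"] / by_term["n_loans"]

print(by_grade_term)
print(by_term)
```

In this toy data the aggregate rates are 8.5% for 36-month loans and 36.5% for 60-month loans, even though term has no effect within a grade. Breaking the failure rate down by both grade and term separates the two effects.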

In [None]:
to_plot = (
    loan_data[["grade", "term", "loan_failed"]]
    .groupby(["grade", "term"])
    .mean()
    .reset_index()
)
fig = px.bar(
    to_plot,
    x="grade",
    y="loan_failed",
    color="term",
    barmode="group",
    labels={
        "grade": "Loan grade",
        "loan_failed": "Percent failed",
        "term": "Loan term",
    },
    title="Percentage of failed loans by loan grade and loan term",
    hover_data={"loan_failed": ":.3p"},
)
fig.update_yaxes(tickformat=".0%")
fig.show()

- For high-grade loans, loans with a shorter term are more likely to fail.
- The opposite is true for loans with low grades.

## Duration

How do loan durations compare to loan terms?

In exploring this question, filter out loans where payments are still in progress.

In [None]:
bool_index = loan_data["loan_status"].isin(["Fully Paid", "Charged Off", "Default"])
closed_loans = loan_data[bool_index].assign(
    loan_duration=lambda df: get_duration(df, "issue_d", "last_pymnt_d"),
    loan_failed=lambda df: df["loan_failed"].map({True: "true", False: "false"}),
)

In [None]:
to_plot = get_group_sizes(closed_loans, group_by=["term", "loan_duration"])
fig = px.bar(
    to_plot,
    x="loan_duration",
    y="count",
    color="term",
    facet_row="term",
    labels={
        "loan_duration": "Loan duration",
        "count": "Number of loans",
        "term": "Loan term",
    },
    title="Distribution of loan duration by loan term",
    hover_data={"count": ":.3s"},
    height=500,
)


def clean_up_hovertemplate(trace):
    trace.hovertemplate = re.sub(
        r"(Loan duration.*?})",
        r"\1 months",
        trace.hovertemplate,
    )


fig.for_each_trace(clean_up_hovertemplate)
fig.for_each_annotation(
    lambda ann: ann.update(text=ann.text.replace("Loan term", "Term"))
)
fig.update_yaxes(matches=None)
fig.update_layout(xaxis1_title="Loan duration (months)")
fig.show()

In [None]:
to_plot = closed_loans[["term", "loan_duration"]].groupby("term").mean().reset_index()
fig = px.bar(
    to_plot,
    x="term",
    y="loan_duration",
    color="term",
    labels={
        "loan_duration": "Mean loan duration",
        "term": "Loan term",
    },
    title="Mean loan duration by loan term",
    hover_data={"loan_duration": ":.1f"},
)


def clean_up_hovertemplate(trace):
    trace.hovertemplate = re.sub(
        r"(Mean loan duration.*?})<",
        r"\1 months<",
        trace.hovertemplate,
    )


fig.for_each_trace(clean_up_hovertemplate)
fig.update_yaxes(title="Mean loan duration (months)")
fig.show()

The duration of most loans is much shorter than the loan term.

Is there a significant difference if we only consider loans that are fully paid, and
exclude loans that are in default or have been charged off?

In [None]:
to_plot = get_group_sizes(
    closed_loans, group_by=["term", "loan_failed", "loan_duration"]
)
fig = px.bar(
    to_plot,
    x="loan_duration",
    y="count",
    color="term",
    facet_row="term",
    facet_col="loan_failed",
    labels={
        "loan_duration": "Loan duration",
        "count": "Number of loans",
        "term": "Loan term",
        "loan_failed": "Loan failed",
    },
    title="Distribution of loan duration by loan outcome and loan term",
    hover_data={"count": ":.3s"},
    facet_col_spacing=0.06,
    height=500,
)


def clean_up_hovertemplate(trace):
    trace.hovertemplate = re.sub(
        r"(Loan duration.*?})",
        r"\1 months",
        trace.hovertemplate,
    )


fig.for_each_trace(clean_up_hovertemplate)
fig.for_each_annotation(
    lambda ann: ann.update(text=ann.text.replace("Loan term", "Term"))
)
fig.update_yaxes(matches=None, showticklabels=True)
fig.update_layout(
    xaxis1_title="Loan duration (months)", xaxis2_title="Loan duration (months)"
)
fig.show()

In [None]:
to_plot = (
    closed_loans[["term", "loan_failed", "loan_duration"]]
    .groupby(["term", "loan_failed"])
    .mean()
    .reset_index()
)
fig = px.bar(
    to_plot,
    x="loan_failed",
    y="loan_duration",
    color="term",
    barmode="group",
    labels={
        "loan_duration": "Mean loan duration",
        "loan_failed": "Loan failed",
        "term": "Loan term",
    },
    title="Mean loan duration by loan outcome and loan term",
    hover_data={"loan_duration": ":.1f"},
)


def clean_up_hovertemplate(trace):
    trace.hovertemplate = re.sub(
        r"(Mean loan duration.*?})",
        r"\1 months",
        trace.hovertemplate,
    )


fig.for_each_trace(clean_up_hovertemplate)
fig.update_yaxes(title="Mean loan duration (months)")
fig.show()

With loans that are in default or have been charged off excluded from the analysis, the
qualitative pattern is similar:  the duration of most loans is much shorter than the
loan term.

Continuing to consider only loans that are fully paid, do we find a correlation between
loan grade and loan duration?

In [None]:
paid_loans = closed_loans[closed_loans["loan_failed"] == "false"]

In [None]:
to_plot = paid_loans[["grade", "loan_duration"]].groupby("grade").mean().reset_index()
fig = px.bar(
    to_plot,
    x="grade",
    y="loan_duration",
    labels={
        "grade": "Grade",
        "loan_duration": "Mean loan duration",
    },
    title="Mean loan duration for fully paid loans by loan grade",
    hover_data={"loan_duration": ":.1f"},
)


def clean_up_hovertemplate(trace):
    trace.hovertemplate = re.sub(
        r"(Mean loan duration.*?})",
        r"\1 months",
        trace.hovertemplate,
    )


fig.for_each_trace(clean_up_hovertemplate)
fig.update_yaxes(title="Mean loan duration (months)")
fig.show()

The loan duration tends to decrease for poorer loan grades.

The next two plots illustrate the correlation in greater detail.

In [None]:
to_plot = (
    paid_loans[["term", "grade", "loan_duration"]]
    .groupby(["term", "grade"])
    .mean()
    .reset_index()
)
fig = px.bar(
    to_plot,
    x="grade",
    y="loan_duration",
    color="term",
    barmode="group",
    labels={
        "grade": "Grade",
        "loan_duration": "Mean loan duration",
        "term": "Loan term",
    },
    title="Mean loan duration for fully paid loans by loan grade and term",
    hover_data={"loan_duration": ":.1f"},
)


def clean_up_hovertemplate(trace):
    trace.hovertemplate = re.sub(
        r"(Mean loan duration.*?})",
        r"\1 months",
        trace.hovertemplate,
    )


fig.for_each_trace(clean_up_hovertemplate)
fig.update_yaxes(title="Mean loan duration (months)")
fig.show()

In [None]:
to_plot = get_group_sizes(paid_loans, group_by=["term", "grade", "loan_duration"])
fig = px.bar(
    to_plot,
    x="loan_duration",
    y="count",
    color="term",
    facet_row="grade",
    facet_col="term",
    labels={
        "loan_duration": "Loan duration",
        "count": "Number of loans",
        "term": "Loan term",
        "grade": "Grade",
    },
    title=(
        "Distribution of loan duration for fully paid loans by loan grade and term"
    ),
    hover_data={"count": ":.3s"},
    facet_col_spacing=0.06,
    height=800,
)


def clean_up_hovertemplate(trace):
    trace.hovertemplate = re.sub(
        r"(Loan duration.*?})",
        r"\1 months",
        trace.hovertemplate,
    )


fig.for_each_trace(clean_up_hovertemplate)
fig.for_each_annotation(
    lambda ann: ann.update(text=ann.text.replace("Loan term", "Term"))
)
fig.update_yaxes(matches=None, showticklabels=True, title="")
fig.update_layout(
    xaxis1_title="Loan duration (months)",
    xaxis2_title="Loan duration (months)",
    yaxis7_title="Number of loans",
)
fig.show()

## Profit

### Calculation

The calculation of profit as a percentage of principal is simple:

$$
  \text{percent profit} = \frac{\text{total payments} - \text{principal}}{\text{principal}}
$$

This calculation is done by the `get_profit` function from `notebook_tools.derived_features`.
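
As a minimal sketch of this formula, the hypothetical helper `percent_profit` below applies it to a small made-up dataframe. It is not the actual `get_profit` implementation, whose source is not shown in this notebook, and the values are for illustration only.

```python
import pandas as pd


def percent_profit(df, payments_col, principal_col):
    """Return (total payments - principal) / principal for each row.

    Hypothetical helper for illustration; not the actual get_profit function.
    """
    return (df[payments_col] - df[principal_col]) / df[principal_col]


# Toy values for illustration only; not taken from the loan dataset.
toy_loans = pd.DataFrame(
    {"total_pymnt": [11500.0, 4000.0], "funded_amnt": [10000.0, 5000.0]}
)
toy_profit = percent_profit(toy_loans, "total_pymnt", "funded_amnt")
print(toy_profit.map("{:.1%}".format).tolist())  # ['15.0%', '-20.0%']
```

A loan that repays more than its principal has a positive percent profit, while a loan that repays less has a negative one.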

As discussed in the [notebook that filters data](./data-cleaning-02.html), the dataset includes a few loans for which the feature `funded_amnt` is
different from `loan_amnt`.  Consistent with the fact that `loan_amnt` represents the amount requested by the borrower, `funded_amnt` is always less than or equal to `loan_amnt`.

Conclusion: `funded_amnt` should be used as the principal in calculating profit.

In [None]:
closed_loans["profit"] = get_profit(closed_loans, "total_pymnt", "funded_amnt")

### Analysis

What is the distribution of profit / loss from the loans?

In [None]:
closed_loans[["profit"]].describe()

In [None]:
mean_profit = closed_loans["profit"].mean()
max_loss = closed_loans["profit"].min()
max_profit = closed_loans["profit"].max()
print(
    f"The mean profit was {mean_profit:.2%}.\n"
    f"The maximum loss was {max_loss:.2%} and the "
    f"maximum profit was {max_profit:.2%}."
)

In [None]:
profit_bins = np.linspace(-1.0, 1.18, num=110)
profit_bin_labels = [f"{left:d}% to {left+1.99:.2f}%" for left in range(-100, 118, 2)]
profit_tick_vals = profit_bin_labels[0::25]
profit_tick_text = [f"{left:d}%" for left in range(-100, 118, 50)]

In [None]:
closed_loans["profit_bin"] = pd.cut(
    closed_loans["profit"], bins=profit_bins, labels=profit_bin_labels, right=False
)

In [None]:
to_plot = get_group_sizes(closed_loans, group_by="profit_bin")
fig = px.bar(
    to_plot,
    x="profit_bin",
    y="count",
    labels={"profit_bin": "Percent profit", "count": "Number of loans"},
    title="Distribution of loan profit",
    hover_data={"count": ":.3s"},
)


def clean_up_hovertemplate(trace):
    trace.customdata = profit_bin_labels
    trace.hovertemplate = trace.hovertemplate.replace("%{x}", "%{customdata}")


fig.for_each_trace(clean_up_hovertemplate)
fig.update_layout(bargap=0)
fig.update_xaxes(tickmode="array", tickvals=profit_tick_vals, ticktext=profit_tick_text)
fig.show()

In [None]:
to_plot = closed_loans[["term", "profit"]].groupby("term").mean().reset_index()
fig = px.bar(
    to_plot,
    x="term",
    y="profit",
    color="term",
    labels={
        "term": "Loan term",
        "profit": "Mean profit",
    },
    title="Mean profit by loan term",
    hover_data={"profit": ":.3p"},
)
fig.update_yaxes(tickformat=".0%")
fig.show()

In [None]:
to_plot = closed_loans[["grade", "profit"]].groupby("grade").mean().reset_index()
fig = px.bar(
    to_plot,
    x="grade",
    y="profit",
    labels={
        "grade": "Loan grade",
        "profit": "Mean profit",
    },
    title="Mean profit by loan grade",
    hover_data={"profit": ":.3p"},
)
fig.update_yaxes(tickformat=".0%")
fig.show()

In [None]:
to_plot = (
    closed_loans[["grade", "sub_grade", "profit"]]
    .groupby(["grade", "sub_grade"])
    .mean()
    .reset_index()
)
fig = px.bar(
    to_plot,
    x="sub_grade",
    y="profit",
    color="grade",
    labels={
        "grade": "Loan grade",
        "sub_grade": "Loan sub-grade",
        "profit": "Mean profit",
    },
    title="Mean profit by loan grade and sub-grade",
    hover_data={"profit": ":.3p"},
)
fig.update_yaxes(tickformat=".0%")
fig.show()

In [None]:
to_plot = (
    closed_loans[["grade", "term", "profit"]]
    .groupby(["grade", "term"])
    .mean()
    .reset_index()
)
fig = px.bar(
    to_plot,
    x="grade",
    y="profit",
    color="term",
    barmode="group",
    labels={
        "grade": "Loan grade",
        "profit": "Mean profit",
        "term": "Loan term",
    },
    title="Mean profit by loan grade and loan term",
    hover_data={"profit": ":.3p"},
)
fig.update_yaxes(tickformat=".0%")
fig.show()

## Estimate of annualized return

### Calculation

#### Method

How can the profit analyzed in the previous section be converted to an annualized rate of return?

The [internal rate of return](https://en.wikipedia.org/wiki/Internal_rate_of_return) or IRR can be used to characterize 
the rate of return on a stream of payments from an amortized loan.
However, calculation of the IRR requires knowing the value of each payment in the stream, and this information
is not included in our dataset.

In estimating an annualized return for each loan, it is natural to consider a simplified model that assumes that all
payments made during the loan duration were equal, even in cases where the loan duration was less than the loan term.
However, calculation of IRR in general requires a numerical solver, and running a numerical solver for each of the 2.2 million
loans in our dataset is likely to be time-consuming.

Our analytic goal in doing such a calculation would be to place each loan in a bin representing a range of return rates.
This binning can be done without using a numerical solver:

- Generate a dataframe $G$ of values for percent profit for amortized loans based on two discretized inputs:

    - The loan duration in months, with each value corresponding to a column of the dataframe.
    
    - The internal rate of return (IRR) for the loan, with each value corresponding to a row of the dataframe.

- In the values of profit in this dataframe, assume that all payments are equal.
    
- For each loan in the dataset, the duration of the loan specifies a column of the dataframe $G$.  Find the two values of profit
in this column that bracket the loan's percent profit.  The corresponding two values of IRR define a bin for the loan's estimated IRR.
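
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in `notebook_tools.derived_features`: the grid resolution (IRR from 1% to 100% in steps of 1%, durations from 1 to 60 months) and the names `profit_for_irr`, `grid`, and `irr_bin` are hypothetical, and the percent profit is computed from the standard annuity formula for equal monthly payments.

```python
import numpy as np
import pandas as pd


def profit_for_irr(annual_rate, n_months):
    """Percent profit of an amortized loan with equal monthly payments."""
    monthly_rate = annual_rate / 12
    # Annuity formula: monthly payment per unit of principal.
    payment = monthly_rate / (1 - (1 + monthly_rate) ** (-n_months))
    return n_months * payment - 1


# Discretized inputs (the resolution is an assumption made for this sketch).
irr_edges = np.linspace(0.01, 1.00, num=100)  # rows: 1%, 2%, ..., 100%
durations = np.arange(1, 61)  # columns: 1 to 60 months

# Rows are IRR values, columns are loan durations, values are percent profit.
grid = pd.DataFrame(
    profit_for_irr(irr_edges[:, None], durations[None, :]),
    index=irr_edges,
    columns=durations,
)


def irr_bin(profit, n_months):
    """Return the (lower, upper) IRR values that bracket a loan's percent profit."""
    column = grid[n_months].to_numpy()  # profit increases with IRR
    pos = np.searchsorted(column, profit, side="right")
    # Profits below the first grid row fall in the bin that starts at 0%.
    lower = irr_edges[pos - 1] if pos > 0 else 0.0
    upper = irr_edges[pos] if pos < len(irr_edges) else np.inf
    return lower, upper


lower, upper = irr_bin(profit=0.165, n_months=36)
print(f"Estimated IRR between {lower:.0%} and {upper:.0%}")
```

Because the profit in each column increases monotonically with IRR, a binary search per loan is enough, and no root-finding is needed.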

#### Simplifying assumptions

- The major simplifying assumption described in the previous subsection is that all payments are equal, which may be quite inaccurate.
However, a calculation performed using this assumption still gives insight not available directly from the calculated profit on the loan.

- A second simplifying assumption is to treat loans in the dataset of zero duration as loans of duration 1 month.

To understand the rationale for this second assumption, explore the properties of the loans that are fully paid, in default, or charged off but have a loan duration that is undefined or equal to 0.

In [None]:
loans_missing_duration = closed_loans[closed_loans["loan_duration"].isna()]

In [None]:
missing_duration_status_counts = get_value_counts(loans_missing_duration["loan_status"])
display(style_value_counts(missing_duration_status_counts))

All loans with undefined duration have status "Charged Off".  Since only loans with status "Fully Paid" will be included in the calculation
of annualized return, the calculation is not affected by loans with undefined duration.

In [None]:
zero_duration_loans = closed_loans[closed_loans["loan_duration"] == 0]

In [None]:
zero_duration_status_counts = get_value_counts(zero_duration_loans["loan_status"])
display(style_value_counts(zero_duration_status_counts))

Exclude the loans of zero duration that have status "Charged Off".

In [None]:
bool_index = (closed_loans["loan_duration"] == 0) & (
    closed_loans["loan_status"] == "Fully Paid"
)
zero_duration_loans = closed_loans[bool_index]

In [None]:
zero_duration_loans[["profit"]].describe()

In [None]:
sns.set_theme()

In [None]:
plot = sns.displot(zero_duration_loans, x="profit", aspect=2.5, bins=100).set(
    title="Distribution of profit for loans of zero duration"
)
ax = plot.facet_axis(0, 0)
ax.xaxis.set_major_formatter(PercentFormatter(xmax=1.0, decimals=0))
ax.set_xticks([0.0, 0.01, 0.02, 0.03, 0.04])
ax.set_xlabel("Profit")
plt.show()

From this chart, it's clear that interest was charged on some loans that had a duration of less than a month.  But in the absence of detailed information about how interest is charged
for these loans with a duration of less than a month, annualized returns are estimated as if these
loans had a duration of 1 month.
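
A minimal sketch of this adjustment, using made-up durations, is shown below. Whether `get_annualized_return` applies the adjustment in exactly this way is an assumption; its source is not shown in this notebook.

```python
import pandas as pd

# Toy durations in months, for illustration only.
toy_durations = pd.Series([0, 1, 14, 36], name="loan_duration")

# Treat loans of zero duration as loans of duration 1 month.
effective_durations = toy_durations.clip(lower=1)
print(effective_durations.tolist())  # [1, 1, 14, 36]
```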

#### Derivation of formula

The function `get_annualized_return` in `notebook_tools.derived_features` calculates the estimated rate of return
for loans that are fully paid.

The current section derives and checks the formulas used in that function.

The [internal rate of return (IRR)](https://en.wikipedia.org/wiki/Internal_rate_of_return) is calculated
by finding the [discount rate](https://en.wikipedia.org/wiki/Annual_effective_discount_rate) that causes the [net present value](https://en.wikipedia.org/wiki/Net_present_value) of a stream of payments to be zero.

Define the following notation:
- $r$ the internal rate of return
- $P$ the loan principal
- $n$ the number of monthly payments
- $M$ the monthly payment
- $T$ the total payment, or sum of payments made by the borrower
- $G$ the percent profit made by the lender

We assume that $r$, $P$, and $n$ are known, while $G$ is to be calculated.  We also assume that $T=nM$, i.e., equal monthly payments are made.

Given a net present value of 0, we have

$$
    0 = -P + \sum_{i=1}^{n}\frac{M}{(1+r/12)^i}
$$

The percent profit is

$$
    G = \frac{T - P}{P}
$$
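
One way to complete the derivation is sketched here; it may differ in detail from the formulas used in `get_annualized_return`. The discount factors $(1+r/12)^{-i}$ for $i = 1, \ldots, n$ form a geometric series, so the condition of zero net present value gives

$$
    P = M \, \frac{1 - (1+r/12)^{-n}}{r/12}
$$

Substituting $T = nM$ into the expression for $G$ then gives

$$
    G = \frac{n \, (r/12)}{1 - (1+r/12)^{-n}} - 1
$$

The following check, with hypothetical function names and example values, confirms numerically that the payment implied by this closed form makes the net present value vanish.

```python
def percent_profit_closed_form(r, n):
    """Percent profit G for annual rate r and n equal monthly payments."""
    monthly_rate = r / 12
    return n * monthly_rate / (1 - (1 + monthly_rate) ** (-n)) - 1


def net_present_value(r, n, principal, payment):
    """Net present value of n equal monthly payments discounted at r / 12."""
    monthly_rate = r / 12
    discounted = sum(payment / (1 + monthly_rate) ** i for i in range(1, n + 1))
    return -principal + discounted


# Example values chosen for illustration: 10% rate, 36 payments, principal 10,000.
r, n, principal = 0.10, 36, 10_000.0
G = percent_profit_closed_form(r, n)
payment = principal * (1 + G) / n  # since T = nM and T = P(1 + G)
npv = net_present_value(r, n, principal, payment)
print(f"G = {G:.4f}, monthly payment = {payment:.2f}, NPV = {npv:.2e}")
```

For these values the monthly payment is about 322.67 and the percent profit is about 16.2%, with a net present value that is zero up to rounding error.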

### Analysis

How does the distribution of annualized return for fully-paid loans compare to the distribution of profit?

In [None]:
paid_loans = paid_loans.assign(
    profit=lambda df: get_profit(df, "total_pymnt", "funded_amnt"),
    annualized_return=lambda df: get_annualized_return(df, "profit", "loan_duration"),
)

In [None]:
paid_loans[["profit"]].describe()

In [None]:
paid_loans[["annualized_return"]].describe()

In [None]:
mean_return = paid_loans["annualized_return"].mean()
min_return = paid_loans["annualized_return"].min()
max_return = paid_loans["annualized_return"].max()
print(
    f"The mean annualized return was {mean_return:.2%}.\n"
    f"The minimum annualized return was {min_return:.2%} and the "
    f"maximum was {max_return:.2%}."
)

In [None]:
return_bins = np.linspace(0, 1.0, num=101)
return_bin_labels = [f"{left:d}% to {left+.99:.2f}%" for left in range(0, 100)]
return_tick_vals = return_bin_labels[0::25]
return_tick_text = [f"{left:d}%" for left in range(0, 100, 25)]
display(return_tick_text)

In [None]:
paid_loans["annualized_return_bin"] = pd.cut(
    paid_loans["annualized_return"],
    bins=return_bins,
    labels=return_bin_labels,
    right=False,
)

In [None]:
to_plot = get_group_sizes(paid_loans, group_by="annualized_return_bin")
fig = px.bar(
    to_plot,
    x="annualized_return_bin",
    y="count",
    labels={"annualized_return_bin": "Annualized return", "count": "Number of loans"},
    title="Distribution of annualized return for fully-paid loans",
    hover_data={"count": ":.3s"},
)


def clean_up_hovertemplate(trace):
    trace.customdata = return_bin_labels
    trace.hovertemplate = trace.hovertemplate.replace("%{x}", "%{customdata}")


fig.for_each_trace(clean_up_hovertemplate)
fig.update_layout(bargap=0)
fig.update_xaxes(tickmode="array", tickvals=return_tick_vals, ticktext=return_tick_text)
fig.show()