# Filter columns and rows

- [Display feature summaries](#Display-feature-summaries)
- [Explore features](#Explore-features)
- [Filter columns and rows](#Filter-columns-and-rows)
- [Update the database](#Update-the-database)

The most recent data in the LendingClub
[dataset](https://www.kaggle.com/datasets/wordsforthewise/lending-club) is from 2018,
and since then, LendingClub has [stopped operating as a peer-to-peer
lender](https://en.wikipedia.org/wiki/LendingClub#End_of_P2P_platform,_2019-2020).
Unsurprisingly, it's difficult to find explanations on the LendingClub website about the
features in this dataset.

Sites not officially associated with LendingClub still contain information about the
peer-to-peer service previously offered by LendingClub.  As a result, the feature
exploration for this project includes links to miscellaneous pages such as blogs.

Beginning with the current notebook, however, the lack of detailed information about
features does impose some limits.  For instance, rows containing certain values of
`loan_status` are filtered out simply because it is difficult to understand what those
values mean.

This notebook does the following:

- Explore features to determine what filtering should be done.
- Filter out certain columns and rows from the data on accepted loans.
- Update the database, including only the filtered data on accepted loans.

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
from IPython.display import display

import notebook_tools.data_cleaning as clean
import notebook_tools.database as db
from notebook_tools.feature_exploration import (
    get_group_sizes,
    get_value_counts,
    style_loan_summary,
    style_value_counts,
    summarize_acc_loans,
    summarize_loan_data,
)

## Display feature summaries

Use functions in the package `notebook_tools` to load data and generate feature
summaries.

In [None]:
acc_loan_data = clean.load_acc_loan_data(excluded_cols=["member_id"])
acc_loan_data = clean.convert_acc_loan_data(
    acc_loan_data, conversions=("time_intervals", "dates", "booleans")
)

In [None]:
acc_loan_feat_desc = clean.load_acc_loan_feat_desc()

In [None]:
rej_loan_data = clean.load_rej_loan_data()
rej_loan_data = clean.convert_rej_loan_data(rej_loan_data, conversions=("percentages",))

### Feature summaries for accepted loans

#### Total number of records:  2,260,701

In [None]:
print(f"The number of records for accepted loans is {len(acc_loan_data.index):,d}.")

In [None]:
for dtype in [np.number, "string", "boolean"]:
    summary = summarize_acc_loans(acc_loan_data, dtype, acc_loan_feat_desc)
    print(f"\n\nThe number of features of type {dtype} is {len(summary.index)}.\n\n")
    display(style_loan_summary(summary))

### Feature summaries for rejected loans

#### Total number of records:  27,648,741

In [None]:
print(f"The number of records for rejected loans is {len(rej_loan_data.index):,d}.")

In [None]:
for dtype in [np.number, "string"]:
    summary = summarize_loan_data(rej_loan_data, dtype)
    display(style_loan_summary(summary))

## Explore features

### `policy_code` / `Policy Code`

What do the columns `policy_code` (for accepted loans) and `Policy Code` (for rejected
loans) refer to?

From ["What are these Policy Code 2 Loans at Lending
Club?"](https://www.fintechnexus.com/policy-code-2-loans-lending-club/):

> - These [Policy Code 2 loans] are loans made to borrowers that do not meet Lending
Club’s current credit policy standards.
> - The FICO scores on these borrowers are typically 640-659, below the 660 threshold on
Policy Code 1 loans.
> - These loans are made available to select institutional investors who have a great
deal of experience with consumer loans in this credit spectrum and with Lending Club.

In [None]:
policy_code_counts = get_value_counts(acc_loan_data["policy_code"])
display(style_value_counts(policy_code_counts))

In [None]:
policy_code_counts_rej = get_value_counts(rej_loan_data["Policy Code"])
display(style_value_counts(policy_code_counts_rej))

### `loan_status`

What are the distinct values for the column `loan_status`?

In [None]:
loan_status_counts = get_value_counts(acc_loan_data["loan_status"])
display(style_value_counts(loan_status_counts))
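
The implementation of `get_value_counts` is not shown in this notebook. As a rough sketch of the kind of table it might produce, the following uses plain pandas on a made-up `loan_status` column. The toy values and the `percent` column are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the `loan_status` column (values are made up).
loan_status = pd.Series(
    ["Fully Paid", "Current", "Fully Paid", pd.NA, "Charged Off", "Current", "Fully Paid"],
    dtype="string",
)

# dropna=False keeps missing values as their own category, which is how rows
# with a missing status become visible in the counts.
counts = loan_status.value_counts(dropna=False)

# Hypothetical summary layout: raw count plus share of all rows.
summary = counts.to_frame(name="count")
summary["percent"] = (100 * summary["count"] / summary["count"].sum()).round(1)
print(summary)
```

Counting with `dropna=False` matters here because the rows with a missing `loan_status` would otherwise be silently left out of the table.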

The 33 rows that have `NA` for `loan_status` also have `NA` for every feature other
than `id`, so these rows can be filtered from the data.

From the values of `id` displayed in the output of the next cell, these null rows appear
to be associated with the policy code.

In [None]:
missing_status = acc_loan_data[acc_loan_data["loan_status"].isna()]

In [None]:
display(missing_status.head(4).transpose())

As a check, verify that if the `id` column is dropped, then all values are `NA` in rows
that are missing `loan_status`.  The cell below sums the number of non-null values per
column, so the expected result is 0.

In [None]:
display(missing_status.drop("id", axis="columns").count().sum())
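
To make the logic of this check concrete, here is a minimal sketch on a made-up three-row frame. The column values and the `"footer-row"` id are hypothetical placeholders, not values from the dataset.

```python
import pandas as pd

# Toy frame: the last row is empty apart from `id`.
toy = pd.DataFrame(
    {
        "id": ["1001", "1002", "footer-row"],  # "footer-row" is a made-up placeholder
        "loan_status": pd.array(["Fully Paid", "Current", pd.NA], dtype="string"),
        "loan_amnt": pd.array([5000, 12000, pd.NA], dtype="Int64"),
    }
)

missing_status = toy[toy["loan_status"].isna()]
others = missing_status.drop(columns="id")

# count() returns the number of non-null values per column, so a total of
# zero means every remaining cell in these rows is missing.
non_null_total = int(others.count().sum())

# An equivalent, more direct formulation of the same check.
all_missing = bool(others.isna().to_numpy().all())

print(non_null_total, all_missing)
```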

Create a dataframe that has these empty rows filtered out.  After additional filtering,
this dataframe will be used to recreate the SQLite database.

In [None]:
filtered_loan_data = acc_loan_data[acc_loan_data["loan_status"].notna()]

In [None]:
loan_status_counts = get_value_counts(filtered_loan_data["loan_status"])
display(style_value_counts(loan_status_counts))

Note that after the rows with missing `loan_status` have been filtered out, there are no
missing values for `policy_code`.  Since all rows have the same value for `policy_code`,
this column can be dropped.

In [None]:
policy_code_counts = get_value_counts(filtered_loan_data["policy_code"])
display(style_value_counts(policy_code_counts))

Most of the values for `loan_status` are explained at ["What Do the Different Note Statuses
Mean?"](https://www.lendingclub.com/help/investing-faq/what-do-the-different-note-statuses-mean).

However, the values `Does not meet the credit policy. Status:Fully Paid` and `Does not
meet the credit policy. Status:Charged Off` are unclear. Let's take a look at a random
sample of the rows that have these values of loan status.

In [None]:
bool_index = filtered_loan_data["loan_status"].str.endswith("Status:Fully Paid")
sampled_data = filtered_loan_data[bool_index].sample(
    n=5, random_state=59147, axis="index"
)
with pd.option_context("display.max_columns", None):
    display(sampled_data)

In [None]:
bool_index = filtered_loan_data["loan_status"].str.endswith("Status:Charged Off")
sampled_data = filtered_loan_data[bool_index].sample(
    n=5, random_state=59147, axis="index"
)
with pd.option_context("display.max_columns", None):
    display(sampled_data)

Nothing jumps out from this small random sample.  Rather than trying to guess why
certain rows do not meet the credit policy, I'll exclude these rows.

In [None]:
bool_index = filtered_loan_data["loan_status"].str.startswith("Does not meet")
filtered_loan_data = filtered_loan_data[~bool_index]
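
The prefix-based exclusion can be sketched on a toy column as follows. The `na=False` argument is a defensive addition of mine: rows with a missing status were already removed above, so it should make no difference in this notebook.

```python
import pandas as pd

# Toy stand-in for the accepted-loan data (values chosen for illustration).
loans = pd.DataFrame(
    {
        "loan_status": pd.array(
            [
                "Fully Paid",
                "Does not meet the credit policy. Status:Fully Paid",
                "Charged Off",
                "Does not meet the credit policy. Status:Charged Off",
            ],
            dtype="string",
        )
    }
)

# Flag both "Does not meet ..." variants with a single prefix test.
# na=False treats a missing status as "no match" instead of propagating NA.
excluded = loans["loan_status"].str.startswith("Does not meet", na=False)

# Negate the mask to keep everything that was not flagged.
kept = loans[~excluded]
print(kept)
```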

In [None]:
loan_status_counts = get_value_counts(filtered_loan_data["loan_status"])
display(style_value_counts(loan_status_counts))

### `issue_d`

The description of this feature is "The month which the loan was funded".

After rows with problematic values of `loan_status` have been filtered out, there are no
missing values for `issue_d`.

In [None]:
filtered_loan_data["issue_d"].isna().sum()

In [None]:
to_plot = get_group_sizes(filtered_loan_data, by="issue_d")
fig = px.line(
    to_plot,
    x="issue_d",
    y="count",
    markers=True,
    labels={"issue_d": "Loan date", "count": "Number of loans"},
    hover_data={"count": ":.3s"},
    title="Number of accepted loans by date",
)
fig.show()
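
The helper `get_group_sizes` is defined in `notebook_tools` and is not shown here. Judging from how its output is plotted (an `issue_d` column and a `count` column), it may behave roughly like the pandas sketch below. This is an assumption about its behavior, illustrated on made-up dates.

```python
import pandas as pd

# Toy data: one row per loan, with the month the loan was issued.
loans = pd.DataFrame(
    {
        "issue_d": pd.to_datetime(
            [
                "2012-01-01",
                "2012-01-01",
                "2012-02-01",
                "2012-03-01",
                "2012-03-01",
                "2012-03-01",
            ]
        )
    }
)

# Count rows per distinct value of `issue_d` and return a flat frame with a
# "count" column, which is the shape the plotting call expects.
group_sizes = loans.groupby("issue_d").size().reset_index(name="count")
print(group_sizes)
```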

I will exclude pre-2012 dates from the analysis.

Analysis and prediction based on this data will need to take account of changes in
behavior over time, and given the relatively small number of loans issued before 2012,
it is not worthwhile to include the pre-2012 data.

In [None]:
bool_index = filtered_loan_data["issue_d"] >= "2012-01"
filtered_loan_data = filtered_loan_data[bool_index]
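
As a small illustration of the cutoff, the sketch below assumes that `issue_d` holds `datetime64` values after the date conversion (the notebook does not state the dtype explicitly). It uses an explicit `pd.Timestamp` so that the boundary, the first day of January 2012, is unambiguous.

```python
import pandas as pd

# Toy issue dates straddling the cutoff.
loans = pd.DataFrame(
    {
        "issue_d": pd.to_datetime(
            ["2011-11-01", "2011-12-01", "2012-01-01", "2015-06-01"]
        )
    }
)

# Loans issued on or after this date are kept.
cutoff = pd.Timestamp("2012-01-01")
recent = loans[loans["issue_d"] >= cutoff]
print(recent)
```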

In [None]:
to_plot = get_group_sizes(filtered_loan_data, by="issue_d")
fig = px.line(
    to_plot,
    x="issue_d",
    y="count",
    markers=True,
    labels={"issue_d": "Loan date", "count": "Number of loans"},
    hover_data={"count": ":.3s"},
    title="Number of accepted loans by date",
)
fig.show()

### `loan_amnt` / `funded_amnt` / `funded_amnt_inv`

What is the distinction between `loan_amnt`, `funded_amnt`, `funded_amnt_inv`?

Start by examining the feature descriptions.

In [None]:
amount_features = acc_loan_feat_desc.loc[
    ["loan_amnt", "funded_amnt", "funded_amnt_inv"], ["description"]
]
display(style_loan_summary(amount_features))

What should we infer in cases where `loan_amnt` is different than `funded_amnt`, or in
cases where `funded_amnt` is different than `funded_amnt_inv`?  It's not completely
clear from these descriptions.

Investigate the frequency of these cases.

In [None]:
# First check for missing values.
for column_name in ["loan_amnt", "funded_amnt", "funded_amnt_inv"]:
    na_count = filtered_loan_data[column_name].isna().sum()
    print(f'\nThe number of missing values for feature "{column_name}" is {na_count}.')

In [None]:
bool_index = (filtered_loan_data["loan_amnt"] - filtered_loan_data["funded_amnt"]) != 0
print(
    '\nThe number of loans with "loan_amnt" different than "funded_amnt" is '
    f"{bool_index.sum()}.\n"
)

to_plot = get_group_sizes(filtered_loan_data[bool_index], by="issue_d")
fig = px.scatter(
    to_plot,
    x="issue_d",
    y="count",
    labels={"issue_d": "Loan date", "count": "Number of loans"},
    hover_data={"count": ":,d"},
    title='Number of loans with "loan_amnt" different than "funded_amnt"',
)
fig.show()

In [None]:
bool_index = (
    filtered_loan_data["funded_amnt"] - filtered_loan_data["funded_amnt_inv"]
) != 0
print(
    '\nThe number of loans with "funded_amnt" different than "funded_amnt_inv" is '
    f"{bool_index.sum()}.\n"
)

to_plot = get_group_sizes(filtered_loan_data[bool_index], by="issue_d")
fig = px.scatter(
    to_plot,
    x="issue_d",
    y="count",
    labels={"issue_d": "Loan date", "count": "Number of loans"},
    hover_data={"count": ":,d"},
    title='Number of loans with "funded_amnt" different than "funded_amnt_inv"',
)
fig.show()

Discussion:

- Only 68 of the 2.2 million loans have `loan_amnt` different than `funded_amnt`.
Essentially all the loans are fully funded.
- About 130k of the loans have different values for `funded_amnt` and `funded_amnt_inv`.
Is LendingClub itself providing funding in these cases?

While I don't understand the cause of the differences between `loan_amnt`,
`funded_amnt`, and `funded_amnt_inv`, I won't filter out the rows with different values
for these features.  Unlike the rows where `loan_status` includes the string `"Does not
meet the credit policy"`, there isn't a strong indication that rows with different
values for `loan_amnt`, `funded_amnt`, and `funded_amnt_inv` are fundamentally
problematic.
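
One way to get a better feel for these differences, should it become relevant later, is to look at the size of the gap relative to the funded amount. The sketch below does this on made-up numbers. The amounts are hypothetical and say nothing about the actual distribution in the dataset.

```python
import pandas as pd

# Toy amounts (made up); rows 1 and 3 have an investor-funded shortfall.
loans = pd.DataFrame(
    {
        "loan_amnt": [10000.0, 8000.0, 15000.0, 6000.0],
        "funded_amnt": [10000.0, 8000.0, 14000.0, 6000.0],
        "funded_amnt_inv": [10000.0, 7950.0, 14000.0, 5000.0],
    }
)

# Absolute gap between the funded amount and the investor-funded amount.
gap = loans["funded_amnt"] - loans["funded_amnt_inv"]
differs = gap.ne(0)
n_differs = int(differs.sum())

# Gap as a fraction of the funded amount, restricted to rows that differ.
relative_gap = gap[differs] / loans.loc[differs, "funded_amnt"]
print(n_differs)
print(relative_gap.describe())
```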

### `initial_list_status`

The feature `initial_list_status` is explained in [this blog
post](https://sirallen.name/blog/note-on-lending-club/):

> The variable initial_list_status is available in the public data and identifies
whether a loan was initially listed in the whole (W) or fractional (F) market. Loans
listed “whole” become available for fractional funding (and vice versa) if there are no
buyers within a certain time frame.

In [None]:
list_status_counts = get_value_counts(filtered_loan_data["initial_list_status"])
display(style_value_counts(list_status_counts))

Given this explanation of the feature `initial_list_status`, there's no need to drop the
feature or filter out rows based on the value of the feature.

## Filter columns and rows

Taking account of the feature summaries and the feature exploration above, certain
columns will be excluded from the analysis of accepted loans.

- url:  URL for the LC page with listing data
- title:  The loan title provided by the borrower
- desc:  Loan description provided by the borrower
- policy_code:  publicly available policy_code=1, new products not publicly available policy_code=2

Also, rows will be filtered out based on the following criteria:

- Problematic values for `loan_status`
    1. `<NA>`
    2. `Does not meet the credit policy. Status:Fully Paid`
    3. `Does not meet the credit policy. Status:Charged Off`
- Values of `issue_d` before 2012

In [None]:
filtered_loan_data = filtered_loan_data.drop(
    ["url", "title", "desc", "policy_code"], axis="columns"
)
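
The row and column filters listed above were applied step by step during the exploration. For reference, they could be gathered into a single function along the following lines. The function name `filter_accepted_loans` is hypothetical and not part of `notebook_tools`, and the sketch assumes that `issue_d` is a `datetime64` column.

```python
import pandas as pd


def filter_accepted_loans(loans: pd.DataFrame) -> pd.DataFrame:
    """Apply the row and column filters described above (illustrative only)."""
    status = loans["loan_status"]
    keep = (
        status.notna()  # drop rows with a missing loan status
        & ~status.str.startswith("Does not meet", na=False)  # drop unclear statuses
        & (loans["issue_d"] >= pd.Timestamp("2012-01-01"))  # drop pre-2012 loans
    )
    dropped_columns = ["url", "title", "desc", "policy_code"]
    # errors="ignore" lets the sketch run on frames that lack some of the columns.
    return loans[keep].drop(columns=dropped_columns, errors="ignore")


# Toy data with one row per filtering case (values are made up).
toy = pd.DataFrame(
    {
        "loan_status": pd.array(
            [
                "Fully Paid",
                pd.NA,
                "Does not meet the credit policy. Status:Fully Paid",
                "Current",
                "Charged Off",
            ],
            dtype="string",
        ),
        "issue_d": pd.to_datetime(
            ["2014-03-01", None, "2010-05-01", "2011-12-01", "2012-01-01"]
        ),
        "policy_code": [1.0, None, 1.0, 1.0, 1.0],
        "url": ["u1", None, "u3", "u4", "u5"],
    }
)

result = filter_accepted_loans(toy)
print(result)
```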

## Update the database

The information available on rejected loans is fairly limited, so for now we'll limit
our attention to the data on accepted loans.

Columns and rows that have been filtered out in this notebook will not be included in
the database.

In [None]:
loan_metadata = clean.load_acc_loan_metadata()
# The metadata on accepted loans has been manually updated with columns to support
# feature classification.  These columns will be used in a later notebook, but for now
# they are excluded from the database.
loan_metadata = loan_metadata[["data type", "description"]]
# Columns that have been excluded from the loan data are filtered from the index of
# metadata.
bool_index = loan_metadata.index.isin(filtered_loan_data.columns)
loan_metadata = loan_metadata[bool_index]
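
The index-based filtering of the metadata can be illustrated with a toy table. The index name `feature` and the placeholder descriptions are assumptions; the real metadata is loaded by `clean.load_acc_loan_metadata()`.

```python
import pandas as pd

# Toy metadata indexed by feature name (descriptions are placeholders).
loan_metadata = pd.DataFrame(
    {
        "data type": ["string", "float", "string", "float"],
        "description": ["placeholder", "placeholder", "placeholder", "placeholder"],
    },
    index=pd.Index(["id", "loan_amnt", "url", "policy_code"], name="feature"),
)

# Columns still present in the (toy) filtered loan data.
remaining_columns = ["id", "loan_amnt"]

# Keep only the metadata rows whose index label is a remaining column.
kept = loan_metadata[loan_metadata.index.isin(remaining_columns)]

# reset_index() turns the feature names into an ordinary column, which is
# convenient when the table is written to a database.
flat = kept.reset_index()
print(flat)
```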

In [None]:
tables = {
    "loan_data": filtered_loan_data,
    "loan_metadata": loan_metadata.reset_index(),
}

In [None]:
db.create_database(tables)
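
The function `db.create_database` is part of `notebook_tools`, and its implementation is not shown here. As a minimal sketch of what writing such a dictionary of tables to SQLite could look like, the following uses `sqlite3` and `DataFrame.to_sql` with an in-memory database and made-up tables. The real function may differ, for example in the database path or in how existing tables are handled.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Toy tables standing in for the filtered loan data and its metadata.
tables = {
    "loan_data": pd.DataFrame({"id": ["1", "2"], "loan_amnt": [5000.0, 12000.0]}),
    "loan_metadata": pd.DataFrame(
        {"feature": ["id", "loan_amnt"], "description": ["placeholder", "placeholder"]}
    ),
}

# closing() makes sure the connection is closed when the block exits.
with closing(sqlite3.connect(":memory:")) as conn:
    for name, frame in tables.items():
        # if_exists="replace" recreates the table on every run.
        frame.to_sql(name, conn, if_exists="replace", index=False)

    # Read the row counts back as a quick sanity check.
    row_counts = {
        name: int(
            pd.read_sql_query(f'SELECT COUNT(*) AS n FROM "{name}"', conn)["n"].iloc[0]
        )
        for name in tables
    }

print(row_counts)
```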