# Classify features

- [Prioritize prevention of data leakage](#Prioritize-prevention-of-data-leakage)
- [Explore features](#Explore-features)
- [Display features by group](#Display-features-by-group)
- [Create a SQLite database](#Create-a-SQLite-database)

I manually added two columns to the table of metadata for accepted loans.

- `category` labels the features based on the type of information they contain, e.g.,
information about the borrower or details about the loan.
- `known at loan origination` indicates whether the feature can be used in predictive
models that require information available at the time of loan origination.

This notebook does the following:

- Explore features for which the classification is initially unclear.
- Group all features by the added columns `category` and `known at loan origination`,
displaying the features in each group.
- Create a SQLite database containing the loan data and the metadata.

In [None]:
from IPython.display import display

from notebook_tools.data_cleaning import (
    convert_acc_loan_data,
    filter_acc_loan_data,
    load_acc_loan_data,
    load_acc_loan_metadata,
)
from notebook_tools.database import create_database
from notebook_tools.feature_exploration import (
    get_value_counts,
    style_loan_summary,
    style_value_counts,
)

In [None]:
loan_data = load_acc_loan_data().pipe(convert_acc_loan_data).pipe(filter_acc_loan_data)

In [None]:
loan_metadata = load_acc_loan_metadata(cleaned_data=loan_data)

In [None]:
feature_categories = loan_metadata["category"].unique()
list(feature_categories)

## Prioritize prevention of data leakage

For many of the features that characterize the borrower's credit history, it is
difficult to know whether the information would have been available at loan origination.

The cause of the ambiguity is that LendingClub continued to pull credit reports for
borrowers after these loans were originated.  (In particular, the feature
`last_credit_pull_d` gives the most recent date when a credit report was pulled for a
loan.)

### Example: classification is straightforward

Consider the descriptions of the following two features:
- `fico_range_high`:  The upper boundary range the borrower’s FICO at loan origination
belongs to.
- `last_fico_range_high`:  The upper boundary range the borrower’s last FICO pulled
belongs to.

Clearly `fico_range_high` was known at loan origination, but `last_fico_range_high` may
have come from a more recent credit report.

### Example: classification is ambiguous

For many other features associated with the borrower's credit report, however, the
description gives no hint about when the credit report was obtained.

For example:
- `bc_open_to_buy`:  Total open to buy on revolving bankcards.

Is this the total open credit for revolving bankcards _at the time of loan origination_?
Or is it the total from a credit report that was pulled more recently?

### Prevention of data leakage

Since data leakage would invalidate predictive models developed for this project, I have
assumed that for ambiguous cases such as `bc_open_to_buy`, the data was not known at
the time of loan origination.
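The conservative rule above can be sketched as a column filter. This is a minimal illustration rather than the project's actual modeling code: the toy tables, their values, and the feature `loan_amnt` are made up, and the sketch assumes the metadata table is indexed by feature name.

```python
import pandas as pd

# Toy stand-in for the metadata table (assumed: one row per feature,
# indexed by feature name). The labels here are illustrative only.
toy_metadata = pd.DataFrame(
    {
        "category": ["borrower", "borrower", "borrower", "loan"],
        "known at loan origination": ["Y", "N", "N", "Y"],
    },
    index=pd.Index(
        ["fico_range_high", "last_fico_range_high", "bc_open_to_buy", "loan_amnt"],
        name="feature",
    ),
)

# Toy stand-in for the loan data, with made-up values.
toy_loans = pd.DataFrame(
    {
        "fico_range_high": [684, 719],
        "last_fico_range_high": [650, 730],
        "bc_open_to_buy": [1200.0, 5400.0],
        "loan_amnt": [10000, 3500],
    }
)

# Keep only the features flagged "Y". Ambiguous features such as
# `bc_open_to_buy` are flagged "N", so they are excluded together with
# features that clearly postdate origination.
is_known = toy_metadata["known at loan origination"] == "Y"
safe_features = toy_metadata.index[is_known].tolist()
model_inputs = toy_loans[safe_features]

print(safe_features)
```

Treating ambiguous features as unknown costs some potentially useful predictors, but it guarantees that nothing from a later credit pull reaches the model inputs.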

## Explore features

Explore features for which the classification is initially unclear.

### `pymnt_plan`

What sort of payment plan is associated with the feature `pymnt_plan`?

In [None]:
pymnt_plan_counts = get_value_counts(loan_data["pymnt_plan"])
display(style_value_counts(pymnt_plan_counts))

The number of `True` values is relatively small, so `pymnt_plan` may be associated with
a hardship plan or settlement plan.

Select the rows for which `pymnt_plan` is `True` and check whether the characteristic
features for a hardship plan or a settlement plan are non-null.

In [None]:
bool_index = loan_data["pymnt_plan"]
loans_with_payment_plan = loan_data[bool_index]

hardship_status_counts = get_value_counts(loans_with_payment_plan["hardship_status"])
display(style_value_counts(hardship_status_counts))

settlement_status_counts = get_value_counts(
    loans_with_payment_plan["settlement_status"]
)
display(style_value_counts(settlement_status_counts))

All loans with `pymnt_plan` equal to `True` have `hardship_status` as `ACTIVE` and
`settlement_status` as `<NA>`.

For comparison, display the distribution of values for `hardship_status` across the full
set of accepted loans.

In [None]:
hardship_status_counts = get_value_counts(loan_data["hardship_status"])
display(style_value_counts(hardship_status_counts))

Conclusion:  `pymnt_plan` gives information about a hardship plan associated with the loan.
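The same check can be done in one step with a cross-tabulation. The sketch below uses made-up toy data, not the LendingClub table; the status value `COMPLETED` is included only for illustration, and missing statuses are relabeled `<NA>` so that they appear as their own column.

```python
import pandas as pd

# Made-up toy data standing in for `loan_data`.
toy_loans = pd.DataFrame(
    {
        "pymnt_plan": [True, True, False, False, False],
        "hardship_status": ["ACTIVE", "ACTIVE", "COMPLETED", None, None],
    }
)

# Relabel missing statuses so they are counted instead of dropped.
status = toy_loans["hardship_status"].fillna("<NA>")

# Rows: pymnt_plan values; columns: hardship_status values.
table = pd.crosstab(toy_loans["pymnt_plan"], status)
print(table)

# In this toy data, every loan with a payment plan has an active hardship plan.
with_plan = table.loc[True]
all_active = with_plan["ACTIVE"] == with_plan.sum()
```

A cross-tabulation also shows the reverse direction (for example, whether any loan with an `ACTIVE` hardship plan has `pymnt_plan` equal to `False`), which the row selection alone does not reveal.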

## Display features by group

### `borrower` features available at loan origination

In [None]:
bool_index = (loan_metadata["category"] == "borrower") & (
    loan_metadata["known at loan origination"] == "Y"
)
feature_count = len(loan_metadata[bool_index])
print(
    f"\n\n{feature_count} features available at loan origination "
    "characterize the borrower:\n\n"
)
display(style_loan_summary(loan_metadata.loc[bool_index, ["data type", "description"]]))

### `borrower` features not available at loan origination

In [None]:
bool_index = (loan_metadata["category"] == "borrower") & (
    loan_metadata["known at loan origination"] == "N"
)
feature_count = len(loan_metadata[bool_index])
print(
    f"\n\n{feature_count} features not available at loan origination "
    "characterize the borrower:\n\n"
)
display(style_loan_summary(loan_metadata.loc[bool_index, ["data type", "description"]]))

### `co_borrower` features available at loan origination

For most features associated with the co-borrowers' credit history, the description
indicates that the information is known at the time of loan application. Also, none of
the `co_borrower` feature descriptions suggest that credit reports for the co-borrowers
continue to be pulled during the lifetime of the loan. So I assume that all features
characterizing the co-borrowers' credit history are known at loan origination.

Example: the co-borrower feature `sec_app_fico_range_low`, like the borrower feature
`fico_range_low`, is considered to be available at loan origination.

In [None]:
bool_index = (loan_metadata["category"] == "co_borrower") & (
    loan_metadata["known at loan origination"] == "Y"
)
feature_count = len(loan_metadata[bool_index])
print(
    f"\n\n{feature_count} features available at loan origination "
    "characterize the co-borrowers:\n\n"
)
display(style_loan_summary(loan_metadata.loc[bool_index, ["data type", "description"]]))

### `loan` features available at loan origination

In [None]:
bool_index = (loan_metadata["category"] == "loan") & (
    loan_metadata["known at loan origination"] == "Y"
)
feature_count = len(loan_metadata[bool_index])
print(
    f"\n\n{feature_count} features available at loan origination "
    "characterize the loan:\n\n"
)
display(style_loan_summary(loan_metadata.loc[bool_index, ["data type", "description"]]))

### `loan` features not available at loan origination

In [None]:
bool_index = (loan_metadata["category"] == "loan") & (
    loan_metadata["known at loan origination"] == "N"
)
feature_count = len(loan_metadata[bool_index])
print(
    f"\n\n{feature_count} features not available at loan origination "
    "characterize the loan:\n\n"
)
display(style_loan_summary(loan_metadata.loc[bool_index, ["data type", "description"]]))

### `hardship_plan` features not available at loan origination

In [None]:
bool_index = (loan_metadata["category"] == "hardship_plan") & (
    loan_metadata["known at loan origination"] == "N"
)
feature_count = len(loan_metadata[bool_index])
print(
    f"\n\n{feature_count} features not available at loan origination "
    "characterize the hardship plan (in cases where one was created):\n\n"
)
display(style_loan_summary(loan_metadata.loc[bool_index, ["data type", "description"]]))

### `settlement_plan` features not available at loan origination

In [None]:
bool_index = (loan_metadata["category"] == "settlement_plan") & (
    loan_metadata["known at loan origination"] == "N"
)
feature_count = len(loan_metadata[bool_index])
print(
    f"\n\n{feature_count} features not available at loan origination "
    "characterize the settlement plan (in cases where one was created):\n\n"
)
display(style_loan_summary(loan_metadata.loc[bool_index, ["data type", "description"]]))

### `charge_off` features not available at loan origination

In [None]:
bool_index = (loan_metadata["category"] == "charge_off") & (
    loan_metadata["known at loan origination"] == "N"
)
feature_count = len(loan_metadata[bool_index])
print(
    f"\n\n{feature_count} features not available at loan origination "
    "characterize the charge-off (in cases where a charge-off occurred):\n\n"
)
display(style_loan_summary(loan_metadata.loc[bool_index, ["data type", "description"]]))
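The cells in this section repeat one pattern for each combination of `category` and `known at loan origination`. As a hedged alternative, the same grouping could be written as a single loop over `groupby`. The sketch uses a toy metadata table whose category assignments are illustrative assumptions, and it prints plain counts instead of calling `style_loan_summary`.

```python
import pandas as pd

# Toy metadata; the category assignments are assumptions for illustration.
toy_metadata = pd.DataFrame(
    {
        "category": [
            "borrower",
            "borrower",
            "co_borrower",
            "hardship_plan",
            "settlement_plan",
        ],
        "known at loan origination": ["Y", "N", "Y", "N", "N"],
    },
    index=pd.Index(
        [
            "fico_range_high",
            "bc_open_to_buy",
            "sec_app_fico_range_low",
            "hardship_status",
            "settlement_status",
        ],
        name="feature",
    ),
)

# Iterate over every (category, availability) pair that occurs in the table.
group_sizes = {}
for (category, known), group in toy_metadata.groupby(
    ["category", "known at loan origination"]
):
    availability = "available" if known == "Y" else "not available"
    print(f"{len(group)} `{category}` features {availability} at loan origination")
    group_sizes[(category, known)] = len(group)
```

The explicit cells used in this notebook have the advantage that each group gets its own heading and tailored message; the loop is shorter but less readable as a report.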

## Create a SQLite database

In [None]:
tables = {
    "loan_data": loan_data,
    "loan_metadata": loan_metadata.reset_index(),
}

In [None]:
create_database(tables)
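The implementation of `create_database` lives in `notebook_tools` and is not shown here. As a rough sketch of what writing a dictionary of DataFrames to SQLite can look like, the example below uses `sqlite3` and `DataFrame.to_sql` with an in-memory database and made-up toy tables. It is an assumption about the general approach, not the helper's actual code.

```python
import sqlite3

import pandas as pd

# Made-up toy tables standing in for `loan_data` and `loan_metadata`.
toy_metadata = pd.DataFrame(
    {"category": ["loan", "borrower"], "known at loan origination": ["Y", "Y"]},
    index=pd.Index(["loan_amnt", "fico_range_high"], name="feature"),
)
toy_tables = {
    "loan_data": pd.DataFrame(
        {"loan_amnt": [10000, 3500], "fico_range_high": [684, 719]}
    ),
    # reset_index() turns the feature names into an ordinary column,
    # so they are stored in the table.
    "loan_metadata": toy_metadata.reset_index(),
}

# An in-memory database keeps the sketch free of side effects on disk.
connection = sqlite3.connect(":memory:")
for table_name, frame in toy_tables.items():
    frame.to_sql(table_name, connection, if_exists="replace", index=False)

# Verify what was written.
table_names = [
    row[0]
    for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
row_count = connection.execute("SELECT COUNT(*) FROM loan_data").fetchone()[0]
connection.close()

print(table_names, row_count)
```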