# Define data types, reformat data, and create the database
- [Load the data](#Load-the-data)
- [Summarize / describe the data](#Summarize-/-describe-the-data)
- [Reformat data](#Reformat-data)
- [Create the database](#Create-the-database)

The LendingClub [dataset](https://www.kaggle.com/datasets/wordsforthewise/lending-club)
includes data on accepted and rejected loans from 2007 through 2018 Q4.

This notebook does the following:
- After preliminary exploration of the data, define a data type for each feature.
- Convert / reformat some columns to facilitate model development.
- Create a SQLite database.

In [None]:
import sqlite3
from calendar import month_name
from pathlib import Path

import numpy as np
import pandas as pd
from IPython.display import display

## Load the data

I manually chose a pandas `dtype` for each column of data on accepted and rejected
loans.
- For the table of accepted loans, which has 151 columns, I found and downloaded a table
  of [column
  descriptions](https://www.kaggle.com/datasets/ethon0426/lending-club-20072020q1?select=LCDataDictionary.xlsx).
  Taking account of the column descriptions as well as characteristic values displayed
  by `pandas.DataFrame.describe`, I chose a `dtype` for each column.  To store the
  `dtype` choices, I added a column named 'data type' to the downloaded table of column
  descriptions. The values in this column are string aliases for pandas dtypes.
- For the table of rejected loans, which has 9 columns, I inspected the table and then
  created a dictionary mapping column names to `dtype` aliases.
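
As a minimal sketch of how `dtype` aliases stored in a table can drive `pandas.read_csv`, the example below uses two small in-memory CSVs as hypothetical stand-ins for the annotated data dictionary and the loan data. The rows and descriptions are invented for illustration; only the column headings 'column name', 'description', and 'data type' follow this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for the annotated data dictionary (invented rows).
dictionary_csv = io.StringIO(
    "column name,description,data type\n"
    "loan_amnt,Toy description of the loan amount,Float64\n"
    "term,Toy description of the loan term,string\n"
    "delinq_2yrs,Toy description of a delinquency count,Int64\n"
)
col_desc = pd.read_csv(dictionary_csv).set_index("column name")

# Map each column name to the string alias of its pandas dtype.
toy_dtypes = col_desc["data type"].to_dict()

# Hypothetical loan data with a missing integer value in the second row.
loans_csv = io.StringIO(
    "loan_amnt,term,delinq_2yrs\n"
    "5000, 36 months,0\n"
    "12000, 60 months,\n"
)
toy_loans = pd.read_csv(loans_csv, dtype=toy_dtypes)
print(toy_loans.dtypes)
```

The nullable `Int64` dtype keeps the column integer-valued even though one entry is missing, which is the main reason to prefer these aliases over the default NumPy dtypes here.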

In [None]:
data_folder = Path("../data/")
acc_loans_path = data_folder / "accepted_2007_to_2018Q4.csv"
rej_loans_path = data_folder / "rejected_2007_to_2018Q4.csv"
acc_loans_col_desc_path = data_folder / "LCDataDictionaryWithDtypes.csv"

Record the number of columns in each of the tables of loan data, in order to check that
no columns are accidentally excluded from the database.

Note that the column named 'member_id' in the data on accepted loans is empty,
so we exclude it in loading the data.

In [None]:
acc_loans_columns = pd.read_csv(
    acc_loans_path, nrows=0, usecols=lambda col_name: col_name != "member_id"
).columns
ncols_acc_loans = len(acc_loans_columns)
ncols_acc_loans
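
One way to confirm that a column such as 'member_id' is entirely empty before excluding it is sketched below. The CSV is a hypothetical three-column stand-in, not the real file, and the names `toy_csv`, `empty_columns`, and `kept_columns` are illustrative.

```python
import io

import pandas as pd

# Hypothetical stand-in for the accepted-loans file; 'member_id' has no values.
toy_csv = "id,member_id,loan_amnt\n1,,5000\n2,,12000\n"

# Find the columns in which every value is missing.
full = pd.read_csv(io.StringIO(toy_csv))
empty_columns = full.columns[full.isna().all()].tolist()

# Read only the header, skipping the empty columns, to count what remains.
header = pd.read_csv(
    io.StringIO(toy_csv),
    nrows=0,
    usecols=lambda col_name: col_name not in empty_columns,
)
kept_columns = header.columns.tolist()
print(empty_columns, kept_columns)
```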

In [None]:
rej_loans_columns = pd.read_csv(rej_loans_path, nrows=0).columns
ncols_rej_loans = len(rej_loans_columns)
ncols_rej_loans

In [None]:
acc_loans_col_desc = pd.read_csv(acc_loans_col_desc_path)
acc_loans_col_desc = acc_loans_col_desc.set_index("column name")

In [None]:
acc_loans_col_desc.head()

In [None]:
acc_loans_dtypes = acc_loans_col_desc["data type"].to_dict()

In [None]:
rej_loans_dtypes = {
    "Amount Requested": "Float64",
    "Application Date": "string",
    "Loan Title": "string",
    "Risk_Score": "Float64",
    "Debt-To-Income Ratio": "string",
    "Zip Code": "string",
    "State": "string",
    "Employment Length": "string",
    "Policy Code": "string",
}

In [None]:
acc_loans = pd.read_csv(
    acc_loans_path,
    dtype=acc_loans_dtypes,
    usecols=lambda col_name: col_name != "member_id",
)

In [None]:
rej_loans = pd.read_csv(rej_loans_path, dtype=rej_loans_dtypes)

## Summarize / describe the data

Verify that the `dtype` of all columns corresponds to one of the aliases that I
assigned: 'string', 'Int64', 'Float64'.

In [None]:
acc_loans.dtypes.unique()

In [None]:
rej_loans.dtypes.unique()
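
The visual check above could also be written as an explicit comparison. The sketch below uses a hypothetical toy dataframe in which one column was deliberately left with a default NumPy dtype, so that the check has something to report.

```python
import numpy as np
import pandas as pd

expected_aliases = {"string", "Int64", "Float64"}

# Hypothetical dataframe: 'plain_int' was not assigned a nullable dtype.
toy = pd.DataFrame(
    {
        "grade": pd.array(["A", "B"], dtype="string"),
        "loan_amnt": pd.array([5000.0, None], dtype="Float64"),
        "delinq_2yrs": pd.array([0, None], dtype="Int64"),
        "plain_int": np.array([1, 2], dtype="int64"),
    }
)

# Compare the string form of each column's dtype with the expected aliases.
found_aliases = set(toy.dtypes.map(str))
unexpected = found_aliases - expected_aliases
print(unexpected)
```

An empty `unexpected` set would indicate that every column received one of the assigned aliases.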

In summarizing the data, generate separate summary dataframes for numeric columns and
string columns.

In [None]:
def summarize_loan_data(df, dtype):
    summary = df.describe(include=dtype).transpose()
    dtypes_df = df.dtypes.to_frame(name="data type")
    summary = summary.join(dtypes_df)
    summary["count"] = summary["count"].astype("int")
    return summary

In [None]:
def summarize_acc_loans(df, dtype):
    summary = summarize_loan_data(df, dtype)
    return summary.join(acc_loans_col_desc["description"])

In [None]:
def style_loan_summary(df):
    styler = (
        df.style.set_properties(**{"text-align": "center"})
        .map_index(lambda _heading: "text-align: center;", axis="rows")
        .map_index(lambda _heading: "text-align: center;", axis="columns")
    )
    if "description" in df.columns:
        styler = styler.set_properties(
            subset="description", **{"text-align": "left", "white-space": "normal"}
        )
    if "std" in df.columns:
        styler = styler.format(precision=1, thousands=",", decimal=".").format(
            precision=1, subset="std"
        )
    return styler

In [None]:
acc_loans_summary_numeric = summarize_acc_loans(acc_loans, np.number)
display(style_loan_summary(acc_loans_summary_numeric))

In [None]:
acc_loans_summary_string = summarize_acc_loans(acc_loans, "string")
display(style_loan_summary(acc_loans_summary_string))

Summarize the data on rejected loans in the same way.

In [None]:
rej_loans_summary_numeric = summarize_loan_data(rej_loans, np.number)
display(style_loan_summary(rej_loans_summary_numeric))

In [None]:
rej_loans_summary_string = summarize_loan_data(rej_loans, "string")
display(style_loan_summary(rej_loans_summary_string))

For both the accepted loans and the rejected loans, verify that the combined number of
rows in the two summary dataframes equals the number of columns initially loaded.

In [None]:
assert ncols_acc_loans == len(acc_loans_summary_numeric) + len(acc_loans_summary_string)
assert ncols_rej_loans == len(rej_loans_summary_numeric) + len(rej_loans_summary_string)

## Reformat data

Several of the columns of type 'string', such as those containing dates, need data conversion.
I'll use a series of ad hoc commands for the conversion.

Convert elements of the 'term' column for accepted loans from strings (e.g., '36 months') to integers (e.g., 36).

In [None]:
acc_loans["term"] = (
    acc_loans["term"].str.replace("months", "").str.strip().astype("Int64")
)

In [None]:
acc_loans["term"].unique()

Convert date strings to ISO format (e.g., convert 'Jan-2015' to '2015-01').

In [None]:
# The array of capitalized month names in chronological order provided by the calendar
# module has the empty string as the first element, which is discarded.
ordered_months = list(month_name)[1:]

# Dictionary used in converting abbreviated month names to month numbers in ISO format.
iso_month_labels = {
    month.lower()[:3]: format(index + 1, "02")
    for index, month in enumerate(ordered_months)
}


def get_iso_date_string(element):
    month, year = element.lower().split("-")
    return year + "-" + iso_month_labels[month]

In [None]:
iso_month_labels

In [None]:
date_columns = [
    "issue_d",
    "earliest_cr_line",
    "last_pymnt_d",
    "next_pymnt_d",
    "last_credit_pull_d",
    "sec_app_earliest_cr_line",
    "hardship_start_date",
    "hardship_end_date",
    "payment_plan_start_date",
    "debt_settlement_flag_date",
    "settlement_date",
]

for col_name in date_columns:
    acc_loans[col_name] = (
        acc_loans[col_name]
        .map(get_iso_date_string, na_action="ignore")
        .astype("string")
    )
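
As an alternative to the explicit month dictionary, the same conversion can be sketched with `pandas.to_datetime`. This is not the method used in this notebook; it assumes that the month abbreviations are in English and that the active locale parses `%b` accordingly. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample of date strings, including a missing value.
dates = pd.Series(["Jan-2015", pd.NA, "Dec-2007"], dtype="string")

# Parse 'Mon-YYYY' strings, then format as 'YYYY-MM'; missing values stay missing.
iso_dates = (
    pd.to_datetime(dates, format="%b-%Y").dt.strftime("%Y-%m").astype("string")
)
print(iso_dates.tolist())
```

The dictionary-based approach above has the advantage of not depending on locale settings.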

Convert yes/no values given as strings ('Y' / 'N') to booleans.

In [None]:
boolean_columns = ["pymnt_plan", "hardship_flag", "debt_settlement_flag"]

In [None]:
mapper = {"N": False, "Y": True}
for col_name in boolean_columns:
    acc_loans[col_name] = (
        acc_loans[col_name]
        .str.upper()
        .map(mapper, na_action="ignore")
        .astype("boolean")
    )
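
Values that are not keys of the mapper become missing after `map`, which could silently hide unexpected entries. The sketch below shows one way to detect them; the sample values, including 'maybe', are hypothetical.

```python
import pandas as pd

# Hypothetical flag column with a missing value and an unexpected entry.
flags = pd.Series(["y", "N", pd.NA, "maybe"], dtype="string")
mapper = {"N": False, "Y": True}

upper = flags.str.upper()
converted = upper.map(mapper, na_action="ignore").astype("boolean")

# Entries present in the input but missing after conversion were not in the mapper.
unmapped = upper[upper.notna() & converted.isna()].unique().tolist()
print(unmapped)
```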

Convert strings representing percentages to floats.

In [None]:
rej_loans["Debt-To-Income Ratio"] = (
    rej_loans["Debt-To-Income Ratio"].str.replace("%", "").astype("Float64")
)

In [None]:
rej_loans["Debt-To-Income Ratio"].unique()

After doing the data conversions, recreate the summary tables.

In [None]:
acc_loans_summary_numeric = summarize_acc_loans(acc_loans, np.number)
display(style_loan_summary(acc_loans_summary_numeric))

In [None]:
acc_loans_summary_string = summarize_acc_loans(acc_loans, "string")
display(style_loan_summary(acc_loans_summary_string))

In [None]:
acc_loans_summary_boolean = summarize_acc_loans(acc_loans, "boolean")
display(style_loan_summary(acc_loans_summary_boolean))

In [None]:
rej_loans_summary_numeric = summarize_loan_data(rej_loans, np.number)
display(style_loan_summary(rej_loans_summary_numeric))

In [None]:
rej_loans_summary_string = summarize_loan_data(rej_loans, "string")
display(style_loan_summary(rej_loans_summary_string))

## Create the database

In [None]:
db_path = data_folder / "lending-club.sqlite"

In [None]:
def create_acc_loans_metadata():
    dtypes_df = acc_loans.dtypes.map(str).to_frame(name="data type")
    metadata_df = dtypes_df.join(acc_loans_col_desc[["description"]])
    metadata_df.index.rename("column name", inplace=True)
    return metadata_df


acc_loans_metadata = create_acc_loans_metadata()

In [None]:
acc_loans_metadata

In [None]:
def create_rej_loans_metadata():
    metadata_df = rej_loans.dtypes.map(str).to_frame(name="data type")
    metadata_df.index.rename("column name", inplace=True)
    return metadata_df


rej_loans_metadata = create_rej_loans_metadata()

In [None]:
rej_loans_metadata

In [None]:
def create_database():
    if db_path.exists():
        response = input(
            f"The database {db_path} already exists.  "
            "Do you wish to replace it (yes/no)? "
        )
        if response.strip().lower() == "yes":
            db_path.unlink()
        else:
            print("\nReturning.\n")
            return
    add_tables()


def add_tables():
    db_conn = sqlite3.connect(db_path)
    try:
        # The connection's context manager commits on success and rolls back on
        # error, but it does not close the connection.
        with db_conn:
            # It's not clear how useful the data on rejected loans will be, so
            # initially just store data on accepted loans in the database.
            acc_loans_metadata.to_sql("loan_metadata", con=db_conn, index=True)
            acc_loans.to_sql("loan_data", con=db_conn, index=False)
    finally:
        db_conn.close()

In [None]:
create_database()
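
After the database is written, a quick read-back can confirm that the tables exist and that missing values were stored as NULL. The sketch below uses an in-memory database and a hypothetical three-row dataframe rather than the real loan data; only the table names 'loan_metadata' and 'loan_data' follow this notebook.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical stand-in for the accepted-loans dataframe.
toy_loans = pd.DataFrame(
    {
        "loan_amnt": pd.array([5000.0, 12000.0, None], dtype="Float64"),
        "term": pd.array([36, 60, 36], dtype="Int64"),
        "issue_d": pd.array(["2015-01", "2016-07", None], dtype="string"),
    }
)
toy_metadata = toy_loans.dtypes.map(str).to_frame(name="data type")
toy_metadata.index.rename("column name", inplace=True)

with closing(sqlite3.connect(":memory:")) as conn:
    with conn:  # Commit both tables in a single transaction.
        toy_metadata.to_sql("loan_metadata", con=conn, index=True)
        toy_loans.to_sql("loan_data", con=conn, index=False)

    # List the tables and count rows and NULLs directly in SQL.
    table_names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    n_rows = conn.execute("SELECT COUNT(*) FROM loan_data").fetchone()[0]
    n_missing = conn.execute(
        "SELECT COUNT(*) FROM loan_data WHERE issue_d IS NULL"
    ).fetchone()[0]

print(table_names, n_rows, n_missing)
```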