To get familiar with the data, I will copy and explore the contents of a starter notebook for the project, published by Daniel Herman: https://www.kaggle.com/code/jetakow/home-credit-2024-starter-notebook

In [13]:
import sys
print(sys.path)

['/Users/dustinhayes/Desktop/GitHub/stable-credit-risk-modeling', '/Library/Frameworks/Python.framework/Versions/3.10/lib/python310.zip', '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10', '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/lib-dynload', '', '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages']


In [1]:
import polars as pl
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score 

dataPath = "/kaggle/input/home-credit-credit-risk-model-stability/"

In [2]:
def set_table_dtypes(df: pl.DataFrame) -> pl.DataFrame:
    """
    This function accepts a dataframe and
    sets datatypes based on a naming convention.

    Setting datatypes manually can promote memory efficiency
    and improve speed.
    """
    for col in df.columns:
        # last letter of column name will help you determine the type
        if col[-1] in ("P", "A"):
            df = df.with_columns(pl.col(col).cast(pl.Float64).alias(col))

    return df

def convert_strings(df: pd.DataFrame) -> pd.DataFrame:
    """
    This function accepts a dataframe and
    sets 'string' and 'object' dtypes to 'category',
    which is more memory efficient.

    Setting datatypes manually can promote memory efficiency
    and improve speed. 
    """
    for col in df.columns:  
        if df[col].dtype.name in ['object', 'string']:
            df[col] = df[col].astype("string").astype('category')
            current_categories = df[col].cat.categories
            new_categories = current_categories.to_list() + ["Unknown"]
            new_dtype = pd.CategoricalDtype(categories=new_categories, ordered=True)
            df[col] = df[col].astype(new_dtype)
    return df
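
As a quick sanity check of what convert_strings does, here is a minimal sketch on a made-up pandas frame. The column names and values are hypothetical, and the function is repeated only so the cell runs on its own. My reading is that the extra "Unknown" category gives unseen values somewhere to go later, but the starter notebook does not say so explicitly.

```python
import pandas as pd


def convert_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Copy of the helper defined above, repeated so this cell runs on its own."""
    for col in df.columns:
        if df[col].dtype.name in ["object", "string"]:
            df[col] = df[col].astype("string").astype("category")
            new_categories = df[col].cat.categories.to_list() + ["Unknown"]
            new_dtype = pd.CategoricalDtype(categories=new_categories, ordered=True)
            df[col] = df[col].astype(new_dtype)
    return df


# Hypothetical toy frame; the column names are made up for illustration.
toy = pd.DataFrame(
    {
        "city": pd.Series(["Brno", "Praha", None], dtype="object"),
        "amount": [100.0, 250.5, 75.0],
    }
)
toy = convert_strings(toy)

# The string column becomes an ordered categorical with "Unknown" appended;
# the numeric column is left untouched.
print(toy.dtypes)
print(toy["city"].cat.categories.to_list())
```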

The next cell brings in the data required for training. The data is split into multiple tables. A base table contains the case ID, which uniquely identifies each case. This case ID is the key used to join the base table to the feature tables, which contain the features used for training.

As each dataframe is loaded, "pipe" is called to convert the data types of each table using the previously defined function "set_table_dtypes".

Question: There are many feature tables. Here we are only loading a small subset: static_0_0, static_0_1, static_cb_0, person_1 and credit_bureau_b_2. There are more than ten "credit_bureau" tables alone. How were these tables selected?

Answer: This question was asked in the comments. The (paraphrased) responses were:
1. Some data is not relevant. In the worst case, including too much data leads to overfitting.
2. This is only one way of doing it. Other approaches are valid.

The question remains: How was it decided that this particular subset of the data might be an effective choice? Further investigation is required.
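
As a first step in that investigation, it may help to see which tables are available and which of them this notebook actually loads. The sketch below is only a starting point: `list_csv_tables` is a hypothetical helper of mine, and it assumes the same directory layout as the read_csv calls in this notebook.

```python
from pathlib import Path


def list_csv_tables(train_dir: str) -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for every CSV in train_dir, sorted by name."""
    return sorted((p.name, p.stat().st_size) for p in Path(train_dir).glob("*.csv"))


# Assumes the directory layout used by the read_csv calls in this notebook.
data_path = "/kaggle/input/home-credit-credit-risk-model-stability/"

# Files read by the training cell of this notebook.
loaded = {
    "train_base.csv",
    "train_static_0_0.csv",
    "train_static_0_1.csv",
    "train_static_cb_0.csv",
    "train_person_1.csv",
    "train_credit_bureau_b_2.csv",
}

for name, size in list_csv_tables(data_path + "csv_files/train"):
    status = "loaded" if name in loaded else "not loaded"
    print(f"{name:45s} {size / 1e6:10.1f} MB  {status}")
```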

In [None]:
train_basetable = pl.read_csv(dataPath + "csv_files/train/train_base.csv")
train_static = pl.concat(
    [
        pl.read_csv(dataPath + "csv_files/train/train_static_0_0.csv").pipe(set_table_dtypes),
        pl.read_csv(dataPath + "csv_files/train/train_static_0_1.csv").pipe(set_table_dtypes),
    ],
    how="vertical_relaxed",
)
train_static_cb = pl.read_csv(dataPath + "csv_files/train/train_static_cb_0.csv").pipe(set_table_dtypes)
train_person_1 = pl.read_csv(dataPath + "csv_files/train/train_person_1.csv").pipe(set_table_dtypes) 
train_credit_bureau_b_2 = pl.read_csv(dataPath + "csv_files/train/train_credit_bureau_b_2.csv").pipe(set_table_dtypes)

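The following cell performs the same operations as the previous cell, but on our test data. Note that the test static table is split across three files (static_0_0, static_0_1 and static_0_2) rather than two.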

In [None]:
test_basetable = pl.read_csv(dataPath + "csv_files/test/test_base.csv")
test_static = pl.concat(
    [
        pl.read_csv(dataPath + "csv_files/test/test_static_0_0.csv").pipe(set_table_dtypes),
        pl.read_csv(dataPath + "csv_files/test/test_static_0_1.csv").pipe(set_table_dtypes),
        pl.read_csv(dataPath + "csv_files/test/test_static_0_2.csv").pipe(set_table_dtypes),
    ],
    how="vertical_relaxed",
)
test_static_cb = pl.read_csv(dataPath + "csv_files/test/test_static_cb_0.csv").pipe(set_table_dtypes)
test_person_1 = pl.read_csv(dataPath + "csv_files/test/test_person_1.csv").pipe(set_table_dtypes) 
test_credit_bureau_b_2 = pl.read_csv(dataPath + "csv_files/test/test_credit_bureau_b_2.csv").pipe(set_table_dtypes) 