# Example Notebook

Welcome to the example notebook for the Home Credit Kaggle competition. The goal of this competition is to predict how likely a customer is to default on an issued loan. The main difference from the [first](https://www.kaggle.com/c/home-credit-default-risk) competition is that submissions are now scored with a custom metric that also takes into account how well the model performs over time: a decline in performance is penalized. The aim is therefore a model that is stable and keeps performing well in the future.

In this notebook you will see how to:
* Load the data
* Join tables with Polars, a DataFrame library implemented in Rust and designed to be blazingly fast and memory efficient
* Create simple aggregation features
* Train a LightGBM model
* Create a submission table

In [1]:
import polars as pl
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score 

## 1. Import data

In [2]:
dataPath = "/kaggle/input/home-credit-credit-risk-model-stability/"

In [3]:
# example files

import os 
train_files = os.listdir(dataPath+'csv_files/train')

pd.read_csv(dataPath+'csv_files/train/'+train_files[3])

Unnamed: 0,case_id,addres_district_368M,addres_role_871L,addres_zip_823M,conts_role_79M,empls_economicalst_849M,empls_employedfrom_796D,empls_employer_name_740M,num_group1,num_group2,relatedpersons_role_762T
0,5,a55475b1,,a55475b1,a55475b1,a55475b1,,a55475b1,0,0,
1,6,P55_110_32,CONTACT,P10_68_40,P38_92_157,P164_110_33,,a55475b1,0,0,
2,6,P55_110_32,PERMANENT,P10_68_40,a55475b1,a55475b1,,a55475b1,0,1,
3,6,P204_92_178,CONTACT,P65_136_169,P38_92_157,P164_110_33,,a55475b1,1,0,OTHER_RELATIVE
4,6,P191_109_75,CONTACT,P10_68_40,P7_147_157,a55475b1,,a55475b1,1,1,OTHER_RELATIVE
...,...,...,...,...,...,...,...,...,...,...,...
1643405,2703450,a55475b1,,a55475b1,a55475b1,a55475b1,,a55475b1,0,0,
1643406,2703451,a55475b1,,a55475b1,a55475b1,a55475b1,,a55475b1,0,0,
1643407,2703452,a55475b1,,a55475b1,a55475b1,a55475b1,,a55475b1,0,0,
1643408,2703453,a55475b1,,a55475b1,a55475b1,a55475b1,,a55475b1,0,0,


In [4]:
def set_table_dtypes(df: pl.DataFrame) -> pl.DataFrame:
    # implement here all desired dtypes for tables
    # the following is just an example
    for col in df.columns:
        # last letter of column name will help you determine the type
        if col[-1] in ("P", "A"):
            df = df.with_columns(pl.col(col).cast(pl.Float64).alias(col))

    return df

def convert_strings(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.columns:  
        if df[col].dtype.name in ['object', 'string']:
            df[col] = df[col].astype("string").astype('category')
            current_categories = df[col].cat.categories
            new_categories = current_categories.to_list() + ["Unknown"]
            new_dtype = pd.CategoricalDtype(categories=new_categories, ordered=True)
            df[col] = df[col].astype(new_dtype)
    return df

Let us define a function to import the tables.

This is only an example; not all of the tables are imported.

In [5]:
train_files = os.listdir(dataPath+'csv_files/train')
test_files = os.listdir(dataPath+'csv_files/test')

print('Train files')
print('======================')
for file in train_files:
    print(file)

print('Test files')
print('======================')
for file in test_files:
    print(file)

Train files
train_credit_bureau_a_1_3.csv
train_static_cb_0.csv
train_applprev_1_0.csv
train_person_2.csv
train_base.csv
train_tax_registry_a_1.csv
train_static_0_0.csv
train_credit_bureau_a_1_0.csv
train_applprev_2.csv
train_credit_bureau_a_2_6.csv
train_credit_bureau_a_1_2.csv
train_person_1.csv
train_credit_bureau_a_1_1.csv
train_tax_registry_c_1.csv
train_credit_bureau_a_2_4.csv
train_credit_bureau_a_2_9.csv
train_credit_bureau_a_2_3.csv
train_credit_bureau_a_2_7.csv
train_credit_bureau_b_2.csv
train_credit_bureau_a_2_2.csv
train_static_0_1.csv
train_deposit_1.csv
train_credit_bureau_a_2_10.csv
train_tax_registry_b_1.csv
train_applprev_1_1.csv
train_credit_bureau_a_2_1.csv
train_credit_bureau_a_2_8.csv
train_credit_bureau_a_2_5.csv
train_credit_bureau_b_1.csv
train_credit_bureau_a_2_0.csv
train_other_1.csv
train_debitcard_1.csv
Test files
test_credit_bureau_a_2_6.csv
test_credit_bureau_a_2_11.csv
test_credit_bureau_a_2_0.csv
test_credit_bureau_a_2_9.csv
test_base.csv
test_credit_bu

In [6]:
def import_tabels(folder):
    basetable = pl.read_csv(dataPath + "csv_files/"+str(folder)+"/"+str(folder)+"_base.csv")
    static = pl.concat(
        [
            pl.read_csv(dataPath + "csv_files/"+str(folder)+"/"+str(folder)+"_static_0_0.csv").pipe(set_table_dtypes),
            pl.read_csv(dataPath + "csv_files/"+str(folder)+"/"+str(folder)+"_static_0_1.csv").pipe(set_table_dtypes),
        ],
        how="vertical_relaxed",
    )
    static_cb = pl.read_csv(dataPath + "csv_files/"+str(folder)+"/"+str(folder)+"_static_cb_0.csv").pipe(set_table_dtypes)
    person_1 = pl.read_csv(dataPath + "csv_files/"+str(folder)+"/"+str(folder)+"_person_1.csv").pipe(set_table_dtypes) 
    credit_bureau_b_2 = pl.read_csv(dataPath + "csv_files/"+str(folder)+"/"+str(folder)+"_credit_bureau_b_2.csv").pipe(set_table_dtypes) 
    
    return basetable, static, static_cb, person_1, credit_bureau_b_2

In [7]:
train_basetable, train_static, train_static_cb, train_person_1, train_credit_bureau_b_2 = import_tabels(folder="train")
test_basetable, test_static, test_static_cb, test_person_1, test_credit_bureau_b_2 = import_tabels(folder="test")

## 2. Data pre-processing / Feature engineering

In this part, we show a simple example of joining tables on `case_id`. Loading and joining are done with the Polars library, which is fast and typically has a much smaller memory footprint than pandas.

In [8]:
def preprocessing_pipline(basetable, static, static_cb, person_1, credit_bureau_b_2):
    # Tables with depth > 0 (those that contain a num_group1 column, and possibly
    # also num_group2) hold several rows per case_id, so we aggregate them first.
    person_1_feats_1 = person_1.group_by("case_id").agg(
        pl.col("mainoccupationinc_384A").max().alias("mainoccupationinc_384A_max"),
        (pl.col("incometype_1044T") == "SELFEMPLOYED").max().alias("mainoccupationinc_384A_any_selfemployed")
    )

    # Here num_group1=0 has special meaning, it is the person who applied for the loan.
    person_1_feats_2 = person_1.select(["case_id", "num_group1", "housetype_905L"]).filter(
        pl.col("num_group1") == 0
    ).drop("num_group1").rename({"housetype_905L": "person_housetype"})

    # Here we have num_group1 and num_group2, so we need to aggregate again.
    credit_bureau_b_2_feats = credit_bureau_b_2.group_by("case_id").agg(
        pl.col("pmts_pmtsoverdue_635A").max().alias("pmts_pmtsoverdue_635A_max"),
        (pl.col("pmts_dpdvalue_108P") > 31).max().alias("pmts_dpdvalue_108P_over31")
    )

    # In this example we process only A-type and M-type columns, so we select them here.
    selected_static_cols = []
    for col in static.columns:
        if col[-1] in ("A", "M"):
            selected_static_cols.append(col)
    print(selected_static_cols)

    selected_static_cb_cols = []
    for col in static_cb.columns:
        if col[-1] in ("A", "M"):
            selected_static_cb_cols.append(col)
    print(selected_static_cb_cols)

    # Join all tables together.
    data = basetable.join(
        static.select(["case_id"]+selected_static_cols), how="left", on="case_id"
    ).join(
        static_cb.select(["case_id"]+selected_static_cb_cols), how="left", on="case_id"
    ).join(
        person_1_feats_1, how="left", on="case_id"
    ).join(
        person_1_feats_2, how="left", on="case_id"
    ).join(
        credit_bureau_b_2_feats, how="left", on="case_id"
    )
    
    return data

In [9]:
data = preprocessing_pipline(train_basetable, 
                             train_static, 
                             train_static_cb, 
                             train_person_1, 
                             train_credit_bureau_b_2)


data_submission = preprocessing_pipline(test_basetable, 
                                        test_static, 
                                        test_static_cb, 
                                        test_person_1, 
                                        test_credit_bureau_b_2)

['amtinstpaidbefduel24m_4187115A', 'annuity_780A', 'annuitynextmonth_57A', 'avginstallast24m_3658937A', 'avglnamtstart24m_4525187A', 'avgoutstandbalancel6m_4187114A', 'avgpmtlast12m_4525200A', 'credamount_770A', 'currdebt_22A', 'currdebtcredtyperange_828A', 'disbursedcredamount_1113A', 'downpmt_116A', 'inittransactionamount_650A', 'lastapprcommoditycat_1041M', 'lastapprcommoditytypec_5251766M', 'lastapprcredamount_781A', 'lastcancelreason_561M', 'lastotherinc_902A', 'lastotherlnsexpense_631A', 'lastrejectcommoditycat_161M', 'lastrejectcommodtypec_5251769M', 'lastrejectcredamount_222A', 'lastrejectreason_759M', 'lastrejectreasonclient_4145040M', 'maininc_215A', 'maxannuity_159A', 'maxannuity_4075009A', 'maxdebt4_972A', 'maxinstallast24m_3658928A', 'maxlnamtstart6m_4525199A', 'maxoutstandbalancel12m_4187113A', 'maxpmtlast3m_4525190A', 'previouscontdistrict_112M', 'price_1097A', 'sumoutstandtotal_3546847A', 'sumoutstandtotalest_4493215A', 'totaldebt_9A', 'totalsettled_863A', 'totinstallas

## 3. Prepare data for ML
Create train, validation and test datasets
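
A 60/20/20 split can be obtained by calling `train_test_split` twice, which is the approach taken in the next cell. The sketch below uses 100 integer ids as a stand-in for the unique `case_id` values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

ids = np.arange(100)  # stand-in for the unique case_id values

# First split: 60% for training, 40% held out.
ids_train, ids_rest = train_test_split(ids, train_size=0.6, random_state=1)

# Second split: halve the held-out part into validation and test (20% each).
ids_valid, ids_test = train_test_split(ids_rest, train_size=0.5, random_state=1)

print(len(ids_train), len(ids_valid), len(ids_test))
```

Splitting the ids rather than the rows guarantees that each case ends up in exactly one of the three sets.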

In [10]:
case_ids = data["case_id"].unique().shuffle(seed=1)
case_ids_train, case_ids_test = train_test_split(case_ids, train_size=0.6, random_state=1)
case_ids_valid, case_ids_test = train_test_split(case_ids_test, train_size=0.5, random_state=1)

cols_pred = []
for col in data.columns:
    if col[-1].isupper() and col[:-1].islower():
        cols_pred.append(col)

print(cols_pred)

def from_polars_to_pandas(case_ids) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series]:
    # Filter once, then split into base columns, predictors and target.
    subset = data.filter(pl.col("case_id").is_in(case_ids))
    return (
        subset[["case_id", "WEEK_NUM", "target"]].to_pandas(),
        subset[cols_pred].to_pandas(),
        subset["target"].to_pandas(),
    )

base_train, X_train, y_train = from_polars_to_pandas(case_ids_train)
base_valid, X_valid, y_valid = from_polars_to_pandas(case_ids_valid)
base_test, X_test, y_test = from_polars_to_pandas(case_ids_test)

for df in [X_train, X_valid, X_test]:
    convert_strings(df)  # modifies df in place

['amtinstpaidbefduel24m_4187115A', 'annuity_780A', 'annuitynextmonth_57A', 'avginstallast24m_3658937A', 'avglnamtstart24m_4525187A', 'avgoutstandbalancel6m_4187114A', 'avgpmtlast12m_4525200A', 'credamount_770A', 'currdebt_22A', 'currdebtcredtyperange_828A', 'disbursedcredamount_1113A', 'downpmt_116A', 'inittransactionamount_650A', 'lastapprcommoditycat_1041M', 'lastapprcommoditytypec_5251766M', 'lastapprcredamount_781A', 'lastcancelreason_561M', 'lastotherinc_902A', 'lastotherlnsexpense_631A', 'lastrejectcommoditycat_161M', 'lastrejectcommodtypec_5251769M', 'lastrejectcredamount_222A', 'lastrejectreason_759M', 'lastrejectreasonclient_4145040M', 'maininc_215A', 'maxannuity_159A', 'maxannuity_4075009A', 'maxdebt4_972A', 'maxinstallast24m_3658928A', 'maxlnamtstart6m_4525199A', 'maxoutstandbalancel12m_4187113A', 'maxpmtlast3m_4525190A', 'previouscontdistrict_112M', 'price_1097A', 'sumoutstandtotal_3546847A', 'sumoutstandtotalest_4493215A', 'totaldebt_9A', 'totalsettled_863A', 'totinstallas

In [11]:
print(f"Train: {X_train.shape}")
print(f"Valid: {X_valid.shape}")
print(f"Test: {X_test.shape}")

Train: (915995, 48)
Valid: (305332, 48)
Test: (305332, 48)
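
The predictor columns counted above are picked by the naming rule used to build `cols_pred`: an all-lowercase name that ends in a single uppercase letter. The short check below applies the same rule to a handful of column names that appear in this notebook; the helper name `is_predictor` is introduced here only for illustration.

```python
def is_predictor(col: str) -> bool:
    """Lowercase name ending in one uppercase type letter, as in cols_pred."""
    return col[-1].isupper() and col[:-1].islower()


candidates = [
    "case_id",
    "WEEK_NUM",
    "target",
    "annuity_780A",
    "lastcancelreason_561M",
    "person_housetype",
    "mainoccupationinc_384A_max",
    "pmts_dpdvalue_108P_over31",
]

selected = [c for c in candidates if is_predictor(c)]
print(selected)  # ['annuity_780A', 'lastcancelreason_561M']
```

Note that, under this rule, renamed or aggregated columns such as `person_housetype` or `mainoccupationinc_384A_max` are not selected, so it may be worth adjusting the rule or the aliases if those features should reach the model.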


## 4. Training LightGBM

A minimal example of LightGBM training is shown below.

In [31]:
params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "max_depth": 3,
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "n_estimators": 1000,
    "verbose": -1,
    "class_weight": "balanced",
}

from lightgbm import LGBMClassifier

gbm = LGBMClassifier(**params)

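gbm.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    # Callbacks replace the `early_stopping_rounds` and `verbose` arguments of fit(),
    # which are no longer accepted by newer LightGBM releases (4.0 and later).
    callbacks=[lgb.early_stopping(stopping_rounds=50), lgb.log_evaluation(period=20)],
)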



[20]	valid_0's auc: 0.686749
[40]	valid_0's auc: 0.704062
[60]	valid_0's auc: 0.713111
[80]	valid_0's auc: 0.720393
[100]	valid_0's auc: 0.725278
[120]	valid_0's auc: 0.728326
[140]	valid_0's auc: 0.731096
[160]	valid_0's auc: 0.733286
[180]	valid_0's auc: 0.734569
[200]	valid_0's auc: 0.736291
[220]	valid_0's auc: 0.738115
[240]	valid_0's auc: 0.739339
[260]	valid_0's auc: 0.740364
[280]	valid_0's auc: 0.741253
[300]	valid_0's auc: 0.742107
[320]	valid_0's auc: 0.743115
[340]	valid_0's auc: 0.743762
[360]	valid_0's auc: 0.744681
[380]	valid_0's auc: 0.745262
[400]	valid_0's auc: 0.745916
[420]	valid_0's auc: 0.746207
[440]	valid_0's auc: 0.746814
[460]	valid_0's auc: 0.747427
[480]	valid_0's auc: 0.747962
[500]	valid_0's auc: 0.748288
[520]	valid_0's auc: 0.748705
[540]	valid_0's auc: 0.749118
[560]	valid_0's auc: 0.749451
[580]	valid_0's auc: 0.749889
[600]	valid_0's auc: 0.750456
[620]	valid_0's auc: 0.75094
[640]	valid_0's auc: 0.751211
[660]	valid_0's auc: 0.751326
[680]	valid_0's

Below, the model is first evaluated with AUC and then with the stability metric for comparison.
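
The stability metric in this notebook is built on the Gini coefficient, which is computed from AUC as `2 * AUC - 1`. A small hand-checkable example with made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Four cases: two non-defaults (0) and two defaults (1), with made-up scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Three of the four (default, non-default) pairs are ranked correctly.
auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1

print(auc, gini)
```

An AUC of 0.5 (random ranking) corresponds to a Gini of 0, and a perfect ranking corresponds to a Gini of 1.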

In [35]:
for base, X in [(base_train, X_train), (base_valid, X_valid), (base_test, X_test)]:
    y_pred = gbm.predict_proba(X)[:, 1]
    base["score"] = y_pred

print(f'The AUC score on the train set is: {roc_auc_score(base_train["target"], base_train["score"])}') 
print(f'The AUC score on the valid set is: {roc_auc_score(base_valid["target"], base_valid["score"])}') 
print(f'The AUC score on the test set is: {roc_auc_score(base_test["target"], base_test["score"])}')  

The AUC score on the train set is: 0.7715510616958967
The AUC score on the valid set is: 0.7545667088666379
The AUC score on the test set is: 0.7508559478935085


In [36]:
def gini_stability(base, w_fallingrate=88.0, w_resstd=-0.5):
    gini_in_time = base.loc[:, ["WEEK_NUM", "target", "score"]]\
        .sort_values("WEEK_NUM")\
        .groupby("WEEK_NUM")[["target", "score"]]\
        .apply(lambda x: 2*roc_auc_score(x["target"], x["score"])-1).tolist()
    
    x = np.arange(len(gini_in_time))
    y = gini_in_time
    a, b = np.polyfit(x, y, 1)
    y_hat = a*x + b
    residuals = y - y_hat
    res_std = np.std(residuals)
    avg_gini = np.mean(gini_in_time)
    return avg_gini + w_fallingrate * min(0, a) + w_resstd * res_std

stability_score_train = gini_stability(base_train)
stability_score_valid = gini_stability(base_valid)
stability_score_test = gini_stability(base_test)

print(f'The stability score on the train set is: {stability_score_train}') 
print(f'The stability score on the valid set is: {stability_score_valid}') 
print(f'The stability score on the test set is: {stability_score_test}')  

The stability score on the train set is: 0.5146956546236384
The stability score on the valid set is: 0.4788056052181807
The stability score on the test set is: 0.4636066937102159
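
To see how the three terms of `gini_stability` interact, the same formula can be applied directly to a list of weekly Gini values. The sketch below uses invented weekly values and a hypothetical helper `stability_from_ginis`; it mirrors the arithmetic of the function above but skips the per-week AUC computation.

```python
import numpy as np


def stability_from_ginis(ginis, w_fallingrate=88.0, w_resstd=-0.5):
    """Mean Gini, plus a penalty for a falling trend, plus a penalty for volatility."""
    ginis = np.asarray(ginis, dtype=float)
    weeks = np.arange(len(ginis))
    slope, intercept = np.polyfit(weeks, ginis, 1)
    residuals = ginis - (slope * weeks + intercept)
    return ginis.mean() + w_fallingrate * min(0.0, slope) + w_resstd * residuals.std()


# Three made-up series with the same mean Gini of 0.5.
flat = [0.50, 0.50, 0.50, 0.50, 0.50]
falling = [0.60, 0.55, 0.50, 0.45, 0.40]
rising = [0.40, 0.45, 0.50, 0.55, 0.60]

score_flat = stability_from_ginis(flat)
score_falling = stability_from_ginis(falling)
score_rising = stability_from_ginis(rising)

print(score_flat, score_falling, score_rising)
```

With these inputs the flat and rising series both score about 0.5, because a positive slope is not rewarded, while the falling series loses `88 * 0.05 = 4.4` points and ends up near -3.9. Even a small weekly decline therefore dominates the score.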


## 5. Submit predictions

Below we score the submission dataset. Categories that did not appear in the training data need special handling, so they are mapped to "Unknown". As a last step, the scores are saved to a CSV file.
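
The reason unseen categories need care is that casting a column to a fixed categorical dtype silently turns unknown values into missing values. A minimal pandas sketch with made-up category labels:

```python
import pandas as pd

# Categories seen during training, plus the extra "Unknown" bucket.
train_col = pd.Series(["A", "B", "A"]).astype("category")
train_cats = train_col.cat.categories.to_list() + ["Unknown"]
dtype = pd.CategoricalDtype(categories=train_cats, ordered=True)

# New data contains "C" and "D", which were never seen in training.
new_col = pd.Series(["A", "C", "B", "D"])

# Casting directly would turn the unseen values into NaN.
naive = new_col.astype(dtype)

# Replacing unseen values first keeps them in the explicit "Unknown" bucket.
cleaned = new_col.where(new_col.isin(train_cats), "Unknown").astype(dtype)

print(naive.tolist())
print(cleaned.tolist())
```

Whether "Unknown" or a missing value is the better encoding for unseen categories depends on the model; this sketch only shows the mechanics.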

In [37]:
X_submission = data_submission[cols_pred].to_pandas()
X_submission = convert_strings(X_submission)
categorical_cols = X_train.select_dtypes(include=['category']).columns

for col in categorical_cols:
    train_categories = set(X_train[col].cat.categories)
    submission_categories = set(X_submission[col].cat.categories)
    new_categories = submission_categories - train_categories
    X_submission.loc[X_submission[col].isin(new_categories), col] = "Unknown"
    new_dtype = pd.CategoricalDtype(categories=train_categories, ordered=True)
    X_train[col] = X_train[col].astype(new_dtype)
    X_submission[col] = X_submission[col].astype(new_dtype)

y_submission_pred = gbm.predict(X_submission)

In [40]:
submission = pd.DataFrame({
    "case_id": data_submission["case_id"].to_numpy(),
    "score": y_submission_pred
}).set_index('case_id')
submission.to_csv("./submission.csv")

submission.head()

case_id,score
57543,1
57549,1
57551,1
57552,0
57569,0
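
The `score` column above contains hard 0/1 labels because `predict` on a scikit-learn style classifier returns class labels. For a ranking-based metric such as AUC or Gini, the predicted probabilities from `predict_proba` are usually the more useful score, as in the evaluation cell earlier. The sketch below illustrates the difference on synthetic data, with `LogisticRegression` as a stand-in for `LGBMClassifier` (an assumption made only to keep the example light).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic binary classification data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

labels = clf.predict(X)              # hard 0/1 decisions
probas = clf.predict_proba(X)[:, 1]  # probability of class 1

print("distinct label values:", np.unique(labels))
print("distinct probability values:", len(np.unique(probas)))
print("AUC from labels:", roc_auc_score(y, labels))
print("AUC from probabilities:", roc_auc_score(y, probas))
```

In this notebook, the analogous change would be to use `gbm.predict_proba(X_submission)[:, 1]` when building the submission.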


Best of luck, and most importantly, enjoy the process of learning and discovery! 

<img src="https://i.imgur.com/obVWIBh.png" alt="Image" width="700"/>