# Study and improvement of the Kaggle kernel [“**LightGBM with Simple Features**”](https://www.kaggle.com/code/jsaguiar/lightgbm-with-simple-features)

The goal is to understand the processing performed by this kernel, then to improve it (functionally and technically).

The appendix focuses on each key processing step.

The first part applies the conclusions and improvements justified by the appendix, tests for functional non-regression, and compares performance to highlight the gains.

## To understand:

In [None]:
import time
from contextlib import contextmanager

@contextmanager
def timer(title):
    t0 = time.time()
    yield
    print("{} - done in {:.0f}s".format(title, time.time() - t0))

# Reworking and improving the functionality

## Intro and imports

```Python
# From : https://www.kaggle.com/code/jsaguiar/lightgbm-with-simple-features
# kaggle kernels output jsaguiar/lightgbm-with-simple-features -p /path/to/dest

# HOME CREDIT DEFAULT RISK COMPETITION
# Most features are created by applying min, max, mean, sum and var functions to grouped tables. 
# Little feature selection is done and overfitting might be a problem since many features are related.
# The following key ideas were used:
# - Divide or subtract important features to get rates (like annuity and income)
# - In Bureau Data: create specific features for Active credits and Closed credits
# - In Previous Applications: create specific features for Approved and Refused applications
# - Modularity: one function for each table (except bureau_balance and application_test)
# - One-hot encoding for categorical features
# All tables are joined with the application DF using the SK_ID_CURR key (except bureau_balance).
# You can use LightGBM with KFold or Stratified KFold.

# Update 16/06/2018:
# - Added Payment Rate feature
# - Removed index from features
# - Use standard KFold CV (not stratified)

import numpy as np
import pandas as pd
import gc
import time
from contextlib import contextmanager
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import KFold, StratifiedKFold
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
```

In [44]:
%pip install lightgbm

Collecting lightgbm
  Downloading lightgbm-3.3.5-py3-none-win_amd64.whl (1.0 MB)
Collecting wheel
  Using cached wheel-0.40.0-py3-none-any.whl (64 kB)
Installing collected packages: wheel, lightgbm
Successfully installed lightgbm-3.3.5 wheel-0.40.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
import lightgbm as lgb

## `timer`

```Python
from contextlib import contextmanager

@contextmanager
def timer(title):
    t0 = time.time()
    yield
    print("{} - done in {:.0f}s".format(title, time.time() - t0))
```
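As a usage illustration, here is a minimal sketch of how `timer` wraps a block of work (the toy computation is only a placeholder). Note that, as written, the message is printed only when the block finishes without raising, since there is no `try`/`finally` around the `yield`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(title):
    t0 = time.time()
    yield
    print("{} - done in {:.0f}s".format(title, time.time() - t0))


# Everything inside the `with` block is timed; the message is printed on exit.
with timer("Toy computation"):
    total = sum(i * i for i in range(100_000))
```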

## `one_hot_encoder`

```Python
# One-hot encoding for categorical columns with get_dummies
def one_hot_encoder(df, nan_as_category = True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns= categorical_columns, dummy_na= nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns
```
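To make the behaviour concrete, here is a small sketch that runs the same function on a toy frame (the column names `AMT` and `KIND` are made up for the illustration). Only the `object` column is expanded, and `dummy_na=True` adds a `_nan` indicator column.

```python
import pandas as pd


def one_hot_encoder(df, nan_as_category=True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns


# Toy data: one numeric column, one object column with a missing value.
toy = pd.DataFrame({
    "AMT": [1.0, 2.0, 3.0],
    "KIND": pd.Series(["a", "b", None], dtype=object),
})
encoded, new_cols = one_hot_encoder(toy)
print(new_cols)
```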

In [2]:
from home_credit.load import get_application_train

df = get_application_train()
display(df.head(3))

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\application_train.pqt


application_train,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
from home_credit.lightgbm_kernel import one_hot_encoder as oh_encode_1
from home_credit.lightgbm_kernel_v2 import one_hot_encode_all_cats as oh_encode_2

ohe_1, _ = oh_encode_1(df)

# Parameters changed to produce an output identical
# to that of the reference version
# NB > the list of new columns is not returned because:
# 1/ this information is not always used
# 2/ it can be derived if needed from the result of the function call
# with ohe.columns.difference(df.columns)
ohe_2 = oh_encode_2(df, drop_first=False, dtype=bool, sparse=False, discard_constants=False)

ohe_1_not_in_ohe_2 = ohe_1.columns.difference(ohe_2.columns)
ohe_2_not_in_ohe_1 = ohe_2.columns.difference(ohe_1.columns)

if len(ohe_1_not_in_ohe_2) + len(ohe_2_not_in_ohe_1) > 0:
    print("`ohe_1` and `ohe_2` have different cols :")
    if len(ohe_1_not_in_ohe_2):
        print("`ohe_1` columns not in `ohe_2`", ohe_1_not_in_ohe_2)
    if len(ohe_2_not_in_ohe_1):
        print("`ohe_2` columns not in `ohe_1`", ohe_2_not_in_ohe_1)
else:
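    # NB: built-in `all()` on a DataFrame iterates over its column labels, so
    # `all(ohe_1 == ohe_2)` is always True; `DataFrame.equals` compares the values
    print("same result:", ohe_1.equals(ohe_2))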


same result: True
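A side note on comparing two frames: the sketch below, on made-up toy data, shows why the built-in `all()` is not a value check when applied to a DataFrame, and two alternatives that are.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [1, 2], "y": [3, 99]})

# Built-in all() iterates over the column labels ("x", "y"), which are truthy.
labels_check = all(a == b)
# Element-wise comparison reduced over both axes.
values_check = bool((a == b).all().all())
# DataFrame.equals also treats NaNs located at the same position as equal.
equals_check = a.equals(b)

print(labels_check, values_check, equals_check)
```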


In [6]:
ohe_1.CODE_GENDER_nan.value_counts(dropna=False)

CODE_GENDER_nan
False    307511
Name: count, dtype: int64

## `application_train_test`

```Python
# Preprocess application_train.csv and application_test.csv
def application_train_test(num_rows = None, nan_as_category = False):
    # Read data and merge
    df = pd.read_csv('../input/application_train.csv', nrows= num_rows)
    test_df = pd.read_csv('../input/application_test.csv', nrows= num_rows)
    print("Train samples: {}, test samples: {}".format(len(df), len(test_df)))
    df = df.append(test_df).reset_index()
    # Optional: Remove 4 applications with XNA CODE_GENDER (train set)
    df = df[df['CODE_GENDER'] != 'XNA']
    
    # Categorical features with Binary encode (0 or 1; two categories)
    for bin_feature in ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']:
        df[bin_feature], uniques = pd.factorize(df[bin_feature])

    # Categorical features with One-Hot encode
    df, cat_cols = one_hot_encoder(df, nan_as_category)
    
    # NaN values for DAYS_EMPLOYED: 365.243 -> nan
    df['DAYS_EMPLOYED'].replace(365243, np.nan, inplace= True)
    # Some simple new features (percentages)
    df['DAYS_EMPLOYED_PERC'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']
    df['INCOME_CREDIT_PERC'] = df['AMT_INCOME_TOTAL'] / df['AMT_CREDIT']
    df['INCOME_PER_PERSON'] = df['AMT_INCOME_TOTAL'] / df['CNT_FAM_MEMBERS']
    df['ANNUITY_INCOME_PERC'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
    df['PAYMENT_RATE'] = df['AMT_ANNUITY'] / df['AMT_CREDIT']
    del test_df
    gc.collect()
    return df
```

### Loading the tables

**Warning** To compare the results of the two loading modes, keep in mind that `get_application` performs an `application.sort_index(inplace=True)`.

To compare them, the indexes must therefore be realigned on those of v1: this is the role of the function `pepper.pd_utils.align_df2_on_df1`.
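The implementation of `align_df2_on_df1` is not shown here. As a rough idea of what such a realignment involves, here is a hypothetical helper (`align_on_key` is an invented name, and it assumes the key is unique in both frames): it reorders the rows of the second frame to follow the key order of the first and reuses the first frame's index.

```python
import pandas as pd


def align_on_key(key, df1, df2):
    """Hypothetical sketch: reorder df2's rows to follow df1[key], reuse df1's index."""
    aligned = df2.set_index(key).loc[df1[key].to_numpy()].reset_index()
    aligned = aligned[df2.columns]  # restore df2's original column order
    aligned.index = df1.index
    return aligned


# Toy frames holding the same rows in a different order.
df1 = pd.DataFrame({"SK_ID_CURR": [3, 1, 2], "v": [30, 10, 20]})
df2 = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "v": [10, 20, 30]})
aligned = align_on_key("SK_ID_CURR", df1, df2)
```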

#### Original version

In [2]:
import pandas as pd
def load_application_v1(nrows=None):
    # Read data and merge
    df = pd.read_csv('../../dataset/csv/application_train.csv', nrows=nrows)
    test_df = pd.read_csv('../../dataset/csv/application_test.csv', nrows=nrows)
    # print(f"Train samples: {len(df)}, test samples: {len(test_df)}")
    # NB: `DataFrame.append` was deprecated in pandas 1.4.0 and removed in 2.0;
    #     `pd.concat` replaces it
    # NB2: `reset_index()` (default `drop=False`) keeps the old index as an `index` column;
    #     `drop=True` is used here so that this extra column is not created
    #df = df.append(test_df).reset_index()
    df = pd.concat([df, test_df], axis=0)
    df = df.reset_index(drop=True)
    return df
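
For reference, a minimal sketch on toy frames of what `reset_index` does after a vertical `concat`, with and without `drop=True` (the data is invented for the illustration):

```python
import pandas as pd

train = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})
test = pd.DataFrame({"SK_ID_CURR": [3]})

# Stacking keeps each frame's own index, so labels 0, 1, 0 are duplicated.
stacked = pd.concat([train, test], axis=0)

with_index_col = stacked.reset_index()              # default drop=False
without_index_col = stacked.reset_index(drop=True)  # old index discarded

print(list(with_index_col.columns))
print(list(without_index_col.columns))
```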

#### New version

A wrapper around `get_table` that adapts the type and values of `TARGET`.

In [3]:
from home_credit.utils import get_table
import numpy as np
def load_application_v2(nrows=None):
    data = get_table("application").copy()
    data.TARGET = data.TARGET.astype(object).replace(-1, np.nan)
    # NB: `sample` draws random rows, so with `nrows` set the subset is not the same as v1's first rows
    return data if nrows is None else data.sample(nrows)

#### Comparison

In [6]:
from pepper.pd_utils import align_df2_on_df1
app_v1 = load_application_v1()
app_v2 = align_df2_on_df1("SK_ID_CURR", app_v1, load_application_v2())
# display(app_v1)
# display(app_v2)

In [5]:
from pepper.pd_utils import df_neq
is_diff = df_neq(app_v1, app_v2)
print("n_diffs:", is_diff.sum().sum())
# print("n_diffs by cols:\n", is_diff.sum(), sep="")
# print("n_diffs by rows:\n", is_diff.sum(axis=1), sep="")
# display(is_diff.any())
# display(is_diff.any(axis=1))

n_diffs: 0
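
The code of `df_neq` is not reproduced in this notebook. A plausible requirement for such a helper is that two missing values at the same position are not counted as a difference, which a plain `!=` does not guarantee. The sketch below (`nan_aware_neq` is an invented name, not the actual `pepper` function) illustrates that idea.

```python
import numpy as np
import pandas as pd


def nan_aware_neq(df1, df2):
    """Hypothetical sketch: True where values differ, NaN vs NaN counted as equal."""
    both_nan = df1.isna() & df2.isna()
    return (df1 != df2) & ~both_nan


df1 = pd.DataFrame({"a": [1.0, np.nan, 3.0]})
df2 = pd.DataFrame({"a": [1.0, np.nan, 4.0]})

naive_count = int((df1 != df2).sum().sum())        # NaN != NaN is counted
aware_count = int(nan_aware_neq(df1, df2).sum().sum())
```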


In [5]:
dtypes_diff = app_v1.dtypes != app_v2.dtypes
if (~dtypes_diff).all():
    print("dtypes are aligned")
else:
    print("dtypes diffs:")
    display(app_v1.dtypes[dtypes_diff])
    display(app_v2.dtypes[dtypes_diff])

dtypes are aligned


### Cleaning the categories

Removal or correction of outliers and missing values for the categories.

In [13]:
def clean_cats_v1(df):
    # Optional: Remove 4 applications with XNA CODE_GENDER (train set)
    return df[df['CODE_GENDER'] != 'XNA']

In [12]:
def clean_cats_v2(data):
    # Optional: Remove 4 applications with XNA CODE_GENDER (train set)
    data.drop(index=data.index[data.CODE_GENDER == "XNA"], inplace=True)
    return data

In [14]:
clean_app_v1 = clean_cats_v1(app_v1)
clean_app_v2 = clean_cats_v2(app_v2)

In [15]:
from pepper.pd_utils import df_neq
is_diff = df_neq(clean_app_v1, clean_app_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


### Encoding the binary categories

In [27]:
def encode_bin_cats_v1(df):    
    # Categorical features with Binary encode (0 or 1; two categories)
    for bin_feature in ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']:
        # Unused `uniques` has been replaced by `_`
        label_code, _ = pd.factorize(df[bin_feature])
        # Avoid warning
        df.loc[:, bin_feature] = label_code

In [20]:
def encode_bin_cats_v2(data):    
    # Categorical features with Binary encode (0 or 1; two categories)
    bin_vars = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
    for bin_var in bin_vars:
        data[bin_var] = data[bin_var].astype("category").cat.codes

In [28]:
encode_bin_cats_v1(clean_app_v1)
encode_bin_cats_v2(clean_app_v2)

In [29]:
from pepper.pd_utils import df_neq
is_diff = df_neq(clean_app_v1, clean_app_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 575913


Nothing alarming: the two encoding techniques give the same result up to a permutation of the two labels.
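
The permutation comes from how each method numbers the labels. A minimal sketch on a toy series: `pd.factorize` assigns codes in order of first appearance, whereas `.cat.codes` follows the sorted categories. Both encode missing values as `-1`.

```python
import pandas as pd

s = pd.Series(["M", "F", "F", "M"])

# Codes follow the order of first appearance: "M" -> 0, "F" -> 1.
by_appearance, uniques = pd.factorize(s)
# Codes follow the sorted categories: "F" -> 0, "M" -> 1.
by_sorted = s.astype("category").cat.codes

print(list(by_appearance), list(by_sorted))
```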

In any case, we will not keep such a specific approach for binary categorical variables, but rather use the same systematic one-hot encoding approach as for all categorical variables (see `one_hot_encode_all_cats`, to be used together with `get_categorical_vars`).

Indeed, there is no point in keeping two columns, i.e. two anticorrelated variables (see the `drop_first` section in the appendix).
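
A quick illustration of that redundancy with plain `pd.get_dummies` on a toy binary flag (this uses the pandas function directly, not the project's `one_hot_encode_all_cats`): the two indicator columns are perfectly anticorrelated, and `drop_first=True` keeps only one of them.

```python
import pandas as pd

flag = pd.Series(["Y", "N", "Y", "Y", "N"], name="FLAG_OWN_CAR")

full = pd.get_dummies(flag, prefix="FLAG_OWN_CAR", dtype=int)
reduced = pd.get_dummies(flag, prefix="FLAG_OWN_CAR", drop_first=True, dtype=int)

# Pearson correlation between the two complementary indicators.
corr = full["FLAG_OWN_CAR_N"].corr(full["FLAG_OWN_CAR_Y"])
print(list(full.columns), list(reduced.columns), round(corr, 6))
```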

In [34]:
bin_vars = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
display(pd.concat([clean_app_v1[bin_vars], clean_app_v2[bin_vars]], axis=1))

Unnamed: 0,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CODE_GENDER.1,FLAG_OWN_CAR.1,FLAG_OWN_REALTY.1
0,0,0,0,0,0,0
1,1,0,1,1,0,1
2,0,1,0,0,1,0
3,1,0,0,1,0,0
4,0,0,0,0,0,0
...,...,...,...,...,...,...
356250,1,1,1,1,0,0
356251,1,0,1,1,0,1
356252,1,1,0,1,1,0
356253,0,0,1,0,0,1


If differences remain after this operation, a deeper investigation will be needed:

**TODO** This is the case, so do it.

In [33]:
from pepper.pd_utils import df_neq
clean_app_v2.CODE_GENDER = 1 - clean_app_v2.CODE_GENDER
clean_app_v2.FLAG_OWN_REALTY = 1 - clean_app_v2.FLAG_OWN_REALTY
is_diff = df_neq(clean_app_v1, clean_app_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 424829


### One-hot encoding of the non-binary categories

In [38]:
from home_credit.lightgbm_kernel import one_hot_encoder
def hot_encode_cats_v1(df, nan_as_category=True):
    # Categorical features with One-Hot encode
    # Unused `cat_cols` has been replaced by `_`
    df, _ = one_hot_encoder(df, nan_as_category)
    return df

In [39]:
from home_credit.lightgbm_kernel_v2 import get_categorical_vars, one_hot_encode_all_cats
def hot_encode_cats_v2(data, nan_as_category=True):
    # Categorical features with One-Hot encode
    return one_hot_encode_all_cats(
        data, get_categorical_vars(data),
        dummy_na=nan_as_category
    )

In [40]:
ohe_app_v1 = hot_encode_cats_v1(clean_app_v1)
ohe_app_v2 = hot_encode_cats_v2(clean_app_v2)

In [41]:
display(ohe_app_v1.shape)
display(ohe_app_v2.shape)

(356251, 255)

(356251, 235)

We can only compare like with like: our version has fewer columns because it takes advantage of `drop_first`. The column diff brings out many `_nan` columns: this makes sense, since if a column has to be dropped, that one is dropped in priority over those carrying an actual category label.
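
The internals of `one_hot_encode_all_cats` are not shown here, so the following is only a sketch with plain `pd.get_dummies` on toy data. It shows two separate effects that can each remove columns: `drop_first=True` removes the first level, and a `_nan` indicator is constant (all `False`) when the source column has no missing value, as observed above for `CODE_GENDER_nan`. Dropping such constant columns is an assumption about what a `discard_constants` option might do.

```python
import pandas as pd

df = pd.DataFrame({"KIND": pd.Series(["a", "b", "a"], dtype=object)})

full = pd.get_dummies(df, columns=["KIND"], dummy_na=True)
reduced = pd.get_dummies(df, columns=["KIND"], dummy_na=True, drop_first=True)

# Hypothetical post-processing: drop columns holding a single distinct value.
constant_cols = [c for c in reduced.columns if reduced[c].nunique(dropna=False) <= 1]
trimmed = reduced.drop(columns=constant_cols)

print(list(full.columns), list(reduced.columns), list(trimmed.columns))
```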

In [45]:
display(ohe_app_v1.columns)
display(ohe_app_v2.columns)
display(ohe_app_v1.columns.difference(ohe_app_v2.columns))

Index(['SK_ID_CURR', 'TARGET', 'CODE_GENDER', 'FLAG_OWN_CAR',
       'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       ...
       'WALLSMATERIAL_MODE_Mixed', 'WALLSMATERIAL_MODE_Monolithic',
       'WALLSMATERIAL_MODE_Others', 'WALLSMATERIAL_MODE_Panel',
       'WALLSMATERIAL_MODE_Stone, brick', 'WALLSMATERIAL_MODE_Wooden',
       'WALLSMATERIAL_MODE_nan', 'EMERGENCYSTATE_MODE_No',
       'EMERGENCYSTATE_MODE_Yes', 'EMERGENCYSTATE_MODE_nan'],
      dtype='object', length=255)

Index(['SK_ID_CURR', 'TARGET', 'CODE_GENDER', 'FLAG_OWN_CAR',
       'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       ...
       'HOUSETYPE_MODE_nan', 'WALLSMATERIAL_MODE_Mixed',
       'WALLSMATERIAL_MODE_Monolithic', 'WALLSMATERIAL_MODE_Others',
       'WALLSMATERIAL_MODE_Panel', 'WALLSMATERIAL_MODE_Stone, brick',
       'WALLSMATERIAL_MODE_Wooden', 'WALLSMATERIAL_MODE_nan',
       'EMERGENCYSTATE_MODE_Yes', 'EMERGENCYSTATE_MODE_nan'],
      dtype='object', length=235)

Index(['EMERGENCYSTATE_MODE_No', 'FONDKAPREMONT_MODE_not specified',
       'HOUSETYPE_MODE_block of flats', 'NAME_CONTRACT_TYPE_Cash loans',
       'NAME_CONTRACT_TYPE_nan', 'NAME_EDUCATION_TYPE_Academic degree',
       'NAME_EDUCATION_TYPE_nan', 'NAME_FAMILY_STATUS_Civil marriage',
       'NAME_FAMILY_STATUS_nan', 'NAME_HOUSING_TYPE_Co-op apartment',
       'NAME_HOUSING_TYPE_nan', 'NAME_INCOME_TYPE_Businessman',
       'NAME_INCOME_TYPE_nan', 'NAME_TYPE_SUITE_Children',
       'OCCUPATION_TYPE_Accountants', 'ORGANIZATION_TYPE_Advertising',
       'ORGANIZATION_TYPE_nan', 'WALLSMATERIAL_MODE_Block',
       'WEEKDAY_APPR_PROCESS_START_FRIDAY', 'WEEKDAY_APPR_PROCESS_START_nan'],
      dtype='object')

### Cleaning the numeric variables

Removal or correction of outliers and missing values for the numeric variables.

Let us check that 365243 appears only in `DAYS_EMPLOYED`.

In [51]:
cols = ohe_app_v2.columns
days_cols = cols[cols.str.match("DAYS_")]
days_data = ohe_app_v2[days_cols]
display(days_data.max())
# display(days_data[(days_data == 365243).any(axis=1)])

DAYS_BIRTH                 -7338.0
DAYS_EMPLOYED             365243.0
DAYS_REGISTRATION              0.0
DAYS_ID_PUBLISH                0.0
DAYS_LAST_PHONE_CHANGE         0.0
dtype: float64
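
The same check can be made explicit by testing for the sentinel value itself rather than reading it off the maxima. A minimal sketch on invented toy data, followed by the replacement with `NaN`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "DAYS_BIRTH": [-12000, -9000, -15000],
    "DAYS_EMPLOYED": [-300, 365243, -1200],
})

# Select the DAYS_* columns and flag those containing the sentinel.
days_cols = toy.columns[toy.columns.str.startswith("DAYS_")]
has_sentinel = (toy[days_cols] == 365243).any()

# Replace the sentinel with NaN on a copy, leaving the toy frame untouched.
cleaned = toy.copy()
cleaned[days_cols] = cleaned[days_cols].replace(365243, np.nan)
```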

In [52]:
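def clean_nums_v1(df):
    # NaN values for DAYS_EMPLOYED: 365243 -> nan
    # (assignment rather than a chained `inplace` replace, which may act on a copy)
    df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].replace(365243, np.nan)
    # Returned so that the result can be assigned by the caller
    return df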

In [53]:
from home_credit.feat_eng import nullify_365243
def clean_nums_v2(data):
    # NaN values for DAYS_EMPLOYED: 365243 -> nan
    nullify_365243(data.DAYS_EMPLOYED)
    # Returned so that the result can be assigned by the caller
    return data

In [55]:
clean_app_v1 = clean_nums_v1(ohe_app_v1)
clean_app_v2 = clean_nums_v2(ohe_app_v2)

### Création de variables additionnelles dérivées

In [57]:
def add_derived_features_v1(df):
    # Some simple new features (percentages)
    df['DAYS_EMPLOYED_PERC'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']
    df['INCOME_CREDIT_PERC'] = df['AMT_INCOME_TOTAL'] / df['AMT_CREDIT']
    df['INCOME_PER_PERSON'] = df['AMT_INCOME_TOTAL'] / df['CNT_FAM_MEMBERS']
    df['ANNUITY_INCOME_PERC'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
    df['PAYMENT_RATE'] = df['AMT_ANNUITY'] / df['AMT_CREDIT']
    # Returned so that the result can be assigned by the caller
    return df

In [58]:
def add_derived_features_v2(data):
    # Some simple new features (percentages)
    # All five features are ratios, as in v1
    data.eval(
        """
        DAYS_EMPLOYED_PERC = DAYS_EMPLOYED / DAYS_BIRTH
        INCOME_CREDIT_PERC = AMT_INCOME_TOTAL / AMT_CREDIT
        INCOME_PER_PERSON = AMT_INCOME_TOTAL / CNT_FAM_MEMBERS
        ANNUITY_INCOME_PERC = AMT_ANNUITY / AMT_INCOME_TOTAL
        PAYMENT_RATE = AMT_ANNUITY / AMT_CREDIT
        """,
        inplace=True, engine="numexpr"
    )
    # Returned so that the result can be assigned by the caller
    return data
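
The `eval` form is expected to yield the same columns as plain column arithmetic. A small sketch on invented toy data to check this equivalence for two of the ratios (the engine is left to pandas' default here, so that the example does not require `numexpr`):

```python
import pandas as pd

toy = pd.DataFrame({
    "AMT_ANNUITY": [100.0, 250.0],
    "AMT_CREDIT": [1000.0, 5000.0],
    "AMT_INCOME_TOTAL": [400.0, 1000.0],
})

# Multi-line assignment: one new column per line, returned as a new frame.
via_eval = toy.eval(
    """
    PAYMENT_RATE = AMT_ANNUITY / AMT_CREDIT
    ANNUITY_INCOME_PERC = AMT_ANNUITY / AMT_INCOME_TOTAL
    """
)

# Same features with ordinary column arithmetic.
via_assign = toy.assign(
    PAYMENT_RATE=toy.AMT_ANNUITY / toy.AMT_CREDIT,
    ANNUITY_INCOME_PERC=toy.AMT_ANNUITY / toy.AMT_INCOME_TOTAL,
)
```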

In [59]:
ext_app_v1 = add_derived_features_v1(ohe_app_v1)
ext_app_v2 = add_derived_features_v2(ohe_app_v2)

### Integrated version

In [None]:
# Preprocess application_train.csv and application_test.csv
def application_train_test_v1(nrows=None, nan_as_category=False):
    df = load_application_v1(nrows)
    df = clean_cats_v1(df)
    encode_bin_cats_v1(df)
    df = hot_encode_cats_v1(df, nan_as_category=nan_as_category)
    clean_nums_v1(df)
    add_derived_features_v1(df)
    return df

In [None]:
def application_train_test_v2(nrows=None, nan_as_category=False):
    data = load_application_v2(nrows)
    clean_cats_v2(data)
    encode_bin_cats_v2(data)
    data = hot_encode_cats_v2(data, nan_as_category=nan_as_category)
    clean_nums_v2(data)
    add_derived_features_v2(data)
    return data

## `bureau_and_balance`

```Python
# Preprocess bureau.csv and bureau_balance.csv
def bureau_and_balance(num_rows = None, nan_as_category = True):
    bureau = pd.read_csv('../input/bureau.csv', nrows = num_rows)
    bb = pd.read_csv('../input/bureau_balance.csv', nrows = num_rows)
    bb, bb_cat = one_hot_encoder(bb, nan_as_category)
    bureau, bureau_cat = one_hot_encoder(bureau, nan_as_category)
    
    # Bureau balance: Perform aggregations and merge with bureau.csv
    bb_aggregations = {'MONTHS_BALANCE': ['min', 'max', 'size']}
    for col in bb_cat:
        bb_aggregations[col] = ['mean']
    bb_agg = bb.groupby('SK_ID_BUREAU').agg(bb_aggregations)
    bb_agg.columns = pd.Index([e[0] + "_" + e[1].upper() for e in bb_agg.columns.tolist()])
    bureau = bureau.join(bb_agg, how='left', on='SK_ID_BUREAU')
    bureau.drop(['SK_ID_BUREAU'], axis=1, inplace= True)
    del bb, bb_agg
    gc.collect()
    
    # Bureau and bureau_balance numeric features
    num_aggregations = {
        'DAYS_CREDIT': ['min', 'max', 'mean', 'var'],
        'DAYS_CREDIT_ENDDATE': ['min', 'max', 'mean'],
        'DAYS_CREDIT_UPDATE': ['mean'],
        'CREDIT_DAY_OVERDUE': ['max', 'mean'],
        'AMT_CREDIT_MAX_OVERDUE': ['mean'],
        'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
        'AMT_CREDIT_SUM_DEBT': ['max', 'mean', 'sum'],
        'AMT_CREDIT_SUM_OVERDUE': ['mean'],
        'AMT_CREDIT_SUM_LIMIT': ['mean', 'sum'],
        'AMT_ANNUITY': ['max', 'mean'],
        'CNT_CREDIT_PROLONG': ['sum'],
        'MONTHS_BALANCE_MIN': ['min'],
        'MONTHS_BALANCE_MAX': ['max'],
        'MONTHS_BALANCE_SIZE': ['mean', 'sum']
    }
    # Bureau and bureau_balance categorical features
    cat_aggregations = {}
    for cat in bureau_cat: cat_aggregations[cat] = ['mean']
    for cat in bb_cat: cat_aggregations[cat + "_MEAN"] = ['mean']
    
    bureau_agg = bureau.groupby('SK_ID_CURR').agg({**num_aggregations, **cat_aggregations})
    bureau_agg.columns = pd.Index(['BURO_' + e[0] + "_" + e[1].upper() for e in bureau_agg.columns.tolist()])
    # Bureau: Active credits - using only numerical aggregations
    active = bureau[bureau['CREDIT_ACTIVE_Active'] == 1]
    active_agg = active.groupby('SK_ID_CURR').agg(num_aggregations)
    active_agg.columns = pd.Index(['ACTIVE_' + e[0] + "_" + e[1].upper() for e in active_agg.columns.tolist()])
    bureau_agg = bureau_agg.join(active_agg, how='left', on='SK_ID_CURR')
    del active, active_agg
    gc.collect()
    # Bureau: Closed credits - using only numerical aggregations
    closed = bureau[bureau['CREDIT_ACTIVE_Closed'] == 1]
    closed_agg = closed.groupby('SK_ID_CURR').agg(num_aggregations)
    closed_agg.columns = pd.Index(['CLOSED_' + e[0] + "_" + e[1].upper() for e in closed_agg.columns.tolist()])
    bureau_agg = bureau_agg.join(closed_agg, how='left', on='SK_ID_CURR')
    del closed, closed_agg, bureau
    gc.collect()
    return bureau_agg
```

In [2]:
from home_credit.load import get_bureau, get_bureau_balance
b = get_bureau()
bb = get_bureau_balance()

display(b.head(3))
display(bb.head(3))

bureau,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.00,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.00,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.50,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.00,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.00,,,0.0,Consumer credit,-21,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1716423,259355,5057750,Active,currency 1,-44,0,-30.0,,0.0,0,11250.00,11250.0,0.0,0.0,Microloan,-19,
1716424,100044,5057754,Closed,currency 1,-2648,0,-2433.0,-2493.0,5476.5,0,38130.84,0.0,0.0,0.0,Consumer credit,-2493,
1716425,100044,5057762,Closed,currency 1,-1809,0,-1628.0,-970.0,,0,15570.00,,,0.0,Consumer credit,-967,
1716426,246829,5057770,Closed,currency 1,-1878,0,-1513.0,-1513.0,,0,36000.00,0.0,0.0,0.0,Consumer credit,-1508,


bureau_balance,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C
...,...,...,...
27299920,5041336,-47,X
27299921,5041336,-48,X
27299922,5041336,-49,X
27299923,5041336,-50,X


In [3]:
from home_credit.lightgbm_kernel import one_hot_encoder
bb, bb_cat = one_hot_encoder(bb, True)
bureau, bureau_cat = one_hot_encoder(b, True)

In [5]:
display(bb_cat)
display(bureau_cat)

['STATUS_0',
 'STATUS_1',
 'STATUS_2',
 'STATUS_3',
 'STATUS_4',
 'STATUS_5',
 'STATUS_C',
 'STATUS_X',
 'STATUS_nan']

['CREDIT_ACTIVE_Active',
 'CREDIT_ACTIVE_Bad debt',
 'CREDIT_ACTIVE_Closed',
 'CREDIT_ACTIVE_Sold',
 'CREDIT_ACTIVE_nan',
 'CREDIT_CURRENCY_currency 1',
 'CREDIT_CURRENCY_currency 2',
 'CREDIT_CURRENCY_currency 3',
 'CREDIT_CURRENCY_currency 4',
 'CREDIT_CURRENCY_nan',
 'CREDIT_TYPE_Another type of loan',
 'CREDIT_TYPE_Car loan',
 'CREDIT_TYPE_Cash loan (non-earmarked)',
 'CREDIT_TYPE_Consumer credit',
 'CREDIT_TYPE_Credit card',
 'CREDIT_TYPE_Interbank credit',
 'CREDIT_TYPE_Loan for business development',
 'CREDIT_TYPE_Loan for purchase of shares (margin lending)',
 'CREDIT_TYPE_Loan for the purchase of equipment',
 'CREDIT_TYPE_Loan for working capital replenishment',
 'CREDIT_TYPE_Microloan',
 'CREDIT_TYPE_Mobile operator loan',
 'CREDIT_TYPE_Mortgage',
 'CREDIT_TYPE_Real estate loan',
 'CREDIT_TYPE_Unknown type of loan',
 'CREDIT_TYPE_nan']

In [6]:
# Bureau balance: Perform aggregations and merge with bureau.csv
bb_aggregations = {'MONTHS_BALANCE': ['min', 'max', 'size']}
for col in bb_cat:
    bb_aggregations[col] = ['mean']

Aggregation of the `bureau_balance` rows:
* mean of each STATUS category
* min, max, size of MONTHS_BALANCE

I think we can do better, cf. my pivot...

Keep both in order to compare 1/ the correlations 2/ the final performance
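
To see what this first aggregation level produces, here is a minimal sketch on an invented five-row `bureau_balance`: the mean of a one-hot STATUS column is the share of months spent in that status, and `size` is the number of monthly records for the credit.

```python
import pandas as pd

bb = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
    "STATUS": ["C", "C", "0", "X", "0"],
})
bb = pd.get_dummies(bb, columns=["STATUS"], dtype=float)

# Same aggregation plan as the kernel: min/max/size plus one mean per status.
aggregations = {"MONTHS_BALANCE": ["min", "max", "size"]}
for col in [c for c in bb.columns if c.startswith("STATUS_")]:
    aggregations[col] = ["mean"]

bb_agg = bb.groupby("SK_ID_BUREAU").agg(aggregations)
# Flatten the (column, function) MultiIndex into single names.
bb_agg.columns = ["_".join([name, func.upper()]) for name, func in bb_agg.columns]
print(bb_agg)
```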

In [7]:
display(bb_aggregations)

{'MONTHS_BALANCE': ['min', 'max', 'size'],
 'STATUS_0': ['mean'],
 'STATUS_1': ['mean'],
 'STATUS_2': ['mean'],
 'STATUS_3': ['mean'],
 'STATUS_4': ['mean'],
 'STATUS_5': ['mean'],
 'STATUS_C': ['mean'],
 'STATUS_X': ['mean'],
 'STATUS_nan': ['mean']}

In [9]:
import pandas as pd
bb_agg = bb.groupby('SK_ID_BUREAU').agg(bb_aggregations)
# Flatten the MultiIndex columns produced by the groupby aggregation
bb_agg.columns = pd.Index([e[0] + "_" + e[1].upper() for e in bb_agg.columns.tolist()])

In [10]:
display(bb_agg)

Unnamed: 0_level_0,MONTHS_BALANCE_MIN,MONTHS_BALANCE_MAX,MONTHS_BALANCE_SIZE,STATUS_0_MEAN,STATUS_1_MEAN,STATUS_2_MEAN,STATUS_3_MEAN,STATUS_4_MEAN,STATUS_5_MEAN,STATUS_C_MEAN,STATUS_X_MEAN,STATUS_nan_MEAN
SK_ID_BUREAU,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
5001709,-96,0,97,0.000000,0.000000,0.0,0.0,0.0,0.0,0.886598,0.113402,0.0
5001710,-82,0,83,0.060241,0.000000,0.0,0.0,0.0,0.0,0.578313,0.361446,0.0
5001711,-3,0,4,0.750000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.250000,0.0
5001712,-18,0,19,0.526316,0.000000,0.0,0.0,0.0,0.0,0.473684,0.000000,0.0
5001713,-21,0,22,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,1.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
6842884,-47,0,48,0.187500,0.000000,0.0,0.0,0.0,0.0,0.416667,0.395833,0.0
6842885,-23,0,24,0.500000,0.000000,0.0,0.0,0.0,0.5,0.000000,0.000000,0.0
6842886,-32,0,33,0.242424,0.000000,0.0,0.0,0.0,0.0,0.757576,0.000000,0.0
6842887,-36,0,37,0.162162,0.000000,0.0,0.0,0.0,0.0,0.837838,0.000000,0.0


Wouldn't a simple concat be more suitable here? To be tested.

In [11]:
bureau = bureau.join(bb_agg, how='left', on='SK_ID_BUREAU')
bureau.drop(['SK_ID_BUREAU'], axis=1, inplace= True)

In [12]:
display(bureau.head(3))

Unnamed: 0,SK_ID_CURR,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY,CREDIT_ACTIVE_Active,CREDIT_ACTIVE_Bad debt,CREDIT_ACTIVE_Closed,CREDIT_ACTIVE_Sold,CREDIT_ACTIVE_nan,CREDIT_CURRENCY_currency 1,CREDIT_CURRENCY_currency 2,CREDIT_CURRENCY_currency 3,CREDIT_CURRENCY_currency 4,CREDIT_CURRENCY_nan,CREDIT_TYPE_Another type of loan,CREDIT_TYPE_Car loan,CREDIT_TYPE_Cash loan (non-earmarked),CREDIT_TYPE_Consumer credit,CREDIT_TYPE_Credit card,CREDIT_TYPE_Interbank credit,CREDIT_TYPE_Loan for business development,CREDIT_TYPE_Loan for purchase of shares (margin lending),CREDIT_TYPE_Loan for the purchase of equipment,CREDIT_TYPE_Loan for working capital replenishment,CREDIT_TYPE_Microloan,CREDIT_TYPE_Mobile operator loan,CREDIT_TYPE_Mortgage,CREDIT_TYPE_Real estate loan,CREDIT_TYPE_Unknown type of loan,CREDIT_TYPE_nan,MONTHS_BALANCE_MIN,MONTHS_BALANCE_MAX,MONTHS_BALANCE_SIZE,STATUS_0_MEAN,STATUS_1_MEAN,STATUS_2_MEAN,STATUS_3_MEAN,STATUS_4_MEAN,STATUS_5_MEAN,STATUS_C_MEAN,STATUS_X_MEAN,STATUS_nan_MEAN
0,215354,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,-131,,False,False,True,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,,,,,,,,,,,,
1,215354,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,-20,,True,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,,,,,,,,,,,,
2,215354,-203,0,528.0,,,0,464323.5,,,0.0,-16,,True,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,,,,,,,,,,,,


Why aggregate again on `bureau`? Because `SK_ID_CURR` is not a primary key in `bureau`: there are several rows per application (11 rows on average per application).
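
This kind of claim is easy to verify directly. A minimal sketch on an invented six-row `bureau` (the real table would be loaded instead): it checks whether the key is unique and counts the rows per application.

```python
import pandas as pd

bureau = pd.DataFrame({
    "SK_ID_CURR": [10, 10, 10, 11, 12, 12],
    "SK_ID_BUREAU": [1, 2, 3, 4, 5, 6],
})

# A primary key must be unique; here the same application appears several times.
is_primary_key = bureau["SK_ID_CURR"].is_unique
rows_per_application = bureau.groupby("SK_ID_CURR").size()
mean_rows = rows_per_application.mean()

print(is_primary_key, mean_rows)
```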

There is a lot going on here: the prerequisite is therefore to have finished my exploratory analysis of all the `bureau` columns, to understand clearly what each field represents.

Depending on the case, he picks one or more of the functions min, max, mean, sum, var: why? Understand his logic and go beyond it if possible.

In [14]:
# Bureau and bureau_balance numeric features
num_aggregations = {
    'DAYS_CREDIT': ['min', 'max', 'mean', 'var'],
    'DAYS_CREDIT_ENDDATE': ['min', 'max', 'mean'],
    'DAYS_CREDIT_UPDATE': ['mean'],
    'CREDIT_DAY_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_MAX_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM_LIMIT': ['mean', 'sum'],
    'AMT_ANNUITY': ['max', 'mean'],
    'CNT_CREDIT_PROLONG': ['sum'],
    'MONTHS_BALANCE_MIN': ['min'],
    'MONTHS_BALANCE_MAX': ['max'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum']
}
# Bureau and bureau_balance categorical features
cat_aggregations = {}
for cat in bureau_cat:
    cat_aggregations[cat] = ['mean']
for cat in bb_cat:
    cat_aggregations[cat + "_MEAN"] = ['mean']

In [15]:
display(cat_aggregations)

{'CREDIT_ACTIVE_Active': ['mean'],
 'CREDIT_ACTIVE_Bad debt': ['mean'],
 'CREDIT_ACTIVE_Closed': ['mean'],
 'CREDIT_ACTIVE_Sold': ['mean'],
 'CREDIT_ACTIVE_nan': ['mean'],
 'CREDIT_CURRENCY_currency 1': ['mean'],
 'CREDIT_CURRENCY_currency 2': ['mean'],
 'CREDIT_CURRENCY_currency 3': ['mean'],
 'CREDIT_CURRENCY_currency 4': ['mean'],
 'CREDIT_CURRENCY_nan': ['mean'],
 'CREDIT_TYPE_Another type of loan': ['mean'],
 'CREDIT_TYPE_Car loan': ['mean'],
 'CREDIT_TYPE_Cash loan (non-earmarked)': ['mean'],
 'CREDIT_TYPE_Consumer credit': ['mean'],
 'CREDIT_TYPE_Credit card': ['mean'],
 'CREDIT_TYPE_Interbank credit': ['mean'],
 'CREDIT_TYPE_Loan for business development': ['mean'],
 'CREDIT_TYPE_Loan for purchase of shares (margin lending)': ['mean'],
 'CREDIT_TYPE_Loan for the purchase of equipment': ['mean'],
 'CREDIT_TYPE_Loan for working capital replenishment': ['mean'],
 'CREDIT_TYPE_Microloan': ['mean'],
 'CREDIT_TYPE_Mobile operator loan': ['mean'],
 'CREDIT_TYPE_Mortgage': ['mean'],
 ...

Second level of aggregation, on the numeric variables and on the categorical ones:

In [16]:
bureau_agg = bureau.groupby('SK_ID_CURR').agg({**num_aggregations, **cat_aggregations})
bureau_agg.columns = pd.Index(['BURO_' + e[0] + "_" + e[1].upper() for e in bureau_agg.columns.tolist()])
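
One point worth keeping in mind when reading columns such as `BURO_STATUS_0_MEAN_MEAN`: they are means of per-credit means, so each credit weighs the same regardless of how many monthly records it has. The sketch below, on invented toy data, contrasts this with the share computed over all monthly records of an application. It is only an illustration of the difference, not a statement about which one is preferable.

```python
import pandas as pd

# Credit 1 has four monthly records, credit 2 has a single one.
bb = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 1, 2],
    "STATUS_0": [1.0, 1.0, 1.0, 1.0, 0.0],
})
bureau = pd.DataFrame({"SK_ID_CURR": [100, 100], "SK_ID_BUREAU": [1, 2]})

# Two-level aggregation, as in the kernel: mean per credit, then mean per application.
per_credit = bb.groupby("SK_ID_BUREAU")["STATUS_0"].mean().rename("STATUS_0_MEAN")
joined = bureau.join(per_credit, on="SK_ID_BUREAU")
mean_of_means = joined.groupby("SK_ID_CURR")["STATUS_0_MEAN"].mean()

# Single-level alternative: pool all monthly records of the application.
pooled = bb.merge(bureau, on="SK_ID_BUREAU").groupby("SK_ID_CURR")["STATUS_0"].mean()

print(mean_of_means.loc[100], pooled.loc[100])
```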

In [18]:
display(bureau_agg.head(3))

SK_ID_CURR,BURO_DAYS_CREDIT_MIN,BURO_DAYS_CREDIT_MAX,BURO_DAYS_CREDIT_MEAN,BURO_DAYS_CREDIT_VAR,BURO_DAYS_CREDIT_ENDDATE_MIN,BURO_DAYS_CREDIT_ENDDATE_MAX,BURO_DAYS_CREDIT_ENDDATE_MEAN,BURO_DAYS_CREDIT_UPDATE_MEAN,BURO_CREDIT_DAY_OVERDUE_MAX,BURO_CREDIT_DAY_OVERDUE_MEAN,BURO_AMT_CREDIT_MAX_OVERDUE_MEAN,BURO_AMT_CREDIT_SUM_MAX,BURO_AMT_CREDIT_SUM_MEAN,BURO_AMT_CREDIT_SUM_SUM,BURO_AMT_CREDIT_SUM_DEBT_MAX,BURO_AMT_CREDIT_SUM_DEBT_MEAN,BURO_AMT_CREDIT_SUM_DEBT_SUM,BURO_AMT_CREDIT_SUM_OVERDUE_MEAN,BURO_AMT_CREDIT_SUM_LIMIT_MEAN,BURO_AMT_CREDIT_SUM_LIMIT_SUM,BURO_AMT_ANNUITY_MAX,BURO_AMT_ANNUITY_MEAN,BURO_CNT_CREDIT_PROLONG_SUM,BURO_MONTHS_BALANCE_MIN_MIN,BURO_MONTHS_BALANCE_MAX_MAX,BURO_MONTHS_BALANCE_SIZE_MEAN,BURO_MONTHS_BALANCE_SIZE_SUM,BURO_CREDIT_ACTIVE_Active_MEAN,BURO_CREDIT_ACTIVE_Bad debt_MEAN,BURO_CREDIT_ACTIVE_Closed_MEAN,BURO_CREDIT_ACTIVE_Sold_MEAN,BURO_CREDIT_ACTIVE_nan_MEAN,BURO_CREDIT_CURRENCY_currency 1_MEAN,BURO_CREDIT_CURRENCY_currency 2_MEAN,BURO_CREDIT_CURRENCY_currency 3_MEAN,BURO_CREDIT_CURRENCY_currency 4_MEAN,BURO_CREDIT_CURRENCY_nan_MEAN,BURO_CREDIT_TYPE_Another type of loan_MEAN,BURO_CREDIT_TYPE_Car loan_MEAN,BURO_CREDIT_TYPE_Cash loan (non-earmarked)_MEAN,BURO_CREDIT_TYPE_Consumer credit_MEAN,BURO_CREDIT_TYPE_Credit card_MEAN,BURO_CREDIT_TYPE_Interbank credit_MEAN,BURO_CREDIT_TYPE_Loan for business development_MEAN,BURO_CREDIT_TYPE_Loan for purchase of shares (margin lending)_MEAN,BURO_CREDIT_TYPE_Loan for the purchase of equipment_MEAN,BURO_CREDIT_TYPE_Loan for working capital replenishment_MEAN,BURO_CREDIT_TYPE_Microloan_MEAN,BURO_CREDIT_TYPE_Mobile operator loan_MEAN,BURO_CREDIT_TYPE_Mortgage_MEAN,BURO_CREDIT_TYPE_Real estate loan_MEAN,BURO_CREDIT_TYPE_Unknown type of loan_MEAN,BURO_CREDIT_TYPE_nan_MEAN,BURO_STATUS_0_MEAN_MEAN,BURO_STATUS_1_MEAN_MEAN,BURO_STATUS_2_MEAN_MEAN,BURO_STATUS_3_MEAN_MEAN,BURO_STATUS_4_MEAN_MEAN,BURO_STATUS_5_MEAN_MEAN,BURO_STATUS_C_MEAN_MEAN,BURO_STATUS_X_MEAN_MEAN,BURO_STATUS_nan_MEAN_MEAN
100001,-1572,-49,-735.0,240043.666667,-1329.0,1778.0,82.428571,-93.142857,0,0.0,,378000.0,207623.571429,1453365.0,373239.0,85240.928571,596686.5,0.0,0.0,0.0,10822.5,3545.357143,0,-51.0,0.0,24.571429,172.0,0.428571,0.0,0.571429,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.336651,0.007519,0.0,0.0,0.0,0.0,0.44124,0.21459,0.0
100002,-1437,-103,-874.0,186150.0,-1072.0,780.0,-349.0,-499.875,0,0.0,1681.029,450000.0,108131.945625,865055.565,245781.0,49156.2,245781.0,0.0,7997.14125,31988.565,0.0,0.0,0,-47.0,0.0,13.75,110.0,0.25,0.0,0.75,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.40696,0.255682,0.0,0.0,0.0,0.0,0.175426,0.161932,0.0
100003,-2586,-606,-1400.75,827783.583333,-2434.0,1216.0,-544.5,-816.0,0,0.0,0.0,810000.0,254350.125,1017400.5,0.0,0.0,0.0,0.0,202500.0,810000.0,,,0,,,,0.0,0.25,0.0,0.75,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,


But it does not stop there: the kernel adds a specific aggregation, restricted to the numeric variables and to the subset of rows in the two categories Active and Closed of the CREDIT_ACTIVE variable.

At this point, I do not fully grasp what it is doing.

My tentative reading: this amounts to aggregating along the two axes SK_ID_CURR and CREDIT_ACTIVE, then pivoting CREDIT_ACTIVE into columns, while ultimately keeping only the two main categories (Sold and Bad debt are dropped). Yet these two rarer categories look informative to me: Bad debt seems to clearly flag a client who defaulted (there are only 21 such rows), and Sold... the motivation is hard to guess, but in a world of securitization, bad loans tend to be the ones that get sold off.
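
To check this reading, here is a minimal sketch on made-up data (the three-column frame and its values are invented for illustration, they are not rows of `bureau`). It compares the kernel-style approach, filtering on one category and then aggregating per client, with a two-key aggregation followed by `unstack`.

```python
import pandas as pd

# Toy data, invented for illustration
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 2],
    "CREDIT_ACTIVE": ["Active", "Closed", "Closed", "Active", "Sold"],
    "AMT_CREDIT_SUM": [100.0, 200.0, 400.0, 50.0, 70.0],
})

# Kernel-style: keep one category, then aggregate per client
active = bureau[bureau["CREDIT_ACTIVE"] == "Active"]
active_agg = active.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"].agg(["mean", "sum"])
active_agg.columns = [f"ACTIVE_AMT_CREDIT_SUM_{stat.upper()}" for stat in active_agg.columns]

# Two-axis aggregation, then pivot CREDIT_ACTIVE into columns
pivoted = (
    bureau.groupby(["SK_ID_CURR", "CREDIT_ACTIVE"])["AMT_CREDIT_SUM"]
    .agg(["mean", "sum"])
    .unstack("CREDIT_ACTIVE")
)
# Columns are (stat, category) pairs: flatten them with the kernel's naming scheme
pivoted.columns = [
    f"{category.upper()}_AMT_CREDIT_SUM_{stat.upper()}"
    for stat, category in pivoted.columns
]

# The ACTIVE_* block of the pivot matches the filtered aggregation
same = pivoted[active_agg.columns].equals(active_agg)
print(same)
print(sorted(pivoted.columns))
```

In the pivoted form every category gets its own block of columns, so `Sold` and `Bad debt` come along at no extra cost; whether they actually help the model remains to be tested.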

In [22]:
b = get_bureau()
display(b.head(3))
display(b.CREDIT_ACTIVE.value_counts())

bureau,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,


CREDIT_ACTIVE
Closed      1079273
Active       630607
Sold           6527
Bad debt         21
Name: count, dtype: int64

## `previous_applications`

```Python
import gc

import numpy as np
import pandas as pd

# Preprocess previous_applications.csv
def previous_applications(num_rows = None, nan_as_category = True):
    prev = pd.read_csv('../input/previous_application.csv', nrows = num_rows)
    prev, cat_cols = one_hot_encoder(prev, nan_as_category= True)
    # Days 365243 values -> nan (365243 is treated as a missing-value code)
    days_cols = ['DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
                 'DAYS_LAST_DUE', 'DAYS_TERMINATION']
    # Assign the result back: a chained `inplace` replace may not update `prev`
    prev[days_cols] = prev[days_cols].replace(365243, np.nan)
    # Add feature: value ask / value received percentage
    prev['APP_CREDIT_PERC'] = prev['AMT_APPLICATION'] / prev['AMT_CREDIT']
    # Previous applications numeric features
    num_aggregations = {
        'AMT_ANNUITY': ['min', 'max', 'mean'],
        'AMT_APPLICATION': ['min', 'max', 'mean'],
        'AMT_CREDIT': ['min', 'max', 'mean'],
        'APP_CREDIT_PERC': ['min', 'max', 'mean', 'var'],
        'AMT_DOWN_PAYMENT': ['min', 'max', 'mean'],
        'AMT_GOODS_PRICE': ['min', 'max', 'mean'],
        'HOUR_APPR_PROCESS_START': ['min', 'max', 'mean'],
        'RATE_DOWN_PAYMENT': ['min', 'max', 'mean'],
        'DAYS_DECISION': ['min', 'max', 'mean'],
        'CNT_PAYMENT': ['mean', 'sum'],
    }
    # Previous applications categorical features
    cat_aggregations = {}
    for cat in cat_cols:
        cat_aggregations[cat] = ['mean']
    
    prev_agg = prev.groupby('SK_ID_CURR').agg({**num_aggregations, **cat_aggregations})
    prev_agg.columns = pd.Index(['PREV_' + e[0] + "_" + e[1].upper() for e in prev_agg.columns.tolist()])
    # Previous Applications: Approved Applications - only numerical features
    approved = prev[prev['NAME_CONTRACT_STATUS_Approved'] == 1]
    approved_agg = approved.groupby('SK_ID_CURR').agg(num_aggregations)
    approved_agg.columns = pd.Index(['APPROVED_' + e[0] + "_" + e[1].upper() for e in approved_agg.columns.tolist()])
    prev_agg = prev_agg.join(approved_agg, how='left', on='SK_ID_CURR')
    # Previous Applications: Refused Applications - only numerical features
    refused = prev[prev['NAME_CONTRACT_STATUS_Refused'] == 1]
    refused_agg = refused.groupby('SK_ID_CURR').agg(num_aggregations)
    refused_agg.columns = pd.Index(['REFUSED_' + e[0] + "_" + e[1].upper() for e in refused_agg.columns.tolist()])
    prev_agg = prev_agg.join(refused_agg, how='left', on='SK_ID_CURR')
    del refused, refused_agg, approved, approved_agg, prev
    gc.collect()
    return prev_agg
```

## `pos_cash`

```Python
import pandas as pd
import gc

# Preprocess POS_CASH_balance.csv
def pos_cash(num_rows = None, nan_as_category = True):
    pos = pd.read_csv('../input/POS_CASH_balance.csv', nrows = num_rows)
    pos, cat_cols = one_hot_encoder(pos, nan_as_category= True)
    # Features
    aggregations = {
        'MONTHS_BALANCE': ['max', 'mean', 'size'],
        'SK_DPD': ['max', 'mean'],
        'SK_DPD_DEF': ['max', 'mean']
    }
    for cat in cat_cols:
        aggregations[cat] = ['mean']
    
    pos_agg = pos.groupby('SK_ID_CURR').agg(aggregations)
    pos_agg.columns = pd.Index(['POS_' + e[0] + "_" + e[1].upper() for e in pos_agg.columns.tolist()])
    # Count pos cash accounts
    pos_agg['POS_COUNT'] = pos.groupby('SK_ID_CURR').size()
    del pos
    gc.collect()
    return pos_agg
```

## `installments_payments`

```Python
import pandas as pd
import gc

# Preprocess installments_payments.csv
def installments_payments(num_rows = None, nan_as_category = True):
    ins = pd.read_csv('../input/installments_payments.csv', nrows = num_rows)
    ins, cat_cols = one_hot_encoder(ins, nan_as_category= True)
    # Percentage and difference paid in each installment (amount paid and installment value)
    ins['PAYMENT_PERC'] = ins['AMT_PAYMENT'] / ins['AMT_INSTALMENT']
    ins['PAYMENT_DIFF'] = ins['AMT_INSTALMENT'] - ins['AMT_PAYMENT']
    # Days past due and days before due (no negative values)
    ins['DPD'] = ins['DAYS_ENTRY_PAYMENT'] - ins['DAYS_INSTALMENT']
    ins['DBD'] = ins['DAYS_INSTALMENT'] - ins['DAYS_ENTRY_PAYMENT']
    ins['DPD'] = ins['DPD'].apply(lambda x: x if x > 0 else 0)
    ins['DBD'] = ins['DBD'].apply(lambda x: x if x > 0 else 0)
    # Features: Perform aggregations
    aggregations = {
        'NUM_INSTALMENT_VERSION': ['nunique'],
        'DPD': ['max', 'mean', 'sum'],
        'DBD': ['max', 'mean', 'sum'],
        'PAYMENT_PERC': ['max', 'mean', 'sum', 'var'],
        'PAYMENT_DIFF': ['max', 'mean', 'sum', 'var'],
        'AMT_INSTALMENT': ['max', 'mean', 'sum'],
        'AMT_PAYMENT': ['min', 'max', 'mean', 'sum'],
        'DAYS_ENTRY_PAYMENT': ['max', 'mean', 'sum']
    }
    for cat in cat_cols:
        aggregations[cat] = ['mean']
    ins_agg = ins.groupby('SK_ID_CURR').agg(aggregations)
    ins_agg.columns = pd.Index(['INSTAL_' + e[0] + "_" + e[1].upper() for e in ins_agg.columns.tolist()])
    # Count installments accounts
    ins_agg['INSTAL_COUNT'] = ins.groupby('SK_ID_CURR').size()
    del ins
    gc.collect()
    return ins_agg
```

In [None]:
import pandas as pd
import gc
from sklearn.preprocessing import OneHotEncoder

# One-hot encoding for categorical columns with get_dummies
def one_hot_encoder(df, nan_as_category=True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns


# Preprocess installments_payments.csv
def installments_payments(num_rows = None, nan_as_category = True):
    ins = pd.read_csv('../input/installments_payments.csv', nrows = num_rows)
    ins, cat_cols = one_hot_encoder(ins, nan_as_category= True)
    # Percentage and difference paid in each installment (amount paid and installment value)
    ins['PAYMENT_PERC'] = ins['AMT_PAYMENT'] / ins['AMT_INSTALMENT']
    ins['PAYMENT_DIFF'] = ins['AMT_INSTALMENT'] - ins['AMT_PAYMENT']
    # Days past due and days before due (no negative values)
    ins['DPD'] = ins['DAYS_ENTRY_PAYMENT'] - ins['DAYS_INSTALMENT']
    ins['DBD'] = ins['DAYS_INSTALMENT'] - ins['DAYS_ENTRY_PAYMENT']
    ins['DPD'] = ins['DPD'].apply(lambda x: x if x > 0 else 0)
    ins['DBD'] = ins['DBD'].apply(lambda x: x if x > 0 else 0)
    # Features: Perform aggregations
    aggregations = {
        'NUM_INSTALMENT_VERSION': ['nunique'],
        'DPD': ['max', 'mean', 'sum'],
        'DBD': ['max', 'mean', 'sum'],
        'PAYMENT_PERC': ['max', 'mean', 'sum', 'var'],
        'PAYMENT_DIFF': ['max', 'mean', 'sum', 'var'],
        'AMT_INSTALMENT': ['max', 'mean', 'sum'],
        'AMT_PAYMENT': ['min', 'max', 'mean', 'sum'],
        'DAYS_ENTRY_PAYMENT': ['max', 'mean', 'sum']
    }
    for cat in cat_cols:
        aggregations[cat] = ['mean']
    ins_agg = ins.groupby('SK_ID_CURR').agg(aggregations)
    ins_agg.columns = pd.Index(['INSTAL_' + e[0] + "_" + e[1].upper() for e in ins_agg.columns.tolist()])
    # Count installments accounts
    ins_agg['INSTAL_COUNT'] = ins.groupby('SK_ID_CURR').size()
    del ins
    gc.collect()
    return ins_agg

### Loading the tables

In [None]:
from home_credit.load import get_installments_payments
df = get_installments_payments()
display(df)

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\installments_payments.pqt


installments_payments,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
0,1054186,161674,1.0,6,-1180.0,-1187.0,6948.360,6948.360
1,1330831,151639,0.0,34,-2156.0,-2156.0,1716.525,1716.525
2,2085231,193053,2.0,1,-63.0,-63.0,25425.000,25425.000
3,2452527,199697,1.0,3,-2418.0,-2426.0,24350.130,24350.130
4,2714724,167756,1.0,2,-1383.0,-1366.0,2165.040,2160.585
...,...,...,...,...,...,...,...,...
13605396,2186857,428057,0.0,66,-1624.0,,67.500,
13605397,1310347,414406,0.0,47,-1539.0,,67.500,
13605398,1308766,402199,0.0,43,-7.0,,43737.435,
13605399,1062206,409297,0.0,43,-1986.0,,67.500,


In [None]:
from home_credit.load import _load_installments_payments
from numpy import where
import time


# Preprocess installments_payments.csv
def installments_payments(num_rows=None, nan_as_category=True):
    #ins = pd.read_csv('../input/installments_payments.csv', nrows = num_rows)
    ins = _load_installments_payments(nrows=num_rows)
    ins, cat_cols = one_hot_encoder(ins, nan_as_category=nan_as_category)

    df_2 = _load_installments_payments()

    t = -time.time()
    # ✔ Readability gain
    # ✔ Performance gain: one order of magnitude
    df_2.eval(
        """
        PAYMENT_PERC = AMT_PAYMENT / AMT_INSTALMENT
        PAYMENT_DIFF = AMT_INSTALMENT - AMT_PAYMENT
        DPD = DAYS_ENTRY_PAYMENT - DAYS_INSTALMENT
        DBD = DAYS_INSTALMENT - DAYS_ENTRY_PAYMENT
        DPD = @where(DPD > 0, DPD, 0)
        DBD = @where(DBD > 0, DBD, 0)
        """,
        inplace=True, engine="numexpr"
    )
    t += time.time()
    print(t)
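
The clamping step can be checked in isolation. The sketch below is not the `eval` version used above but a plain NumPy equivalent, run on three pairs of dates taken from the `installments_payments` extract shown earlier (one of them with a missing payment date). It also shows that `clip(lower=0)`, a tempting shortcut, is not equivalent because it keeps NA values.

```python
import numpy as np
import pandas as pd

ins = pd.DataFrame({
    "DAYS_INSTALMENT": [-1180.0, -1383.0, -1624.0],
    "DAYS_ENTRY_PAYMENT": [-1187.0, -1366.0, np.nan],
})
# Raw delay: negative when paid early, positive when paid late,
# NaN when the payment date is missing
delay = ins["DAYS_ENTRY_PAYMENT"] - ins["DAYS_INSTALMENT"]

# Reference kernel: Python-level lambda, one call per row
dpd_apply = delay.apply(lambda x: x if x > 0 else 0)
# Vectorized equivalent: NaN > 0 is False, so NaN also becomes 0
dpd_where = pd.Series(np.where(delay > 0, delay, 0), index=delay.index)
# Not equivalent: clip leaves NaN untouched
dpd_clip = delay.clip(lower=0)

print(dpd_apply.tolist(), dpd_where.tolist(), dpd_clip.tolist())
```

Since rows with a missing payment date exist in the real table, only the `where` form stays functionally equivalent to the reference kernel.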

## `credit_card_balance`

```Python
import pandas as pd
import gc

# Preprocess credit_card_balance.csv
def credit_card_balance(num_rows = None, nan_as_category = True):
    cc = pd.read_csv('../input/credit_card_balance.csv', nrows = num_rows)
    cc, cat_cols = one_hot_encoder(cc, nan_as_category= True)
    # General aggregations
    cc.drop(['SK_ID_PREV'], axis= 1, inplace = True)
    cc_agg = cc.groupby('SK_ID_CURR').agg(['min', 'max', 'mean', 'sum', 'var'])
    cc_agg.columns = pd.Index(['CC_' + e[0] + "_" + e[1].upper() for e in cc_agg.columns.tolist()])
    # Count credit card lines
    cc_agg['CC_COUNT'] = cc.groupby('SK_ID_CURR').size()
    del cc
    gc.collect()
    return cc_agg
```

## `kfold_lightgbm`

```Python
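import gc

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, StratifiedKFold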

# LightGBM GBDT with KFold or Stratified KFold
# Parameters from Tilii kernel: https://www.kaggle.com/tilii7/olivier-lightgbm-parameters-by-bayesian-opt/code
def kfold_lightgbm(df, num_folds, stratified = False, debug= False):
    # Divide in training/validation and test data
    train_df = df[df['TARGET'].notnull()]
    test_df = df[df['TARGET'].isnull()]
    print("Starting LightGBM. Train shape: {}, test shape: {}".format(train_df.shape, test_df.shape))
    del df
    gc.collect()
    # Cross validation model
    if stratified:
        folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=1001)
    else:
        folds = KFold(n_splits= num_folds, shuffle=True, random_state=1001)
    # Create arrays and dataframes to store results
    oof_preds = np.zeros(train_df.shape[0])
    sub_preds = np.zeros(test_df.shape[0])
    feature_importance_df = pd.DataFrame()
    feats = [f for f in train_df.columns if f not in ['TARGET','SK_ID_CURR','SK_ID_BUREAU','SK_ID_PREV','index']]
    
    for n_fold, (train_idx, valid_idx) in enumerate(folds.split(train_df[feats], train_df['TARGET'])):
        train_x, train_y = train_df[feats].iloc[train_idx], train_df['TARGET'].iloc[train_idx]
        valid_x, valid_y = train_df[feats].iloc[valid_idx], train_df['TARGET'].iloc[valid_idx]

        # LightGBM parameters found by Bayesian optimization
        clf = LGBMClassifier(
            nthread=4,
            n_estimators=10000,
            learning_rate=0.02,
            num_leaves=34,
            colsample_bytree=0.9497036,
            subsample=0.8715623,
            max_depth=8,
            reg_alpha=0.041545473,
            reg_lambda=0.0735294,
            min_split_gain=0.0222415,
            min_child_weight=39.3259775,
            silent=-1,
            verbose=-1, )

        clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)], 
            eval_metric= 'auc', verbose= 200, early_stopping_rounds= 200)

        oof_preds[valid_idx] = clf.predict_proba(valid_x, num_iteration=clf.best_iteration_)[:, 1]
        sub_preds += clf.predict_proba(test_df[feats], num_iteration=clf.best_iteration_)[:, 1] / folds.n_splits

        fold_importance_df = pd.DataFrame()
        fold_importance_df["feature"] = feats
        fold_importance_df["importance"] = clf.feature_importances_
        fold_importance_df["fold"] = n_fold + 1
        feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
        print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(valid_y, oof_preds[valid_idx])))
        del clf, train_x, train_y, valid_x, valid_y
        gc.collect()

    print('Full AUC score %.6f' % roc_auc_score(train_df['TARGET'], oof_preds))
    # Write submission file and plot feature importance
    if not debug:
        test_df['TARGET'] = sub_preds
        test_df[['SK_ID_CURR', 'TARGET']].to_csv(submission_file_name, index= False)
    display_importances(feature_importance_df)
    return feature_importance_df
```

## `display_importances`

```Python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Display/plot feature importance
def display_importances(feature_importance_df_):
    cols = feature_importance_df_[["feature", "importance"]].groupby("feature").mean().sort_values(by="importance", ascending=False)[:40].index
    best_features = feature_importance_df_.loc[feature_importance_df_.feature.isin(cols)]
    plt.figure(figsize=(8, 10))
    sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    plt.savefig('lgbm_importances01.png')
```

## `main`

```Python
import pandas as pd
import gc

def main(debug = False):
    num_rows = 10000 if debug else None
    df = application_train_test(num_rows)
    with timer("Process bureau and bureau_balance"):
        bureau = bureau_and_balance(num_rows)
        print("Bureau df shape:", bureau.shape)
        df = df.join(bureau, how='left', on='SK_ID_CURR')
        del bureau
        gc.collect()
    with timer("Process previous_applications"):
        prev = previous_applications(num_rows)
        print("Previous applications df shape:", prev.shape)
        df = df.join(prev, how='left', on='SK_ID_CURR')
        del prev
        gc.collect()
    with timer("Process POS-CASH balance"):
        pos = pos_cash(num_rows)
        print("Pos-cash balance df shape:", pos.shape)
        df = df.join(pos, how='left', on='SK_ID_CURR')
        del pos
        gc.collect()
    with timer("Process installments payments"):
        ins = installments_payments(num_rows)
        print("Installments payments df shape:", ins.shape)
        df = df.join(ins, how='left', on='SK_ID_CURR')
        del ins
        gc.collect()
    with timer("Process credit card balance"):
        cc = credit_card_balance(num_rows)
        print("Credit card balance df shape:", cc.shape)
        df = df.join(cc, how='left', on='SK_ID_CURR')
        del cc
        gc.collect()
    with timer("Run LightGBM with kfold"):
        feat_importance = kfold_lightgbm(df, num_folds= 10, stratified= False, debug= debug)

if __name__ == "__main__":
    submission_file_name = "submission_kernel02.csv"
    with timer("Full model run"):
        main()
```

# Appendices

The goal is to understand and improve the reference version.

We therefore proceed in small steps, starting from the loading of the table.

## Comparing the versions

This section demonstrates the operations used to compare the original and modified versions; the same operations also served for testing and debugging.

### Alignment

The dataframes to compare may be equal up to a permutation of their rows or columns.

The `align_df2_on_df1` utility realigns our dataframes, which are often sorted and reindexed, with the raw versions as loaded by the original kernel.
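
The implementation of `align_df2_on_df1` is not reproduced in this notebook. As a rough idea of what such a realignment involves, the sketch below uses a hypothetical helper, `align_on_key`, which assumes that the key is unique and holds the same values in both frames; it is not the `pepper.pd_utils` code.

```python
import pandas as pd

def align_on_key(key, df1, df2):
    """Hypothetical helper: reorder df2's rows and columns to match df1.

    Assumes `key` is unique and holds the same values in both frames.
    """
    # A left merge on df1's key column yields df2's rows in df1's key order
    aligned = df1[[key]].merge(df2, on=key, how="left")
    # Same column order and same index labels as df1
    return aligned[list(df1.columns)].set_axis(df1.index, axis=0)

# Toy frames: same content, rows and columns permuted, different index
df1 = pd.DataFrame({"SK_ID_CURR": [3, 1, 2], "A": [30, 10, 20], "B": ["c", "a", "b"]})
df2 = pd.DataFrame(
    {"B": ["a", "b", "c"], "SK_ID_CURR": [1, 2, 3], "A": [10, 20, 30]},
    index=[7, 8, 9],
)
aligned = align_on_key("SK_ID_CURR", df1, df2)
print(aligned)
```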

In [None]:
from pepper.pd_utils import align_df2_on_df1
app_v1 = load_application_v1()
app_v2 = align_df2_on_df1("SK_ID_CURR", app_v1, load_application_v2())
display(app_v1)
display(app_v2)

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\application_train.pqt
load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\application_test.pqt


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1.0,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0.0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0.0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0.0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0.0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
356250,456221,,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
356251,456222,,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,...,0,0,0,0,,,,,,
356252,456223,,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,...,0,0,0,0,0.0,0.0,0.0,0.0,3.0,1.0
356253,456224,,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0


application,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1.0,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0.0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0.0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0.0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0.0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
356250,456221,,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
356251,456222,,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,...,0,0,0,0,,,,,,
356252,456223,,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,...,0,0,0,0,0.0,0.0,0.0,0.0,3.0,1.0
356253,456224,,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0


### Comparison mask

`pepper.pd_utils` provides two functions, `df_eq` and `df_neq`, that avoid the pitfall of `x == x` returning `False` when `x` is NA. Their boolean result is then reduced with `all` and `any`.
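
The pitfall, and one way around it, can be shown on a toy example. `na_safe_neq` below is a hypothetical stand-in written for this illustration; the actual `df_neq` may be implemented differently.

```python
import numpy as np
import pandas as pd

def na_safe_neq(df1, df2):
    """Hypothetical mask: True where values differ, False where both are NA."""
    both_na = df1.isna() & df2.isna()
    return (df1 != df2) & ~both_na

# Toy frames: identical except for the last AMT value
v1 = pd.DataFrame({"TARGET": [1.0, np.nan, 0.0], "AMT": [10.0, 20.0, 30.0]})
v2 = pd.DataFrame({"TARGET": [1.0, np.nan, 0.0], "AMT": [10.0, 20.0, 31.0]})

naive = v1 != v2             # also flags the NaN/NaN pair as a difference
mask = na_safe_neq(v1, v2)   # flags only the real change in AMT

print(naive.sum().sum(), mask.sum().sum())
print(mask.any())            # which columns differ
print(mask.any(axis=1))      # which rows differ
```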

In [None]:
from pepper.pd_utils import df_neq
is_diff = df_neq(app_v1, app_v2)
print("n_diffs:", is_diff.sum().sum())
print("n_diffs by cols:\n", is_diff.sum(), sep="")
print("n_diffs by rows:\n", is_diff.sum(axis=1), sep="")
display(is_diff.any())
display(is_diff.any(axis=1))

n_diffs: 0
n_diffs by cols:
SK_ID_CURR                    0
TARGET                        0
NAME_CONTRACT_TYPE            0
CODE_GENDER                   0
FLAG_OWN_CAR                  0
                             ..
AMT_REQ_CREDIT_BUREAU_DAY     0
AMT_REQ_CREDIT_BUREAU_WEEK    0
AMT_REQ_CREDIT_BUREAU_MON     0
AMT_REQ_CREDIT_BUREAU_QRT     0
AMT_REQ_CREDIT_BUREAU_YEAR    0
Length: 122, dtype: int64
n_diffs by rows:
0         0
1         0
2         0
3         0
4         0
         ..
356250    0
356251    0
356252    0
356253    0
356254    0
Length: 356255, dtype: int64


SK_ID_CURR                    False
TARGET                        False
NAME_CONTRACT_TYPE            False
CODE_GENDER                   False
FLAG_OWN_CAR                  False
                              ...  
AMT_REQ_CREDIT_BUREAU_DAY     False
AMT_REQ_CREDIT_BUREAU_WEEK    False
AMT_REQ_CREDIT_BUREAU_MON     False
AMT_REQ_CREDIT_BUREAU_QRT     False
AMT_REQ_CREDIT_BUREAU_YEAR    False
Length: 122, dtype: bool

0         False
1         False
2         False
3         False
4         False
          ...  
356250    False
356251    False
356252    False
356253    False
356254    False
Length: 356255, dtype: bool

### Detecting local `dtype` variations

In [None]:
dtypes_diff = app_v1.dtypes != app_v2.dtypes
if (~dtypes_diff).all():
    print("dtypes are aligned")
else:
    print("dtypes diffs:")
    display(app_v1.dtypes[dtypes_diff])
    display(app_v2.dtypes[dtypes_diff])

dtypes are aligned


### Computing the difference (the distance) between coefficients

In [None]:
from pepper.pd_utils import safe_diff_series
display(safe_diff_series(app_v1.SK_ID_CURR, app_v2.SK_ID_CURR).sum())
display(safe_diff_series(app_v1.TARGET, app_v2.TARGET).sum())

0

0.0

In [None]:
display(app_v1[is_diff.TARGET].TARGET)
display(app_v2[is_diff.TARGET].TARGET)

Series([], Name: TARGET, dtype: float64)

Series([], Name: TARGET, dtype: float64)

In [None]:
from pepper.pd_utils import safe_diff_dataframe
diff = safe_diff_dataframe(app_v1, app_v2)

As a finishing touch: when differences exist, the difference matrix can easily be filtered down to the affected rows and columns.

In [None]:
display(diff.loc[is_diff.any(axis=1), is_diff.any()])

Unnamed: 0,TARGET,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,OWN_CAR_AGE,OCCUPATION_TYPE,CNT_FAM_MEMBERS,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,...,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,0.0,0.0,0.0,,,,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,,,,0.0,0.0,0.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,,0.0,,0.0,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,,,,0.0,,0.0,,...,0.0,0.0,0.0,0.0,,,,,,
4,0.0,0.0,0.0,,,,0.0,,0.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
356250,,0.0,0.0,,,,0.0,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
356251,,0.0,0.0,,,,0.0,,0.0,,...,0.0,0.0,0.0,0.0,,,,,,
356252,,0.0,0.0,,0.0,,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
356253,,0.0,0.0,,,,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## One-hot encoding

The `application_{train|test}` tables typically contain many categorical variables.

The criterion chosen by the reference version, `dtype == object`, is debatable.

We would benefit from one-hot encoding, for example, the binary variables whose `dtype` is `int`.

Many integer variables with a small number of modalities would also benefit from being treated as categories, including when they represent cardinals (number of elevators or number of children, for example), and not only when they are ordinals (hour of the day, for example).

Our improvement therefore focuses on the selection of the variables considered categorical, with a default selection that is functionally equivalent to the reference kernel.
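
To make the idea concrete, here is a minimal sketch of such a selector. The function `select_categorical` and its two parameters are hypothetical and written for this illustration; they only mimic the intent described above and are not the `get_categorical_vars` implementation.

```python
import pandas as pd

def select_categorical(df, dtype="object", max_modalities=None):
    """Hypothetical selector of categorical columns.

    Keeps the columns matching `dtype` (when given) that have at most
    `max_modalities` distinct non-NA values (when given).
    The defaults reproduce the reference criterion `dtype == object`.
    """
    cols = list(df.columns)
    if dtype is not None:
        cols = [c for c in cols if df[c].dtype == dtype]
    if max_modalities is not None:
        cols = [c for c in cols if df[c].nunique() <= max_modalities]
    return cols

# Toy frame: one object column, one binary int column, one float column
toy = pd.DataFrame({
    "CODE_GENDER": pd.Series(["M", "F", "XNA"], dtype=object),
    "FLAG_PHONE": [0, 1, 0],
    "AMT_CREDIT": [1.5, 2.5, 3.5],
})
print(select_categorical(toy))
print(select_categorical(toy, dtype=None, max_modalities=2))
```

Keeping `dtype="object"` as the default preserves the reference behaviour, while `dtype=None` with a modality cap also picks up binary integer flags.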

In [None]:
import pandas as pd

# One-hot encoding for categorical columns with get_dummies
def one_hot_encoder(df, nan_as_category=True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns

### Selecting the categorical variables

In [None]:
from home_credit.load import get_application_train
df = get_application_train()
display(df.head(3))

application_train,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
original_columns = list(df.columns)
categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
display(categorical_columns)

['NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'NAME_TYPE_SUITE',
 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'OCCUPATION_TYPE',
 'WEEKDAY_APPR_PROCESS_START',
 'ORGANIZATION_TYPE',
 'FONDKAPREMONT_MODE',
 'HOUSETYPE_MODE',
 'WALLSMATERIAL_MODE',
 'EMERGENCYSTATE_MODE']

In [None]:
from home_credit.lightgbm_kernel_v2 import get_categorical_vars
display(get_categorical_vars(df))
display(get_categorical_vars(df, dtype=None, max_modalities=2))

['NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'NAME_TYPE_SUITE',
 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'OCCUPATION_TYPE',
 'WEEKDAY_APPR_PROCESS_START',
 'ORGANIZATION_TYPE',
 'FONDKAPREMONT_MODE',
 'HOUSETYPE_MODE',
 'WALLSMATERIAL_MODE',
 'EMERGENCYSTATE_MODE']

['TARGET',
 'NAME_CONTRACT_TYPE',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'FLAG_MOBIL',
 'FLAG_EMP_PHONE',
 'FLAG_WORK_PHONE',
 'FLAG_CONT_MOBILE',
 'FLAG_PHONE',
 'FLAG_EMAIL',
 'REG_REGION_NOT_LIVE_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_CITY_NOT_LIVE_CITY',
 'REG_CITY_NOT_WORK_CITY',
 'LIVE_CITY_NOT_WORK_CITY',
 'EMERGENCYSTATE_MODE',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_8',
 'FLAG_DOCUMENT_9',
 'FLAG_DOCUMENT_10',
 'FLAG_DOCUMENT_11',
 'FLAG_DOCUMENT_12',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_21']

### One-hot encoding with `get_dummies`

See the pandas 2.0 user guide: https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-dummies

In particular, it illustrates combining `get_dummies` with `cut`.

`get_dummies` can produce a dense array (the default) or a sparse one (see https://pandas.pydata.org/docs/reference/api/pandas.arrays.SparseArray.html).
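
As a small illustration of both points, the sketch below bins an hour-of-day series with `cut` before encoding it, then requests the sparse variant. The series, the bin edges and the labels are assumptions chosen for the example.

```python
import pandas as pd

hours = pd.Series([1, 9, 14, 22], name="HOUR")
# Bin the hours into four labelled intervals: [0, 6), [6, 12), [12, 18), [18, 24)
periods = pd.cut(
    hours, bins=[0, 6, 12, 18, 24], right=False,
    labels=["night", "morning", "afternoon", "evening"],
)
dense = pd.get_dummies(periods, prefix="HOUR")                # dense columns (default)
sparse = pd.get_dummies(periods, prefix="HOUR", sparse=True)  # SparseArray-backed columns

print(list(dense.columns))
print(sparse.dtypes)
```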

#### How does it work?

In [None]:
import pandas as pd
a = df.CODE_GENDER
b = pd.get_dummies(a)
c = a.str.get_dummies()
display(pd.concat([a, b, c], axis=1).head(3))

Unnamed: 0,CODE_GENDER,F,M,XNA,F.1,M.1,XNA.1
0,M,False,True,False,0,1,0
1,F,True,False,False,1,0,0
2,M,False,True,False,0,1,0


In [None]:
x = df[["NAME_CONTRACT_TYPE", "CODE_GENDER"]]
y = pd.get_dummies(x, dummy_na=True)
display(pd.concat([x, y], axis=1).head(3))

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,NAME_CONTRACT_TYPE_Cash loans,NAME_CONTRACT_TYPE_Revolving loans,NAME_CONTRACT_TYPE_nan,CODE_GENDER_F,CODE_GENDER_M,CODE_GENDER_XNA,CODE_GENDER_nan
0,Cash loans,M,True,False,False,False,True,False,False
1,Cash loans,F,True,False,False,True,False,False,False
2,Revolving loans,M,False,True,False,False,True,False,False


#### The case of NA values

In [None]:
# Two issues with NA values:
# 1/ their special encoding (such as 'XNA') => convert to a true NA
# 2/ dummy_na=True while there is no NA => a column for nothing
# Solution: generate the columns, then drop the constant ones
display(a.value_counts(dropna=False))
display(y.CODE_GENDER_XNA.value_counts(dropna=False))
display(y.CODE_GENDER_nan.value_counts(dropna=False))

CODE_GENDER
F      202448
M      105059
XNA         4
Name: count, dtype: int64

CODE_GENDER_XNA
False    307507
True          4
Name: count, dtype: int64

CODE_GENDER_nan
False    307511
Name: count, dtype: int64

#### Dropping constant columns (non-NA)

In [None]:
z = y.apply(pd.Series.nunique)
# This more general version is safer,
# but the one above is sufficient in this context
# z = y.apply(lambda s: pd.Series.nunique(s, dropna=False))
print(list(z[z == 1].index))
truc = y.drop(columns=z[z == 1].index)
display(truc)

['NAME_CONTRACT_TYPE_nan', 'CODE_GENDER_nan']


Unnamed: 0,NAME_CONTRACT_TYPE_Cash loans,NAME_CONTRACT_TYPE_Revolving loans,CODE_GENDER_F,CODE_GENDER_M,CODE_GENDER_XNA
0,True,False,False,True,False
1,True,False,True,False,False
2,False,True,False,True,False
3,True,False,True,False,False
4,True,False,False,True,False
...,...,...,...,...,...
307506,True,False,False,True,False
307507,True,False,True,False,False
307508,True,False,True,False,False
307509,True,False,True,False,False
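
To make the difference between the two `nunique` variants concrete, here is a small sketch on a toy frame (the column names are hypothetical). By default `nunique` ignores NaN, so a column holding one value plus missing entries looks constant even though it is not. The output of `get_dummies` contains no NaN, which is why the default is sufficient in the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame, for illustration only.
flags = pd.DataFrame({
    "always_one": [1, 1, 1],
    "one_or_missing": [1, np.nan, 1],
})

default_counts = flags.nunique()             # NaN is ignored
strict_counts = flags.nunique(dropna=False)  # NaN counts as a distinct value

print(default_counts)
print(strict_counts)
```

With the default, both columns would be flagged as constant; with `dropna=False`, only `always_one` would be.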


#### Why `drop_first` is useful

Let us not forget that nearly 20% of the variables are binary.

Do we want to turn them into 40 columns, or 80?

In [None]:
# drop_first: useful, for example, so that a binary variable does not yield two columns
# those two columns would be perfectly anti-correlated, hence redundant
# logically, this is one of the first things a dimensionality reduction would eliminate
# to be verified; keep it as an option that defaults to True (the opposite of the pandas default)
x = df[["NAME_CONTRACT_TYPE", "CODE_GENDER"]]
y = pd.get_dummies(x, dummy_na=True, drop_first=True)
display(pd.concat([x, y], axis=1).head(3))
m = y.memory_usage(deep=True)
print(m)
print(m.sum())

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,NAME_CONTRACT_TYPE_Revolving loans,NAME_CONTRACT_TYPE_nan,CODE_GENDER_M,CODE_GENDER_XNA,CODE_GENDER_nan
0,Cash loans,M,False,False,True,False,False
1,Cash loans,F,False,False,False,False,False
2,Revolving loans,M,True,False,True,False,False


Index                                    128
NAME_CONTRACT_TYPE_Revolving loans    307511
NAME_CONTRACT_TYPE_nan                307511
CODE_GENDER_M                         307511
CODE_GENDER_XNA                       307511
CODE_GENDER_nan                       307511
dtype: int64
1537683


#### The memory footprint question

Should we use bools, int8 integers, a sparse matrix, ...?

In [None]:
import numpy as np
x = df[["NAME_CONTRACT_TYPE", "CODE_GENDER"]]
y = pd.get_dummies(x, dummy_na=True, drop_first=True, dtype=np.int8)
display(pd.concat([x, y], axis=1).head(3))
m = y.memory_usage(deep=True)
print(m)
print(m.sum())

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,NAME_CONTRACT_TYPE_Revolving loans,NAME_CONTRACT_TYPE_nan,CODE_GENDER_M,CODE_GENDER_XNA,CODE_GENDER_nan
0,Cash loans,M,0,0,1,0,0
1,Cash loans,F,0,0,0,0,0
2,Revolving loans,M,1,0,1,0,0


Index                                    128
NAME_CONTRACT_TYPE_Revolving loans    307511
NAME_CONTRACT_TYPE_nan                307511
CODE_GENDER_M                         307511
CODE_GENDER_XNA                       307511
CODE_GENDER_nan                       307511
dtype: int64
1537683


In [None]:
import numpy as np
x = df[["NAME_CONTRACT_TYPE", "CODE_GENDER"]]
y = pd.get_dummies(x, dummy_na=True, drop_first=True, dtype=np.int8, sparse=True)
display(pd.concat([x, y], axis=1).head(3))
m = y.memory_usage(deep=True)
print(m)
print(m.sum())

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,NAME_CONTRACT_TYPE_Revolving loans,NAME_CONTRACT_TYPE_nan,CODE_GENDER_M,CODE_GENDER_XNA,CODE_GENDER_nan
0,Cash loans,M,0,0,1,0,0
1,Cash loans,F,0,0,0,0,0
2,Revolving loans,M,1,0,1,0,0


Index                                    128
NAME_CONTRACT_TYPE_Revolving loans    146395
NAME_CONTRACT_TYPE_nan                     0
CODE_GENDER_M                         525295
CODE_GENDER_XNA                           20
CODE_GENDER_nan                            0
dtype: int64
671838
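
The sparse figures above are consistent with a cost of about 5 bytes per stored non-zero entry (105059 × 5 = 525295 for `CODE_GENDER_M`, 4 × 5 = 20 for `CODE_GENDER_XNA`), presumably a 1-byte `int8` value plus a 4-byte position. Under that reading, a sparse column only beats a dense `int8` column when fewer than roughly one value in five is non-zero. The sketch below checks the trend on synthetic data; the series and its 1% / 99% split are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

n = 100_000
# Synthetic category: 1% "yes", 99% "no" (illustrative data, not the real dataset).
rare = pd.Series(np.where(np.arange(n) % 100 == 0, "yes", "no"))

dense = pd.get_dummies(rare, dtype=np.int8)
sparse = pd.get_dummies(rare, dtype=np.int8, sparse=True)

dense_bytes = dense.memory_usage(index=False, deep=True)
sparse_bytes = sparse.memory_usage(index=False, deep=True)

# The mostly-zero "yes" column shrinks; the mostly-one "no" column grows.
print(pd.DataFrame({"dense": dense_bytes, "sparse": sparse_bytes}))
```

Combined with `drop_first=True`, which may happen to drop the rare level and keep the frequent one, the sparse option is therefore not a guaranteed win for every column.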


#### Why the ill-fitting name `get_dummies`?

Exchange with ChatGPT:

*I was wondering why the pandas function get_dummies, which performs one-hot encoding, is named the way it is. In other words, what history explains a name so far removed from what it does?*

Indeed, the name "get_dummies" of the pandas function can seem odd and far removed from the one-hot encoding it performs. The reason for this name goes back to the origin of the feature, which was first implemented in the R programming language. In R, the equivalent function is called "dummy variable", hence the term "dummies" in the name of the pandas function.

The term "dummy variable" was introduced in linear regression theory to represent categorical variables that cannot be fed directly into a linear regression model. Instead, these categorical variables are replaced by a set of binary variables that indicate the presence or absence of each category. These binary variables are often called "dummy variables", hence the name of the function in R and, by extension, of the pandas function.

In short, the name "get_dummies" of the pandas function is inherited from the name of the equivalent function in the R language, which itself refers to the binary variables that represent categories in a linear regression model.

#### Integrated version of `one_hot_encode_all_cats`:

**NB** One task remains upstream, in the first preprocessing steps: identify the NAs that HomeCredit encodes in its own way and convert them to real NAs.
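
A possible shape for that upstream step is sketched below on a toy column. It assumes that the `XNA` value seen in `CODE_GENDER` is such a custom NA code; the full list of sentinel values per column still has to be established, so this is an illustration rather than the final preprocessing.

```python
import numpy as np
import pandas as pd

# Toy column; "XNA" is the value observed in CODE_GENDER above.
gender = pd.DataFrame({"CODE_GENDER": ["M", "F", "XNA", "F"]})

# Assumption: "XNA" means "not available" and can be mapped to a real NA.
cleaned = gender.replace("XNA", np.nan)

# With a real NA, dummy_na=True now produces a meaningful indicator column.
encoded = pd.get_dummies(cleaned, dummy_na=True)
print(encoded)
```

After the conversion, the `_nan` column carries the information previously held by the `_XNA` column, and it is no longer constant.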

In [None]:
def one_hot_encode_all_cats(
    df,
    columns=None,
    dummy_na=True,
    drop_first=True,
    dtype=np.int8,
    sparse=True
):
    ohe_df = pd.get_dummies(
        df, columns=columns, dummy_na=dummy_na,
        drop_first=drop_first, dtype=dtype, sparse=sparse
    )
    # Drop the constant columns that `dummy_na` may have produced
    const_cols = ohe_df.apply(pd.Series.nunique)
    const_cols = const_cols[const_cols == 1]
    ohe_df.drop(columns=const_cols.index, inplace=True)
    return ohe_df


ohe_df = one_hot_encode_all_cats(df[["NAME_CONTRACT_TYPE", "CODE_GENDER"]])
display(ohe_df)


Unnamed: 0,NAME_CONTRACT_TYPE_Revolving loans,CODE_GENDER_M,CODE_GENDER_XNA
0,0,1,0
1,0,0,0
2,1,1,0
3,0,0,0
4,0,1,0
...,...,...,...
307506,0,1,0
307507,0,0,0
307508,0,0,0
307509,0,0,0


## Derived features: the value of `eval`

One goal of feature engineering is to produce derived features.

In this context, `eval` can improve both code readability and execution performance.

[**pandas user guide: using `eval` to enhance performance**](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#enhancingperf-eval).
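
As a minimal, self-contained sketch of the idea, the snippet below derives several columns in a single multi-line `eval` call. The four rows are made-up values; only the column names follow the installments table. The engine is left at its default, which uses `numexpr` when it is installed.

```python
import pandas as pd

# Made-up rows; only the column names follow the installments table.
payments = pd.DataFrame({
    "AMT_PAYMENT": [100.0, 50.0, 0.0, 40.0],
    "AMT_INSTALMENT": [100.0, 80.0, 40.0, 40.0],
    "DAYS_ENTRY_PAYMENT": [-10.0, -3.0, -7.0, -1.0],
    "DAYS_INSTALMENT": [-12.0, -5.0, -1.0, -1.0],
})

# One assignment per line; without inplace=True a new frame is returned.
derived = payments.eval(
    """
    PAYMENT_PERC = AMT_PAYMENT / AMT_INSTALMENT
    PAYMENT_DIFF = AMT_INSTALMENT - AMT_PAYMENT
    DPD = DAYS_ENTRY_PAYMENT - DAYS_INSTALMENT
    """
)
print(derived)
```

The expressions read like the formulas they implement, without the repeated `df[...]` indexing.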

### How does it work?

```python
    ins['PAYMENT_PERC'] = ins['AMT_PAYMENT'] / ins['AMT_INSTALMENT']
    ins['PAYMENT_DIFF'] = ins['AMT_INSTALMENT'] - ins['AMT_PAYMENT']
```

In [None]:
from home_credit.load import get_installments_payments
df = get_installments_payments()
display(df)

In [None]:
import time

#amt_payment = df.AMT_PAYMENT
#amt_installment = df.AMT_INSTALMENT
t = -time.time()
df['PAYMENT_PERC'] = df['AMT_PAYMENT'] / df['AMT_INSTALMENT']
t += time.time()
print(t)

0.056481361389160156


In [None]:
import time
import pandas as pd

t = -time.time()
#df['PAYMENT_PERC_2'] = df['AMT_PAYMENT'] / df['AMT_INSTALMENT']
df.eval("PAYMENT_PERC_2 = AMT_PAYMENT / AMT_INSTALMENT", inplace=True, engine="python")
t += time.time()
print(t)

0.16107678413391113


In [None]:
import time
import pandas as pd

t = -time.time()
#df['PAYMENT_PERC_2'] = df['AMT_PAYMENT'] / df['AMT_INSTALMENT']
df.eval("PAYMENT_PERC_2 = AMT_PAYMENT / AMT_INSTALMENT", inplace=True, engine="numexpr")
t += time.time()
print(t)

0.13232922554016113


### Pandas `assign`: not a good alternative

In [None]:
import time
import pandas as pd

t = -time.time()
#df['PAYMENT_PERC_2'] = df['AMT_PAYMENT'] / df['AMT_INSTALMENT']
df.assign(PAYMENT_PERC_3=df.AMT_PAYMENT / df.AMT_INSTALMENT)
t += time.time()
print(t)

0.8159232139587402


### All at once

#### As in the kernel

In [None]:
from home_credit.load import _load_installments_payments
import time

df_1 = _load_installments_payments()

t = -time.time()
# Percentage and difference paid in each installment (amount paid and installment value)
df_1['PAYMENT_PERC'] = df_1['AMT_PAYMENT'] / df_1['AMT_INSTALMENT']
df_1['PAYMENT_DIFF'] = df_1['AMT_INSTALMENT'] - df_1['AMT_PAYMENT']
# Days past due and days before due (no negative values)
df_1['DPD'] = df_1['DAYS_ENTRY_PAYMENT'] - df_1['DAYS_INSTALMENT']
df_1['DBD'] = df_1['DAYS_INSTALMENT'] - df_1['DAYS_ENTRY_PAYMENT']
df_1['DPD'] = df_1['DPD'].apply(lambda x: x if x > 0 else 0)
df_1['DBD'] = df_1['DBD'].apply(lambda x: x if x > 0 else 0)
t += time.time()
print(t)

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\installments_payments.pqt
7.973264455795288


#### With `eval`

In [None]:
from home_credit.load import _load_installments_payments
from numpy import where
import time

df_2 = _load_installments_payments()

t = -time.time()
# ✔ Readability gain
# ✔ Performance gain: an order of magnitude
df_2.eval(
    """
    PAYMENT_PERC = AMT_PAYMENT / AMT_INSTALMENT
    PAYMENT_DIFF = AMT_INSTALMENT - AMT_PAYMENT
    DPD = DAYS_ENTRY_PAYMENT - DAYS_INSTALMENT
    DBD = DAYS_INSTALMENT - DAYS_ENTRY_PAYMENT
    DPD = @where(DPD > 0, DPD, 0)
    DBD = @where(DBD > 0, DBD, 0)
    """,
    inplace=True, engine="numexpr"
)
t += time.time()
print(t)

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\installments_payments.pqt
0.6242649555206299
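
Part of this speed-up probably comes from replacing the row-wise `apply(lambda ...)` of the kernel with a vectorized `where`, independently of `eval`. The sketch below, on a made-up series, checks that the two formulations agree, including on NaN (which both map to 0). It also shows that `clip(lower=0)`, another vectorized candidate, is not a drop-in replacement because it preserves NaN.

```python
import numpy as np
import pandas as pd

# Made-up day differences, including a missing value.
days_late = pd.Series([-5.0, 0.0, 3.0, np.nan, 12.0])

# Kernel style: one Python call per row.
via_apply = days_late.apply(lambda x: x if x > 0 else 0)

# Vectorized style: a single array operation (NaN > 0 is False, so NaN becomes 0).
via_where = pd.Series(np.where(days_late > 0, days_late, 0), index=days_late.index)

# clip keeps NaN as NaN, so its result differs on missing values.
via_clip = days_late.clip(lower=0)

print(via_apply.equals(via_where), via_clip.isna().sum())
```

A like-for-like benchmark would time the vectorized form with and without `eval` on the real table.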


#### Compare dfs

In [None]:
# display(df_1)
# display(df_2)
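# `all(df_1 == df_2)` only iterates over the column labels, so it is always True;
# `equals` compares the values and treats NaNs at the same position as equal
print("same result:", df_1.equals(df_2))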

same result: True
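
A caveat when comparing dataframes this way: Python's built-in `all` applied to a dataframe iterates over its column labels, not over its values, so it cannot detect a difference. The toy frames below (illustrative values only) differ in one cell, yet the naive test passes. `DataFrame.equals` compares values and dtypes and treats NaNs at the same position as equal.

```python
import numpy as np
import pandas as pd

# Two toy frames that differ in the last cell of column "b".
left = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
right = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 999.0]})

naive = all(left == right)                 # iterates over the labels "a" and "b"
strict = left.equals(right)                # compares the actual values
same_as_copy = left.equals(left.copy())    # NaNs in matching positions are equal

print(naive, strict, same_as_copy)
```

When a detailed report of the differences is needed, `pd.testing.assert_frame_equal` is another option.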


## `del` and `gc.collect`

**TODO** give a watertight demonstration of the claims below:

In most functions, the working dataframe that is loaded and then modified is explicitly removed from memory with `del`, followed by an explicit call to the *garbage collector* through `gc.collect()`.

One can see in this the mark of a Java programmer who moved to Python.

However, these calls are unnecessary:
1. when the function returns, the local variable is released automatically (implicit `del`).
2. CPython frees an object as soon as its last reference disappears, and its cyclic garbage collector runs on its own, so no explicit instruction is needed before, for example, the next preprocessing step.

We therefore decide not to keep these instructions.
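
A first step towards that demonstration could look like the sketch below. It relies on CPython's reference counting and uses a hypothetical `Table` class as a stand-in for a large working dataframe. A weak reference observes the object without keeping it alive, and the cyclic collector is disabled during the check to show that it plays no part.

```python
import gc
import weakref


class Table:
    """Hypothetical stand-in for a large working dataframe."""

    def __init__(self, n_bytes):
        self.data = bytearray(n_bytes)


def preprocess(probes):
    table = Table(1_000_000)             # local working object
    probes.append(weakref.ref(table))    # weak reference: does not keep it alive
    return len(table.data)               # only a small result leaves the function


gc.disable()  # rule out the cyclic collector for the duration of the check
try:
    probes = []
    size = preprocess(probes)
    freed_on_return = probes[0]() is None
finally:
    gc.enable()

print(size, freed_on_return)
```

The argument only holds when no other reference survives the call: if the function returns the dataframe itself, or stores it somewhere, neither the implicit release nor an explicit `del` frees it.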

## Groupby

