# Study and improvement of the Kaggle kernel [“**LightGBM with Simple Features**”](https://www.kaggle.com/code/jsaguiar/lightgbm-with-simple-features)

The goal is to understand the processing performed by this kernel, then to improve it (functionally and technically).

The work goes back and forth between the first part and the appendix:
* The first part is primarily where the reference solution is *reverse engineered*; it follows the functional order of the solution steps.
* The appendix focuses on each key processing step and is therefore more technical.
* The first part then applies the conclusions and improvements justified in the appendix, tests for functional non-regression, and compares performance to highlight the gains over the original solution.

## To understand:

In [None]:
import time
from contextlib import contextmanager

@contextmanager
def timer(title):
    t0 = time.time()
    yield
    print(f"{title} - done in {time.time() - t0:.0f}s")

# Reworking and improving the functionality

## ✔ `APPLICATION`

### Loading the tables

**Warning:** To compare the results of the two loading modes, keep in mind that `get_application` calls `application.sort_index(inplace=True)`.

For the comparison, the v2 index must therefore be realigned with the v1 index: this is the role of the function `pepper.pd_utils.align_df2_on_df1`.
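The implementation of `align_df2_on_df1` is not shown in this notebook. As a rough sketch of what such an alignment could look like, the hypothetical helper `align_on_key` below reorders the rows of a second dataframe so that they follow the key order and the index of the first one. It assumes the key is unique in both frames and that both contain the same keys.

```python
import pandas as pd


def align_on_key(key, df1, df2):
    """Hypothetical stand-in for an alignment helper (not the pepper code).

    Reorders the rows of df2 to follow the order of `key` in df1,
    then reuses df1's index. Assumes `key` is unique in both frames.
    """
    # A left merge preserves the key order of the left frame.
    order = df1[[key]].reset_index(drop=True)
    aligned = order.merge(df2, on=key, how="left")
    aligned.index = df1.index
    # Restore the original column order of df2.
    return aligned[list(df2.columns)]


# v1 keeps the file order, v2 is sorted by key with a different index.
df1 = pd.DataFrame({"SK_ID_CURR": [3, 1, 2], "X": [30, 10, 20]})
df2 = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "X": [10, 20, 30]}, index=[10, 11, 12])

aligned = align_on_key("SK_ID_CURR", df1, df2)
print(aligned)
```

Once aligned this way, the two frames can be compared cell by cell.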

Original version:

In [1]:
import pandas as pd
from home_credit.lightgbm_kernel_v2 import free
def load_data_v1(nrows=None):
    # Read data and merge
    data = pd.read_csv('../../dataset/csv/application_train.csv', nrows=nrows)
    test_data = pd.read_csv('../../dataset/csv/application_test.csv', nrows=nrows)
    # print(f"Train samples: {len(df)}, test samples: {len(test_df)}")
    # NB: `append` doesn't exist in current Pandas 2.0, replaced by `concat`
    #     `append` has been deprecated since version 1.3.0 of Pandas (June 2021)
    # NB2: A reset_index() statement in older code (< 1.3.0) is equivalent to reset_index(drop=True)
    # in modern code, due to the change in the default value of the drop parameter.
    # data = data.append(test_data).reset_index()
    data = pd.concat([data, test_data], axis=0)
    data = data.reset_index(drop=True)
    free(test_data)
    return data

New version:

The type and values of `TARGET` are adapted for compatibility between the two versions.

In [2]:
from home_credit.lightgbm_kernel_v2 import load_data
import numpy as np
def load_data_v2(nrows=None):
    data = load_data("application", nrows)
    data.TARGET = data.TARGET.astype(object).replace(-1, np.nan)
    return data

Comparison:

In [3]:
from pepper.pd_utils import align_df2_on_df1
nrows = None  # 1_000
pk_name = "SK_ID_CURR"
data_v1 = load_data_v1(nrows)
data_v2 = align_df2_on_df1(pk_name, data_v1, load_data_v2(nrows))
# display(data_v1)
# display(data_v2)

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\application_train.pqt
load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\application_test.pqt


In [4]:
from pepper.pd_utils import df_neq
is_diff = df_neq(data_v1, data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [5]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(data_v1, data_v2)

dtypes are aligned
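The comparisons in this notebook count differing cells with `df_neq`, whose source is not reproduced here. A plain `!=` would report every pair of missing values as a difference, since `NaN != NaN`. The sketch below shows one way a NaN-aware inequality could be written; `nan_aware_neq` is a hypothetical name, and the real helper may behave differently.

```python
import numpy as np
import pandas as pd


def nan_aware_neq(df1, df2):
    """Element-wise inequality that treats two missing values as equal.

    Hypothetical sketch; assumes both frames share the same index and columns.
    """
    both_missing = df1.isna() & df2.isna()
    return df1.ne(df2) & ~both_missing


left = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})
right = pd.DataFrame({"a": [1.0, np.nan, 4.0], "b": ["x", None, "y"]})

is_diff = nan_aware_neq(left, right)
n_diffs = int(is_diff.sum().sum())
print("n_diffs:", n_diffs)
```

Only the last row differs in each column; the missing values in the middle row are not counted.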


### Cleaning the categories

Remove or correct outliers and missing values in the categorical variables.

In [6]:
def clean_cats_v1(data):
    # Optional: Remove 4 applications with XNA CODE_GENDER (train set)
    # NB > copy() was added to avoid subsequent errors or warnings caused by working on a view
    return data[data['CODE_GENDER'] != 'XNA'].copy()

In [7]:
def clean_cats_v2(data):
    # Optional: Remove 4 applications with XNA CODE_GENDER (train set)
    # NB > Here, there's no need for a copy, as we are working directly in-place
    data.drop(index=data.index[data.CODE_GENDER == "XNA"], inplace=True)
    # For compatibility with V1, but not necessary, it's in-place
    return data

In [8]:
clean_data_v1 = clean_cats_v1(data_v1)
clean_data_v2 = clean_cats_v2(data_v2)

In [9]:
from pepper.pd_utils import df_neq
is_diff = df_neq(clean_data_v1, clean_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0
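The two cleaning variants above differ only in how the rows are removed: v1 filters with a boolean mask and copies the result, v2 drops the rows in place. A small sketch on made-up data (the values below are illustrative, not taken from the dataset) shows that both routes lead to the same frame:

```python
import pandas as pd

raw = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "CODE_GENDER": ["M", "XNA", "F", "F"],
})

# v1 style: boolean filtering returns a new object; copy() makes it independent.
filtered = raw[raw["CODE_GENDER"] != "XNA"].copy()

# v2 style: drop the offending rows in place (done on a copy here,
# only so that `raw` stays available for the comparison).
in_place = raw.copy()
in_place.drop(index=in_place.index[in_place["CODE_GENDER"] == "XNA"], inplace=True)

print(filtered.equals(in_place))
```

In both cases the original index labels of the remaining rows are preserved, which is why the index of the cleaned frames has gaps.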


### Encoding the binary categories

In [10]:
def encode_bin_cats_v1(data):    
    # Categorical features with Binary encode (0 or 1; two categories)
    for bin_feature in ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']:
        # Unused `uniques` has been replaced by `_`
        data[bin_feature], _ = pd.factorize(data[bin_feature])

In [11]:
def encode_bin_cats_v2(data):    
    # Categorical features with Binary encode (0 or 1; two categories)
    bin_vars = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
    for bin_var in bin_vars:
        data[bin_var] = data[bin_var].astype("category").cat.codes

In [12]:
encode_bin_cats_v1(clean_data_v1)
encode_bin_cats_v2(clean_data_v2)

In [13]:
from pepper.pd_utils import df_neq
is_diff = df_neq(clean_data_v1, clean_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 712502


In [14]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(clean_data_v1, clean_data_v2)

dtypes diffs:


CODE_GENDER        int64
FLAG_OWN_CAR       int64
FLAG_OWN_REALTY    int64
dtype: object

application
CODE_GENDER        int8
FLAG_OWN_CAR       int8
FLAG_OWN_REALTY    int8
dtype: object

Nothing alarming: the two encoding techniques give the same result up to a permutation of the two labels.
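The permutation comes from how each technique numbers the labels: `pd.factorize` assigns codes in order of first appearance, whereas `.astype("category").cat.codes` follows the sorted order of the categories. A minimal sketch on a toy series (illustrative values only):

```python
import pandas as pd

gender = pd.Series(["M", "F", "F", "M", "F"])

# factorize: codes follow the order of first appearance ("M" is seen first).
by_appearance, uniques = pd.factorize(gender)

# cat.codes: codes follow the sorted categories ("F" sorts before "M").
by_sorted = gender.astype("category").cat.codes

print(list(uniques), list(by_appearance))
print(list(gender.astype("category").cat.categories), list(by_sorted))
```

With two labels, the codes are therefore either identical or mirrored, and `1 - code` maps one convention onto the other.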

In any case, we will not keep such a specific approach for binary categorical variables; instead, we will use the same systematic one-hot encoding as for all categorical variables (see `one_hot_encode_all_cats`, to be used together with `get_categorical_vars`).

Indeed, there is no point in keeping two columns for a binary variable, since they are two perfectly anticorrelated variables (see the `drop_first` section in the appendix).
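To illustrate the redundancy on a toy binary flag (made-up values), the sketch below one-hot encodes it with `pandas.get_dummies`, checks that the two resulting columns are perfectly anticorrelated, and shows that `drop_first=True` keeps a single column:

```python
import numpy as np
import pandas as pd

flag = pd.Series(["Y", "N", "Y", "N", "N"], name="FLAG_OWN_CAR")

# Full one-hot encoding: one indicator column per label.
full = pd.get_dummies(flag, prefix="FLAG_OWN_CAR", dtype=int)

# The two indicators always sum to 1, so their correlation is -1.
corr = np.corrcoef(full["FLAG_OWN_CAR_N"], full["FLAG_OWN_CAR_Y"])[0, 1]

# drop_first=True removes the first indicator and keeps the other.
reduced = pd.get_dummies(flag, prefix="FLAG_OWN_CAR", dtype=int, drop_first=True)

print(round(corr, 6), list(reduced.columns))
```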

In [15]:
bin_vars = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
display(pd.concat([clean_data_v1[bin_vars], clean_data_v2[bin_vars]], axis=1).head(3))

index,CODE_GENDER (v1),FLAG_OWN_CAR (v1),FLAG_OWN_REALTY (v1),CODE_GENDER (v2),FLAG_OWN_CAR (v2),FLAG_OWN_REALTY (v2)
0,0,0,0,1,0,1
1,1,0,1,0,0,0
2,0,1,0,1,1,1


To put everything back in order:

In [16]:
from pepper.pd_utils import df_neq
clean_data_v2.CODE_GENDER = 1 - clean_data_v2.CODE_GENDER
clean_data_v2.FLAG_OWN_REALTY = 1 - clean_data_v2.FLAG_OWN_REALTY
is_diff = df_neq(clean_data_v1, clean_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [17]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(clean_data_v1, clean_data_v2)

dtypes diffs:


CODE_GENDER        int64
FLAG_OWN_CAR       int64
FLAG_OWN_REALTY    int64
dtype: object

application
CODE_GENDER        int8
FLAG_OWN_CAR       int8
FLAG_OWN_REALTY    int8
dtype: object

### One-hot encoding of the n-ary categories

In [18]:
from home_credit.lightgbm_kernel import one_hot_encoder
from home_credit.lightgbm_kernel_v2 import hot_encode_cats

ohe_data_v1, catvar_names_v1 = one_hot_encoder(clean_data_v1)
ohe_data_v2, catvar_names_v2 = hot_encode_cats(clean_data_v2, discard_constants=False)

In [19]:
display(ohe_data_v1.shape)
display(ohe_data_v2.shape)

(356251, 255)

(356251, 255)

The comparison is not strictly necessary, but let's be careful:

In [20]:
display(ohe_data_v1.columns)
display(ohe_data_v2.columns)
display(ohe_data_v1.columns.difference(ohe_data_v2.columns))

Index(['SK_ID_CURR', 'TARGET', 'CODE_GENDER', 'FLAG_OWN_CAR',
       'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       ...
       'WALLSMATERIAL_MODE_Mixed', 'WALLSMATERIAL_MODE_Monolithic',
       'WALLSMATERIAL_MODE_Others', 'WALLSMATERIAL_MODE_Panel',
       'WALLSMATERIAL_MODE_Stone, brick', 'WALLSMATERIAL_MODE_Wooden',
       'WALLSMATERIAL_MODE_nan', 'EMERGENCYSTATE_MODE_No',
       'EMERGENCYSTATE_MODE_Yes', 'EMERGENCYSTATE_MODE_nan'],
      dtype='object', length=255)

Index(['SK_ID_CURR', 'TARGET', 'CODE_GENDER', 'FLAG_OWN_CAR',
       'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       ...
       'WALLSMATERIAL_MODE_Mixed', 'WALLSMATERIAL_MODE_Monolithic',
       'WALLSMATERIAL_MODE_Others', 'WALLSMATERIAL_MODE_Panel',
       'WALLSMATERIAL_MODE_Stone, brick', 'WALLSMATERIAL_MODE_Wooden',
       'WALLSMATERIAL_MODE_nan', 'EMERGENCYSTATE_MODE_No',
       'EMERGENCYSTATE_MODE_Yes', 'EMERGENCYSTATE_MODE_nan'],
      dtype='object', length=255)

Index([], dtype='object')

Note that, unlike the encoding of the binary categories, the two techniques used here produce the same labels. This is expected, since both rely on `pandas.get_dummies`:

In [21]:
from pepper.pd_utils import df_neq
is_diff = df_neq(ohe_data_v1, ohe_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [22]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(ohe_data_v1, ohe_data_v2)
# TODO: build a dataframe instead, which is better for comparing

dtypes diffs:


CODE_GENDER        int64
FLAG_OWN_CAR       int64
FLAG_OWN_REALTY    int64
dtype: object

CODE_GENDER        int8
FLAG_OWN_CAR       int8
FLAG_OWN_REALTY    int8
dtype: object

### Cleaning the numeric variables

Remove or correct outliers and missing values in the numeric variables.

Let's check that 365243 appears only in `DAYS_EMPLOYED`:

In [23]:
cols = ohe_data_v2.columns
days_cols = cols[cols.str.match("DAYS_")]
days_data = ohe_data_v2[days_cols]
display(days_data.max())
# display(days_data[(days_data == 365243).any(axis=1)])

DAYS_BIRTH                 -7338.0
DAYS_EMPLOYED             365243.0
DAYS_REGISTRATION              0.0
DAYS_ID_PUBLISH                0.0
DAYS_LAST_PHONE_CHANGE         0.0
dtype: float64

In [24]:
from home_credit.lightgbm_kernel_v2 import clean_A_nums_v1, clean_A_nums_v2
clean_A_nums_v1(ohe_data_v1)
clean_A_nums_v2(ohe_data_v2)

In [25]:
from pepper.pd_utils import df_neq
is_diff = df_neq(ohe_data_v1, ohe_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


### Creating additional derived features

In [26]:
from home_credit.lightgbm_kernel_v2 import (
    add_A_derived_features_v1,
    add_A_derived_features as add_A_derived_features_v2
)

add_A_derived_features_v1(ohe_data_v1)
add_A_derived_features_v2(ohe_data_v2)

In [27]:
from pepper.pd_utils import df_neq
is_diff = df_neq(ohe_data_v1, ohe_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


### Integrated version

Performance comparison at identical functional scope.

Note that v2 is at a serious disadvantage here because of the extra work needed to align it with v1 (v2 loads the v1 dataframe in order to permute its own rows as part of the alignment).
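As a rough sketch of how such a timing comparison could be collected programmatically, the variant of the `timer` context manager below stores elapsed times in a dictionary instead of printing them. The names `stopwatch`, `pipeline_v1` and `pipeline_v2` are hypothetical placeholders, and the two pipelines are dummy computations rather than the real `application_train_test_v1` and `application_train_test_v2`.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Record the elapsed wall-clock time of the enclosed block in `results`."""
    t0 = time.perf_counter()
    yield
    results[label] = time.perf_counter() - t0


def pipeline_v1():
    # Placeholder workload standing in for the original pipeline.
    return sum(i * i for i in range(100_000))


def pipeline_v2():
    # Placeholder workload standing in for the reworked pipeline.
    return sum(i * i for i in range(100_000))


timings = {}
with stopwatch("v1", timings):
    out_v1 = pipeline_v1()
with stopwatch("v2", timings):
    out_v2 = pipeline_v2()

for label, seconds in timings.items():
    print(f"{label} - done in {seconds:.3f}s")
```

Any alignment step needed only for the comparison would ideally sit outside the timed block, so that it does not penalize v2.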

In [28]:
from home_credit.lightgbm_kernel_v2 import application_train_test_v1
data_v1 = application_train_test_v1()

In [29]:
from home_credit.lightgbm_kernel_v2 import application_train_test_v2
data_v2 = application_train_test_v2()

In [30]:
from pepper.pd_utils import df_neq
is_diff = df_neq(data_v1, data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [31]:
data_v1.info()
data_v2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 356251 entries, 0 to 356254
Columns: 247 entries, SK_ID_CURR to PAYMENT_RATE
dtypes: bool(133), float64(72), int64(42)
memory usage: 357.8 MB
<class 'pandas.core.frame.DataFrame'>
Index: 356251 entries, 0 to 356254
Columns: 247 entries, SK_ID_CURR to PAYMENT_RATE
dtypes: bool(133), float64(72), int64(39), int8(3)
memory usage: 350.6 MB


### Freestyle version

## ✔ `BUREAU|BUREAU_BALANCE`

### Loading the tables

A design error was identified in the original version, concerning the use of the `nrows` argument.

In [1]:
from home_credit.lightgbm_kernel_v2 import load_B_tables_v1
data_v1, adj_data_v1 = load_B_tables_v1()

In [2]:
from home_credit.lightgbm_kernel_v2 import load_B_tables_v2
data_v2, adj_data_v2 = load_B_tables_v2()

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\bureau.pqt
load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\bureau_balance.pqt


In [3]:
from pepper.pd_utils import df_neq
is_diff_bur = df_neq(data_v1, data_v2)
print("n_diffs_bur:", is_diff_bur.sum().sum())

n_diffs_bur: 0


In [4]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(data_v1, data_v2)

dtypes are aligned


In [5]:
from pepper.pd_utils import df_neq
is_diff_bur = df_neq(adj_data_v1, adj_data_v2)
print("n_diffs_bur:", is_diff_bur.sum().sum())

n_diffs_bur: 0


In [6]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(adj_data_v1, adj_data_v2)

dtypes are aligned


### One-hot encoding of the categories

In [7]:
from home_credit.lightgbm_kernel import one_hot_encoder
from home_credit.lightgbm_kernel_v2 import hot_encode_cats

ohe_data_v1, catvar_names_v1 = one_hot_encoder(data_v1)
ohe_data_v2, catvar_names_v2 = hot_encode_cats(data_v2, discard_constants=False)

ohe_adj_data_v1, adj_catvar_names_v1 = one_hot_encoder(adj_data_v1)
ohe_adj_data_v2, adj_catvar_names_v2 = hot_encode_cats(adj_data_v2, discard_constants=False)

If `discard_constants` is left at its default value `True`, a difference of 3 columns appears: `CREDIT_ACTIVE_nan`, `CREDIT_CURRENCY_nan`, `CREDIT_TYPE_nan`. These are constant columns, hence of no interest. Our version removes such constants, which can be generated by `dummy_na=True`.
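The source of `hot_encode_cats` is not shown here. The sketch below is only an assumption about the general idea: one-hot encode with `dummy_na=True`, then optionally drop the generated columns that turn out to be constant. The function name `one_hot_sketch` and the toy columns are hypothetical.

```python
import pandas as pd


def one_hot_sketch(df, cat_cols, discard_constants=True):
    """Hypothetical sketch, not the actual hot_encode_cats implementation."""
    out = pd.get_dummies(df, columns=cat_cols, dummy_na=True)
    new_cols = [c for c in out.columns if c not in df.columns]
    if discard_constants:
        # A `_nan` indicator is constant when the source column has no missing value.
        constant = [c for c in new_cols if out[c].nunique(dropna=False) <= 1]
        out = out.drop(columns=constant)
        new_cols = [c for c in new_cols if c not in constant]
    return out, new_cols


df = pd.DataFrame({
    "ID": [1, 2, 3],
    "KIND": ["a", "b", "a"],   # no missing value: KIND_nan will be constant
    "ZONE": ["x", None, "y"],  # one missing value: ZONE_nan carries information
})

kept, kept_cols = one_hot_sketch(df, ["KIND", "ZONE"], discard_constants=True)
full, full_cols = one_hot_sketch(df, ["KIND", "ZONE"], discard_constants=False)
print(sorted(set(full_cols) - set(kept_cols)))
```

On this toy input the only column that disappears is `KIND_nan`, which mirrors the kind of column difference described above.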

In [8]:
display(ohe_data_v1.shape)
display(ohe_data_v2.shape)
display(ohe_adj_data_v1.shape)
display(ohe_adj_data_v2.shape)

(1716428, 40)

(1716428, 40)

(27299925, 11)

(27299925, 11)

In [9]:
import pandas as pd
cols_diff = ohe_data_v1.columns.difference(ohe_data_v2.columns)
display(cols_diff)
display(pd.DataFrame({
    col : ohe_data_v1[col].value_counts()
    for col in cols_diff
}).T)

Index([], dtype='object')

In [10]:
import pandas as pd
cols_diff = ohe_adj_data_v1.columns.difference(ohe_adj_data_v2.columns)
display(cols_diff)
display(pd.DataFrame({
    col : ohe_adj_data_v1[col].value_counts()
    for col in cols_diff
}).T)

Index([], dtype='object')

In [11]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(ohe_data_v1, ohe_data_v2)

dtypes are aligned


### Aggregation

In [12]:
from home_credit.lightgbm_kernel_v2 import group_BB_by_bur_and_join_to_B_v1
data_v1 = group_BB_by_bur_and_join_to_B_v1(ohe_data_v1, ohe_adj_data_v1, adj_catvar_names_v1)

In [13]:
from home_credit.lightgbm_kernel_v2 import group_BB_by_bur_and_join_to_B_v2
data_v2 = group_BB_by_bur_and_join_to_B_v2(ohe_data_v2, ohe_adj_data_v2, adj_catvar_names_v2)

In [14]:
from pepper.pd_utils import df_neq
is_diff = df_neq(data_v1, data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [15]:
from home_credit.lightgbm_kernel_v2 import group_B_by_curr_v1
agg_data_v1 = group_B_by_curr_v1(data_v1, catvar_names_v1, adj_catvar_names_v1)

In [17]:
from home_credit.lightgbm_kernel_v2 import group_B_by_curr_v2
agg_data_v2 = group_B_by_curr_v2(data_v2, catvar_names_v2, adj_catvar_names_v2)

In [18]:
from pepper.pd_utils import df_neq
is_diff = df_neq(agg_data_v1, agg_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


### Integrated version

In [19]:
from home_credit.lightgbm_kernel_v2 import bureau_and_balance_v1
agg_data_v1 = bureau_and_balance_v1()

In [20]:
from home_credit.lightgbm_kernel_v2 import bureau_and_balance_v2
agg_data_v2 = bureau_and_balance_v2()

In [21]:
from pepper.pd_utils import df_neq
is_diff = df_neq(agg_data_v1, agg_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [22]:
agg_data_v1.info()
agg_data_v2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 305811 entries, 100001 to 456255
Columns: 112 entries, BURO_DAYS_CREDIT_MIN to CLOSED_MONTHS_BALANCE_SIZE_SUM
dtypes: float64(108), int64(4)
memory usage: 263.6 MB
<class 'pandas.core.frame.DataFrame'>
Index: 305811 entries, 100001 to 456255
Columns: 112 entries, BURO_DAYS_CREDIT_MIN to CLOSED_MONTHS_BALANCE_SIZE_SUM
dtypes: float64(108), int64(4)
memory usage: 263.6 MB


### Freestyle version

## ✔ `PREVIOUS_APPLICATION`

### Loading the tables

In [37]:
from home_credit.lightgbm_kernel_v2 import load_PA_table_v1
data_v1 = load_PA_table_v1()

In [38]:
from home_credit.lightgbm_kernel_v2 import load_PA_table_v2
data_v2 = load_PA_table_v2()

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\previous_application.pqt


In [39]:
from pepper.pd_utils import df_neq
is_diff = df_neq(data_v1, data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [40]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(data_v1, data_v2)

dtypes are aligned


### One-hot encoding of the n-ary categories

In [41]:
from home_credit.lightgbm_kernel import one_hot_encoder
from home_credit.lightgbm_kernel_v2 import hot_encode_cats

ohe_data_v1, catvar_names_v1 = one_hot_encoder(data_v1)
ohe_data_v2, catvar_names_v2 = hot_encode_cats(data_v2, discard_constants=False)

If `discard_constants` is left at its default value `True`, a difference of 14 generated `_nan` columns appears. These are constant columns, hence of no interest. Our version removes such constants, which can be generated by `dummy_na=True`.

In [42]:
display(ohe_data_v1.shape)
display(ohe_data_v2.shape)

(1670214, 180)

(1670214, 180)

In [43]:
cols_diff = ohe_data_v1.columns.difference(ohe_data_v2.columns)
display(cols_diff)

Index([], dtype='object')

In [44]:
import pandas as pd
display(pd.DataFrame({
    col : ohe_data_v1[col].value_counts()
    for col in cols_diff
}).T)

### Cleaning the numeric variables

Remove or correct outliers and missing values in the numeric variables.

The value 365243 appears in every `DAYS` variable except `DAYS_DECISION`.

In our approach, the correction is applied only to the `DAYS` columns rather than to the whole dataframe, because we discovered the hard way that some `SK_ID` values are equal to this number.
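The pitfall can be reproduced on a toy frame (made-up values). The sketch below contrasts a dataframe-wide replacement, which also wipes a legitimate identifier, with a replacement restricted to the `DAYS_` columns; `replace_days_sentinel` is a hypothetical name, not the actual `clean_PA_nums_v2`.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # placeholder value used in the DAYS_* columns


def replace_days_sentinel(df):
    """Replace the sentinel by NaN in the DAYS_* columns only (hypothetical sketch)."""
    for col in df.columns:
        if col.startswith("DAYS_"):
            df[col] = df[col].replace(SENTINEL, np.nan)
    return df


df = pd.DataFrame({
    "SK_ID_PREV": [365243, 1000001],   # an identifier that happens to equal the sentinel
    "DAYS_DECISION": [-5.0, -7.0],
    "DAYS_FIRST_DUE": [365243.0, -30.0],
})

# Dataframe-wide replacement: the identifier is destroyed as well.
naive = df.replace(SENTINEL, np.nan)

# Column-restricted replacement: the identifier is left untouched.
cleaned = replace_days_sentinel(df.copy())

print(naive["SK_ID_PREV"].isna().sum(), cleaned["SK_ID_PREV"].isna().sum())
```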

In [45]:
cols = ohe_data_v2.columns
days_cols = cols[cols.str.match("DAYS_")]
days_data = ohe_data_v2[days_cols]
display(days_data.max())
# display(days_data[(days_data == 365243).any(axis=1)])

DAYS_DECISION                    -1.0
DAYS_FIRST_DRAWING           365243.0
DAYS_FIRST_DUE               365243.0
DAYS_LAST_DUE_1ST_VERSION    365243.0
DAYS_LAST_DUE                365243.0
DAYS_TERMINATION             365243.0
dtype: float64

In [46]:
from home_credit.lightgbm_kernel_v2 import clean_PA_nums_v1, clean_PA_nums_v2
clean_PA_nums_v1(ohe_data_v1)
clean_PA_nums_v2(ohe_data_v2)

In [47]:
from pepper.pd_utils import df_neq
is_diff = df_neq(ohe_data_v1, ohe_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


### Creating additional derived features

In [48]:
from home_credit.lightgbm_kernel_v2 import add_PA_derived_features_v1
add_PA_derived_features_v1(ohe_data_v1)

In [49]:
from home_credit.lightgbm_kernel_v2 import add_PA_derived_features as add_PA_derived_features_v2
add_PA_derived_features_v2(ohe_data_v2)

In [50]:
from pepper.pd_utils import df_neq
is_diff = df_neq(ohe_data_v1, ohe_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [51]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(ohe_data_v1, ohe_data_v2)

dtypes diffs:


NAME_CONTRACT_TYPE_Cash loans                      bool
NAME_CONTRACT_TYPE_Consumer loans                  bool
NAME_CONTRACT_TYPE_Revolving loans                 bool
NAME_CONTRACT_TYPE_XNA                             bool
NAME_CONTRACT_TYPE_nan                             bool
                                                   ... 
PRODUCT_COMBINATION_POS mobile with interest       bool
PRODUCT_COMBINATION_POS mobile without interest    bool
PRODUCT_COMBINATION_POS other with interest        bool
PRODUCT_COMBINATION_POS others without interest    bool
PRODUCT_COMBINATION_nan                            bool
Length: 159, dtype: object

NAME_CONTRACT_TYPE_Cash loans                      int8
NAME_CONTRACT_TYPE_Consumer loans                  int8
NAME_CONTRACT_TYPE_Revolving loans                 int8
NAME_CONTRACT_TYPE_XNA                             int8
NAME_CONTRACT_TYPE_nan                             int8
                                                   ... 
PRODUCT_COMBINATION_POS mobile with interest       int8
PRODUCT_COMBINATION_POS mobile without interest    int8
PRODUCT_COMBINATION_POS other with interest        int8
PRODUCT_COMBINATION_POS others without interest    int8
PRODUCT_COMBINATION_nan                            int8
Length: 159, dtype: object

### Aggregation

In [52]:
from home_credit.lightgbm_kernel_v2 import group_PA_by_curr_v1
agg_data_v1 = group_PA_by_curr_v1(ohe_data_v1, catvar_names_v1)

In [53]:
from home_credit.lightgbm_kernel_v2 import group_PA_by_curr_v2
agg_data_v2 = group_PA_by_curr_v2(ohe_data_v2, catvar_names_v2)

In [54]:
from pepper.pd_utils import df_neq
is_diff = df_neq(agg_data_v1, agg_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


### Integrated version

In [55]:
from home_credit.lightgbm_kernel_v2 import previous_application_v1
agg_data_v1 = previous_application_v1()

In [56]:
from home_credit.lightgbm_kernel_v2 import previous_application_v2
agg_data_v2 = previous_application_v2()

In [57]:
from pepper.pd_utils import df_neq
is_diff = df_neq(agg_data_v1, agg_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [58]:
agg_data_v1.info()
agg_data_v2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 338857 entries, 100001 to 456255
Columns: 233 entries, PREV_AMT_ANNUITY_MIN to REFUSED_CNT_PAYMENT_SUM
dtypes: float64(229), int64(4)
memory usage: 605.0 MB
<class 'pandas.core.frame.DataFrame'>
Index: 338857 entries, 100001 to 456255
Columns: 233 entries, PREV_AMT_ANNUITY_MIN to REFUSED_CNT_PAYMENT_SUM
dtypes: float64(229), int64(4)
memory usage: 605.0 MB


### Freestyle version

## ✔ `POS_CASH_BALANCE`

### Loading the tables

In [23]:
from home_credit.lightgbm_kernel_v2 import load_PCB_table_v1
data_v1 = load_PCB_table_v1()

In [24]:
from home_credit.lightgbm_kernel_v2 import load_PCB_table_v2
data_v2 = load_PCB_table_v2()

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\POS_CASH_balance.pqt


In [25]:
from pepper.pd_utils import df_neq
is_diff = df_neq(data_v1, data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [26]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(data_v1, data_v2)

dtypes are aligned


### One-hot encoding of the n-ary categories

In [27]:
from home_credit.lightgbm_kernel import one_hot_encoder
from home_credit.lightgbm_kernel_v2 import hot_encode_cats

ohe_data_v1, catvar_names_v1 = one_hot_encoder(data_v1)
ohe_data_v2, catvar_names_v2 = hot_encode_cats(data_v2, discard_constants=False)

If `discard_constants` is left at its default value `True`, our version removes any constant `_nan` columns that `dummy_na=True` might generate, which would show up as a column difference with v1 (for `bureau`, these were `CREDIT_ACTIVE_nan`, `CREDIT_CURRENCY_nan` and `CREDIT_TYPE_nan`). Such constant columns are of no interest.

In [12]:
display(ohe_data_v1.shape)
display(ohe_data_v2.shape)

(3840312, 30)

(3840312, 30)

In [29]:
cols_diff = ohe_data_v1.columns.difference(ohe_data_v2.columns)
display(cols_diff)

Index([], dtype='object')

In [30]:
import pandas as pd
display(pd.DataFrame({
    col : ohe_data_v1[col].value_counts()
    for col in cols_diff
}).T)

In [31]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(ohe_data_v1, ohe_data_v2)

dtypes are aligned


### Aggregation

In [32]:
from home_credit.lightgbm_kernel_v2 import group_PCB_by_curr_v1
agg_data_v1 = group_PCB_by_curr_v1(ohe_data_v1, catvar_names_v1)

In [33]:
from home_credit.lightgbm_kernel_v2 import group_PCB_by_curr_v2
agg_data_v2 = group_PCB_by_curr_v2(ohe_data_v2, catvar_names_v2)

In [34]:
from pepper.pd_utils import df_neq
is_diff = df_neq(agg_data_v1, agg_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


### Integrated version

In [1]:
from home_credit.lightgbm_kernel_v2 import pos_cash_balance_v1
agg_data_v1 = pos_cash_balance_v1()

In [2]:
from home_credit.lightgbm_kernel_v2 import pos_cash_balance_v2
agg_data_v2 = pos_cash_balance_v2()

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\POS_CASH_balance.pqt


In [3]:
from pepper.pd_utils import df_neq
is_diff = df_neq(agg_data_v1, agg_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [4]:
agg_data_v1.info()
agg_data_v2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 337252 entries, 100001 to 456255
Data columns (total 17 columns):
 #   Column                                               Non-Null Count   Dtype  
---  ------                                               --------------   -----  
 0   POS_MONTHS_BALANCE_MAX                               337252 non-null  int64  
 1   POS_MONTHS_BALANCE_MEAN                              337252 non-null  float64
 2   POS_MONTHS_BALANCE_SIZE                              337252 non-null  int64  
 3   POS_SK_DPD_MAX                                       337252 non-null  int64  
 4   POS_SK_DPD_MEAN                                      337252 non-null  float64
 5   POS_SK_DPD_DEF_MAX                                   337252 non-null  int64  
 6   POS_SK_DPD_DEF_MEAN                                  337252 non-null  float64
 7   POS_NAME_CONTRACT_STATUS_Active_MEAN                 337252 non-null  float64
 8   POS_NAME_CONTRACT_STATUS_Amortized debt_MEAN         3

## ✔ `CREDIT_CARD_BALANCE`

### Loading the tables

In [5]:
from home_credit.lightgbm_kernel_v2 import load_CCB_table_v1
data_v1 = load_CCB_table_v1()

In [6]:
from home_credit.lightgbm_kernel_v2 import load_CCB_table_v2
data_v2 = load_CCB_table_v2()

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\credit_card_balance.pqt


In [7]:
from pepper.pd_utils import df_neq
is_diff = df_neq(data_v1, data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [8]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(data_v1, data_v2)

dtypes are aligned


### One-hot encoding of the n-ary categories

In [13]:
from home_credit.lightgbm_kernel import one_hot_encoder
from home_credit.lightgbm_kernel_v2 import hot_encode_cats

ohe_data_v1, catvar_names_v1 = one_hot_encoder(data_v1)
ohe_data_v2, catvar_names_v2 = hot_encode_cats(data_v2, discard_constants=False)

If `discard_constants` is left at its default value `True`, our version removes any constant `_nan` columns that `dummy_na=True` might generate, which would show up as a column difference with v1 (for `bureau`, these were `CREDIT_ACTIVE_nan`, `CREDIT_CURRENCY_nan` and `CREDIT_TYPE_nan`). Such constant columns are of no interest.

In [14]:
display(ohe_data_v1.shape)
display(ohe_data_v2.shape)

(3840312, 30)

(3840312, 30)

In [15]:
cols_diff = ohe_data_v1.columns.difference(ohe_data_v2.columns)
display(cols_diff)

Index([], dtype='object')

In [16]:
import pandas as pd
display(pd.DataFrame({
    col : ohe_data_v1[col].value_counts()
    for col in cols_diff
}).T)

### Aggregation

In [17]:
from home_credit.lightgbm_kernel_v2 import group_CCB_by_curr_v1
agg_data_v1 = group_CCB_by_curr_v1(ohe_data_v1, catvar_names_v1)

In [18]:
from home_credit.lightgbm_kernel_v2 import group_CCB_by_curr_v2
agg_data_v2 = group_CCB_by_curr_v2(ohe_data_v2, catvar_names_v2)

In [19]:
from pepper.pd_utils import df_neq
is_diff = df_neq(agg_data_v1, agg_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


### Integrated version

In [3]:
from home_credit.lightgbm_kernel_v2 import credit_card_balance_v1
agg_data_v1 = credit_card_balance_v1()

In [4]:
from home_credit.lightgbm_kernel_v2 import credit_card_balance_v2
agg_data_v2 = credit_card_balance_v2()

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\credit_card_balance.pqt


In [5]:
from pepper.pd_utils import df_neq
is_diff = df_neq(agg_data_v1, agg_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [6]:
agg_data_v1.info()
agg_data_v2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 103558 entries, 100006 to 456250
Columns: 136 entries, CC_MONTHS_BALANCE_MIN to CC_COUNT
dtypes: bool(14), float64(99), int64(23)
memory usage: 98.6 MB
<class 'pandas.core.frame.DataFrame'>
Index: 103558 entries, 100006 to 456250
Columns: 136 entries, CC_MONTHS_BALANCE_MIN to CC_COUNT
dtypes: bool(14), float64(99), int64(23)
memory usage: 98.6 MB


In [7]:
display(agg_data_v1)

SK_ID_CURR,CC_MONTHS_BALANCE_MIN,CC_MONTHS_BALANCE_MAX,CC_MONTHS_BALANCE_MEAN,CC_MONTHS_BALANCE_SUM,CC_MONTHS_BALANCE_VAR,CC_AMT_BALANCE_MIN,CC_AMT_BALANCE_MAX,CC_AMT_BALANCE_MEAN,CC_AMT_BALANCE_SUM,CC_AMT_BALANCE_VAR,...,CC_NAME_CONTRACT_STATUS_Sent proposal_MAX,CC_NAME_CONTRACT_STATUS_Sent proposal_MEAN,CC_NAME_CONTRACT_STATUS_Sent proposal_SUM,CC_NAME_CONTRACT_STATUS_Sent proposal_VAR,CC_NAME_CONTRACT_STATUS_Signed_MIN,CC_NAME_CONTRACT_STATUS_Signed_MAX,CC_NAME_CONTRACT_STATUS_Signed_MEAN,CC_NAME_CONTRACT_STATUS_Signed_SUM,CC_NAME_CONTRACT_STATUS_Signed_VAR,CC_COUNT
100006,-6,-1,-3.5,-21,3.5,0.000,0.000,0.000000,0.000,0.000000e+00,...,False,0.0,0,0.0,False,False,0.0,0,0.0,6
100011,-75,-2,-38.5,-2849,462.5,0.000,189000.000,54482.111149,4031676.225,4.641321e+09,...,False,0.0,0,0.0,False,False,0.0,0,0.0,74
100013,-96,-1,-48.5,-4656,776.0,0.000,161420.220,18159.919219,1743352.245,1.869473e+09,...,False,0.0,0,0.0,False,False,0.0,0,0.0,96
100021,-18,-2,-10.0,-170,25.5,0.000,0.000,0.000000,0.000,0.000000e+00,...,False,0.0,0,0.0,False,False,0.0,0,0.0,17
100023,-11,-4,-7.5,-60,6.0,0.000,0.000,0.000000,0.000,0.000000e+00,...,False,0.0,0,0.0,False,False,0.0,0,0.0,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456244,-41,-1,-21.0,-861,143.5,0.000,453627.675,131834.730732,5405223.960,3.295703e+10,...,False,0.0,0,0.0,False,False,0.0,0,0.0,41
456246,-9,-2,-5.5,-44,6.0,0.000,43490.115,13136.731875,105093.855,3.335511e+08,...,False,0.0,0,0.0,False,False,0.0,0,0.0,8
456247,-96,-2,-49.0,-4655,760.0,0.000,190202.130,23216.396211,2205557.640,3.200871e+09,...,False,0.0,0,0.0,False,False,0.0,0,0.0,95
456248,-24,-2,-13.0,-299,46.0,0.000,0.000,0.000000,0.000,0.000000e+00,...,False,0.0,0,0.0,False,False,0.0,0,0.0,23


## ✔ `INSTALLMENTS_PAYMENTS`

### Loading the tables

In [5]:
from home_credit.lightgbm_kernel_v2 import load_IP_table_v1
data_v1 = load_IP_table_v1()

In [6]:
from home_credit.lightgbm_kernel_v2 import load_IP_table_v2
data_v2 = load_IP_table_v2()

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\installments_payments.pqt


In [7]:
from pepper.pd_utils import df_neq
is_diff = df_neq(data_v1, data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [8]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(data_v1, data_v2)

dtypes are aligned


### One-hot encoding of the n-ary categories

In [9]:
from home_credit.lightgbm_kernel import one_hot_encoder
from home_credit.lightgbm_kernel_v2 import hot_encode_cats

ohe_data_v1, catvar_names_v1 = one_hot_encoder(data_v1)
ohe_data_v2, catvar_names_v2 = hot_encode_cats(data_v2, discard_constants=False)

In [10]:
display(ohe_data_v1.shape)
display(ohe_data_v2.shape)

(13605401, 8)

(13605401, 8)

In [11]:
cols_diff = ohe_data_v1.columns.difference(ohe_data_v2.columns)
display(cols_diff)

Index([], dtype='object')

In [13]:
import pandas as pd
display(pd.DataFrame({
    col : ohe_data_v1[col].value_counts()
    for col in cols_diff
}).T)

In [14]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(ohe_data_v1, ohe_data_v2)

dtypes are aligned
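A minimal sketch of a one-hot encoder in the spirit of `one_hot_encoder`, assuming (the cells above do not confirm it) that it applies `pandas.get_dummies` to the `object` columns and adds a dedicated `NaN` indicator. The name `one_hot_encode_sketch` and the toy data are hypothetical. With such an encoder, a table that holds no `object` column comes back unchanged, which would be consistent with the identical shapes displayed above if `installments_payments` is purely numeric.

```python
import pandas as pd


def one_hot_encode_sketch(df: pd.DataFrame, nan_as_category: bool = True):
    """Return the encoded frame and the list of newly created columns."""
    original_columns = list(df.columns)
    cat_columns = [c for c in df.columns if df[c].dtype == "object"]
    encoded = pd.get_dummies(df, columns=cat_columns, dummy_na=nan_as_category)
    new_columns = [c for c in encoded.columns if c not in original_columns]
    return encoded, new_columns


toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0],
    "status": pd.Series(["Active", None, "Signed"], dtype=object),
})
encoded, new_cols = one_hot_encode_sketch(toy)
print(new_cols)

# A purely numeric table is returned as is, with no new column
numeric_only = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
same, no_new_cols = one_hot_encode_sketch(numeric_only)
print(same.shape, no_new_cols)
```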


### Creating additional derived features

In [15]:
from home_credit.lightgbm_kernel_v2 import add_IP_derived_features_v1
add_IP_derived_features_v1(ohe_data_v1)

In [16]:
from home_credit.lightgbm_kernel_v2 import add_IP_derived_features as add_IP_derived_features_v2
add_IP_derived_features_v2(ohe_data_v2)

In [17]:
from pepper.pd_utils import df_neq
is_diff = df_neq(ohe_data_v1, ohe_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0
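The bodies of `add_IP_derived_features_v1` and of its v2 counterpart are not reproduced in this notebook. The sketch below assumes the derived features suggested by the aggregated column names (`INSTAL_DPD_*`, `INSTAL_DBD_*`, `INSTAL_PAYMENT_PERC_*`): days past due, days before due, and the paid fraction of each instalment. The input column names (`AMT_PAYMENT`, `AMT_INSTALMENT`, `DAYS_ENTRY_PAYMENT`, `DAYS_INSTALMENT`) and the function name are assumptions. It uses a vectorized `clip` rather than a row-wise `apply`.

```python
import numpy as np
import pandas as pd


def add_ip_derived_features_sketch(ip: pd.DataFrame) -> None:
    """Add derived columns in place (hypothetical re-implementation)."""
    # Fraction of the instalment actually paid, and the remaining shortfall
    ip["PAYMENT_PERC"] = ip["AMT_PAYMENT"] / ip["AMT_INSTALMENT"]
    ip["PAYMENT_DIFF"] = ip["AMT_INSTALMENT"] - ip["AMT_PAYMENT"]
    # Positive delay = paid late, negative delay = paid early
    delay = ip["DAYS_ENTRY_PAYMENT"] - ip["DAYS_INSTALMENT"]
    # clip() keeps NaN, whereas a row-wise `x if x > 0 else 0` maps NaN to 0;
    # fillna(0) reproduces that row-wise behavior
    ip["DPD"] = delay.clip(lower=0).fillna(0)
    ip["DBD"] = (-delay).clip(lower=0).fillna(0)


ip = pd.DataFrame({
    "AMT_INSTALMENT": [100.0, 100.0, 200.0],
    "AMT_PAYMENT": [100.0, 50.0, np.nan],
    "DAYS_INSTALMENT": [-30.0, -20.0, -10.0],
    "DAYS_ENTRY_PAYMENT": [-35.0, -15.0, np.nan],
})
add_ip_derived_features_sketch(ip)
print(ip[["PAYMENT_PERC", "DPD", "DBD"]])
```

The `fillna(0)` matters for non-regression: without it, the vectorized version would leave `NaN` where a row-wise conditional would produce `0`.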


### Aggregation

In [18]:
from home_credit.lightgbm_kernel_v2 import group_IP_by_curr_v1
agg_data_v1 = group_IP_by_curr_v1(ohe_data_v1, catvar_names_v1)

In [19]:
from home_credit.lightgbm_kernel_v2 import group_IP_by_curr_v2
agg_data_v2 = group_IP_by_curr_v2(ohe_data_v2, catvar_names_v2)

In [20]:
from pepper.pd_utils import df_neq
is_diff = df_neq(agg_data_v1, agg_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0
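The aggregation step groups the instalment rows by client and flattens the resulting two-level column index into names such as `INSTAL_DPD_MAX`. Here is a minimal sketch, assuming `SK_ID_CURR` as the grouping key and a small illustrative subset of aggregations. The actual dictionaries used by `group_IP_by_curr_v1` and `group_IP_by_curr_v2` are not shown in this notebook, and `group_by_curr_sketch` is a hypothetical name.

```python
import pandas as pd


def group_by_curr_sketch(ip: pd.DataFrame) -> pd.DataFrame:
    """Aggregate instalment rows per client (illustrative subset only)."""
    aggregations = {
        "NUM_INSTALMENT_VERSION": ["nunique"],
        "DPD": ["max", "mean", "sum"],
    }
    agg = ip.groupby("SK_ID_CURR").agg(aggregations)
    # Flatten ("DPD", "max") into "INSTAL_DPD_MAX"
    agg.columns = [f"INSTAL_{col}_{stat.upper()}" for col, stat in agg.columns]
    # Number of instalment rows per client
    agg["INSTAL_COUNT"] = ip.groupby("SK_ID_CURR").size()
    return agg


ip = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "NUM_INSTALMENT_VERSION": [1.0, 2.0, 1.0],
    "DPD": [0.0, 4.0, 2.0],
})
agg = group_by_curr_sketch(ip)
print(agg)
```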


### Integrated version

In [1]:
from home_credit.lightgbm_kernel_v2 import installments_payments_v1
agg_data_v1 = installments_payments_v1()

In [2]:
from home_credit.lightgbm_kernel_v2 import installments_payments_v2
agg_data_v2 = installments_payments_v2()

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\installments_payments.pqt


In [3]:
from pepper.pd_utils import df_neq
is_diff = df_neq(agg_data_v1, agg_data_v2)
print("n_diffs:", is_diff.sum().sum())

n_diffs: 0


In [4]:
agg_data_v1.info()
agg_data_v2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 339587 entries, 100001 to 456255
Data columns (total 26 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   INSTAL_NUM_INSTALMENT_VERSION_NUNIQUE  339587 non-null  int64  
 1   INSTAL_DPD_MAX                         339587 non-null  float64
 2   INSTAL_DPD_MEAN                        339587 non-null  float64
 3   INSTAL_DPD_SUM                         339587 non-null  float64
 4   INSTAL_DBD_MAX                         339587 non-null  float64
 5   INSTAL_DBD_MEAN                        339587 non-null  float64
 6   INSTAL_DBD_SUM                         339587 non-null  float64
 7   INSTAL_PAYMENT_PERC_MAX                339578 non-null  float64
 8   INSTAL_PAYMENT_PERC_MEAN               339559 non-null  float64
 9   INSTAL_PAYMENT_PERC_SUM                339568 non-null  float64
 10  INSTAL_PAYMENT_PERC_VAR                338591 non-null  

# `ALL`

Integration test and production of the preprocessed data files used as input to the modeling step.

## Original version

On a sample of $10\,000$ rows.

In [4]:
from home_credit.lightgbm_kernel_v2 import main_preproc
from pepper.utils import cls
data = main_preproc(nrows=10_000, version=1, verbosity=1)
cls()
display(data)

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,CC_NAME_CONTRACT_STATUS_Sent_proposal_MIN_False,CC_NAME_CONTRACT_STATUS_Sent_proposal_MIN_True,CC_NAME_CONTRACT_STATUS_Sent_proposal_MIN_nan,CC_NAME_CONTRACT_STATUS_Sent_proposal_MAX_False,CC_NAME_CONTRACT_STATUS_Sent_proposal_MAX_True,CC_NAME_CONTRACT_STATUS_Sent_proposal_MAX_nan,CC_NAME_CONTRACT_STATUS_Signed_MIN_False,CC_NAME_CONTRACT_STATUS_Signed_MIN_nan,CC_NAME_CONTRACT_STATUS_Signed_MAX_False,CC_NAME_CONTRACT_STATUS_Signed_MAX_nan
267898,410412,0,0,360000.0,427176.0,21942.0,306000.0,0.046220,-20202,-5440,...,True,False,False,True,False,False,True,False,True,False
174136,301800,0,0,180000.0,578979.0,25632.0,517500.0,0.035792,-17482,-7000,...,False,False,True,False,False,True,False,True,False,True
352198,427114,-1,0,121500.0,431280.0,20875.5,360000.0,0.010276,-18567,-11136,...,False,False,True,False,False,True,False,True,False,True
278976,423210,1,1,180000.0,1406362.5,50512.5,1260000.0,0.046220,-15100,-4095,...,False,False,True,False,False,True,False,True,False,True
282066,426705,0,0,211500.0,675000.0,49248.0,675000.0,0.026392,-17171,-1852,...,False,False,True,False,False,True,False,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149443,273263,1,0,225000.0,755190.0,38556.0,675000.0,0.035792,-18253,-837,...,False,False,True,False,False,True,False,True,False,True
69101,180145,0,0,76500.0,273636.0,15408.0,247500.0,0.009549,-20189,365243,...,False,False,True,False,False,True,False,True,False,True
63925,174128,0,0,202500.0,517500.0,16690.5,517500.0,0.026392,-21603,-2241,...,False,False,True,False,False,True,False,True,False,True
183985,313250,0,0,405000.0,1094688.0,39451.5,945000.0,0.046220,-20427,-1983,...,True,False,False,True,False,False,True,False,True,False


In [2]:
from home_credit.lightgbm_kernel_v2 import main_preproc
from pepper.env import get_tmp_dir
from pepper.utils import cls
import os
data = main_preproc(nrows=None, version=1, verbosity=1)
# cls()
display(data)
filepath = os.path.join(get_tmp_dir(), "prep_dataset_v1.pqt")
data.to_parquet(filepath, engine="pyarrow", compression="gzip")

'Preprocess `bureau` - Version 1'

elapsed time: 23 s, 609 ms


'Preprocess `previous_application` - Version 1'

elapsed time: 27 s, 709 ms


'Preprocess `pos_cash_balance` - Version 1'

elapsed time: 15 s, 541 ms


'Preprocess `credit_card_balance` - Version 1'

elapsed time: 18 s, 402 ms


'Preprocess `installments_payments` - Version 1'

elapsed time: 35 s, 104 ms


Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,CC_NAME_CONTRACT_STATUS_Sent_proposal_MIN_nan,CC_NAME_CONTRACT_STATUS_Sent_proposal_MAX_False,CC_NAME_CONTRACT_STATUS_Sent_proposal_MAX_True,CC_NAME_CONTRACT_STATUS_Sent_proposal_MAX_nan,CC_NAME_CONTRACT_STATUS_Signed_MIN_False,CC_NAME_CONTRACT_STATUS_Signed_MIN_True,CC_NAME_CONTRACT_STATUS_Signed_MIN_nan,CC_NAME_CONTRACT_STATUS_Signed_MAX_False,CC_NAME_CONTRACT_STATUS_Signed_MAX_True,CC_NAME_CONTRACT_STATUS_Signed_MAX_nan
0,100002,1,0,202500.0,406597.5,24700.5,351000.0,0.018801,-9461,-637,...,True,False,False,True,False,False,True,False,False,True
1,100003,0,0,270000.0,1293502.5,35698.5,1129500.0,0.003541,-16765,-1188,...,True,False,False,True,False,False,True,False,False,True
2,100004,0,0,67500.0,135000.0,6750.0,135000.0,0.010032,-19046,-225,...,True,False,False,True,False,False,True,False,False,True
3,100006,0,0,135000.0,312682.5,29686.5,297000.0,0.008019,-19005,-3039,...,False,True,False,False,True,False,False,True,False,False
4,100007,0,0,121500.0,513000.0,21865.5,513000.0,0.028663,-19932,-3038,...,True,False,False,True,False,False,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
356250,456221,-1,0,121500.0,412560.0,17473.5,270000.0,0.002042,-19970,-5169,...,True,False,False,True,False,False,True,False,False,True
356251,456222,-1,2,157500.0,622413.0,31909.5,495000.0,0.035792,-11186,-1149,...,True,False,False,True,False,False,True,False,False,True
356252,456223,-1,1,202500.0,315000.0,33205.5,315000.0,0.026392,-15922,-3037,...,True,False,False,True,False,False,True,False,False,True
356253,456224,-1,0,225000.0,450000.0,25128.0,450000.0,0.018850,-13968,-2731,...,True,False,False,True,False,False,True,False,False,True


## Improved version

On a sample of $10\,000$ rows.

The comparison is not a fair one here: the original version takes its `nrows` from the first rows of the file, whereas the second version draws a genuine random sample.
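The difference between the two sampling modes can be illustrated on a toy CSV: `pandas.read_csv(..., nrows=n)` keeps the first `n` rows of the file, while `DataFrame.sample` draws rows at random. How `main_preproc` actually samples in version 2 is not shown here. The toy data and the `random_state` value are arbitrary choices made for the example.

```python
import io

import pandas as pd

# Toy file: 100 clients in increasing SK_ID_CURR order
csv_text = "SK_ID_CURR,AMT_CREDIT\n" + "\n".join(
    f"{100000 + i},{1000 * i}" for i in range(100)
)

# First n rows of the file, in file order
head_sample = pd.read_csv(io.StringIO(csv_text), nrows=10)

# Uniform random draw (without replacement) over the whole table
full = pd.read_csv(io.StringIO(csv_text))
random_sample = full.sample(n=10, random_state=42)

print(head_sample["SK_ID_CURR"].tolist())
print(sorted(random_sample["SK_ID_CURR"].tolist()))
```

If the file is sorted or grouped in any way, the first rows are not representative of the whole table, so results obtained on the two samples are not directly comparable.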

In [6]:
from home_credit.lightgbm_kernel_v2 import main_preproc
from pepper.utils import cls
data = main_preproc(nrows=10_000, version=2, verbosity=1)
cls()
display(data)

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,CC_NAME_CONTRACT_STATUS_Demand_MAX_False,CC_NAME_CONTRACT_STATUS_Demand_MAX_nan,CC_NAME_CONTRACT_STATUS_Sent_proposal_MIN_False,CC_NAME_CONTRACT_STATUS_Sent_proposal_MIN_nan,CC_NAME_CONTRACT_STATUS_Sent_proposal_MAX_False,CC_NAME_CONTRACT_STATUS_Sent_proposal_MAX_nan,CC_NAME_CONTRACT_STATUS_Signed_MIN_False,CC_NAME_CONTRACT_STATUS_Signed_MIN_nan,CC_NAME_CONTRACT_STATUS_Signed_MAX_False,CC_NAME_CONTRACT_STATUS_Signed_MAX_nan
355725,452195,-1,0,85500.0,225000.0,17410.5,225000.0,0.009334,-19086,-135,...,False,True,False,True,False,True,False,True,False,True
223972,359414,0,0,450000.0,1012500.0,56538.0,1012500.0,0.007274,-20572,365243,...,False,True,False,True,False,True,False,True,False,True
57625,166783,0,0,112500.0,405000.0,20677.5,405000.0,0.010556,-20355,-3765,...,False,True,False,True,False,True,False,True,False,True
217502,352014,1,0,153000.0,64692.0,4630.5,54000.0,0.002042,-19877,365243,...,False,True,False,True,False,True,False,True,False,True
34147,139571,0,0,90000.0,755190.0,30078.0,675000.0,0.028663,-22124,365243,...,False,True,False,True,False,True,False,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325104,227701,-1,2,225000.0,1096020.0,52857.0,900000.0,0.031329,-12953,-997,...,True,False,True,False,True,False,True,False,True,False
295603,442478,0,0,112500.0,879480.0,25843.5,630000.0,0.028663,-20294,365243,...,False,True,False,True,False,True,False,True,False,True
59018,168404,0,0,103500.0,1192500.0,34996.5,1192500.0,0.018209,-10193,-1957,...,False,True,False,True,False,True,False,True,False,True
277316,421341,0,0,112500.0,760225.5,32337.0,679500.0,0.007020,-15570,-2775,...,False,True,False,True,False,True,False,True,False,True


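On the full dataset. Despite a handicap in favor of v1 (the v2 `application` table is realigned on that of v1), v2 is faster overall: summing the five `elapsed time` values logged for each version gives about 100 s for v2 against about 120 s for v1. Only the `bureau` step is slower in v2.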

In [3]:
from home_credit.lightgbm_kernel_v2 import main_preproc
from pepper.env import get_tmp_dir
from pepper.utils import cls
import os
data = main_preproc(nrows=None, version=2, verbosity=1)
#cls()
display(data)
filepath = os.path.join(get_tmp_dir(), "prep_dataset_v2.pqt")
data.to_parquet(filepath, engine="pyarrow", compression="gzip")

'Preprocess `bureau` - Version 2'

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\bureau.pqt
load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\bureau_balance.pqt
elapsed time: 26 s, 820 ms


'Preprocess `previous_application` - Version 2'

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\previous_application.pqt
elapsed time: 27 s, 395 ms


'Preprocess `pos_cash_balance` - Version 2'

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\POS_CASH_balance.pqt
elapsed time: 14 s, 107 ms


'Preprocess `credit_card_balance` - Version 2'

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\credit_card_balance.pqt
elapsed time: 12 s, 236 ms


'Preprocess `installments_payments` - Version 2'

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\installments_payments.pqt
elapsed time: 19 s, 640 ms


Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,CC_NAME_CONTRACT_STATUS_Sent_proposal_MIN_nan,CC_NAME_CONTRACT_STATUS_Sent_proposal_MAX_False,CC_NAME_CONTRACT_STATUS_Sent_proposal_MAX_True,CC_NAME_CONTRACT_STATUS_Sent_proposal_MAX_nan,CC_NAME_CONTRACT_STATUS_Signed_MIN_False,CC_NAME_CONTRACT_STATUS_Signed_MIN_True,CC_NAME_CONTRACT_STATUS_Signed_MIN_nan,CC_NAME_CONTRACT_STATUS_Signed_MAX_False,CC_NAME_CONTRACT_STATUS_Signed_MAX_True,CC_NAME_CONTRACT_STATUS_Signed_MAX_nan
0,100002,1,0,202500.0,406597.5,24700.5,351000.0,0.018801,-9461,-637,...,True,False,False,True,False,False,True,False,False,True
1,100003,0,0,270000.0,1293502.5,35698.5,1129500.0,0.003541,-16765,-1188,...,True,False,False,True,False,False,True,False,False,True
2,100004,0,0,67500.0,135000.0,6750.0,135000.0,0.010032,-19046,-225,...,True,False,False,True,False,False,True,False,False,True
3,100006,0,0,135000.0,312682.5,29686.5,297000.0,0.008019,-19005,-3039,...,False,True,False,False,True,False,False,True,False,False
4,100007,0,0,121500.0,513000.0,21865.5,513000.0,0.028663,-19932,-3038,...,True,False,False,True,False,False,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
356250,456221,-1,0,121500.0,412560.0,17473.5,270000.0,0.002042,-19970,-5169,...,True,False,False,True,False,False,True,False,False,True
356251,456222,-1,2,157500.0,622413.0,31909.5,495000.0,0.035792,-11186,-1149,...,True,False,False,True,False,False,True,False,False,True
356252,456223,-1,1,202500.0,315000.0,33205.5,315000.0,0.026392,-15922,-3037,...,True,False,False,True,False,False,True,False,False,True
356253,456224,-1,0,225000.0,450000.0,25128.0,450000.0,0.018850,-13968,-2731,...,True,False,False,True,False,False,True,False,False,True


## `KFOLD_LIGHTGBM`

In [None]:
from home_credit.lightgbm_kernel import kfold_lightgbm as kfold_lightgbm_v1
from home_credit.lightgbm_kernel_v2 import kfold_lightgbm as kfold_lightgbm_v2

# Compare

## `DISPLAY_IMPORTANCES`

In [None]:
from home_credit.lightgbm_kernel import display_importances as display_importances_v1
from home_credit.lightgbm_kernel_v2 import display_importances as display_importances_v2

# Compare

## `MAIN`

In [1]:
from home_credit.lightgbm_kernel import main as main_v1
from home_credit.lightgbm_kernel_v2 import main as main_v2

# Compare

In [14]:
from pepper.utils import cls
print("A")
print("B")
display("Machin")
print("C")
cls()
print("D")
display("Truc")
print("E")


D


'Truc'

E


In [2]:
main_v2(version=1, verbosity=2)

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\application_train.pqt
load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\application_test.pqt
app


Index([], dtype='object')

Preprocess `bureau` - Version 1
elapsed time: 23s
`bureau` shape:: (305811, 112)

data shape: (356255, 241)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 241 entries, SK_ID_CURR to PAYMENT_RATE
dtypes: bool(130), float64(70), int64(41)
memory usage: 345.9 MB

adj_table shape: (305811, 112)
<class 'pandas.core.frame.DataFrame'>
Index: 305811 entries, 100001 to 456255
Columns: 112 entries, BURO_DAYS_CREDIT_MIN to CLOSED_MONTHS_BALANCE_SIZE_SUM
dtypes: float64(108), int64(4)
memory usage: 263.6 MB

updated_data shape: (356255, 353)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 353 entries, SK_ID_CURR to CLOSED_MONTHS_BALANCE_SIZE_SUM
dtypes: bool(130), float64(182), int64(41)
memory usage: 650.3 MB
app+bur


Index([], dtype='object')

Preprocess `previous_application` - Version 1
elapsed time: 25s
`previous_application` shape:: (338857, 233)

data shape: (356255, 353)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 353 entries, SK_ID_CURR to CLOSED_MONTHS_BALANCE_SIZE_SUM
dtypes: bool(130), float64(182), int64(41)
memory usage: 650.3 MB

adj_table shape: (338857, 233)
<class 'pandas.core.frame.DataFrame'>
Index: 338857 entries, 100001 to 456255
Columns: 233 entries, PREV_AMT_ANNUITY_MIN to REFUSED_CNT_PAYMENT_SUM
dtypes: float64(229), int64(4)
memory usage: 605.0 MB

updated_data shape: (356255, 586)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 586 entries, SK_ID_CURR to REFUSED_CNT_PAYMENT_SUM
dtypes: bool(130), float64(415), int64(41)
memory usage: 1.3 GB
app+bur+prv


Index([], dtype='object')

Preprocess `pos_cash_balance` - Version 1
elapsed time: 13s
`pos_cash_balance` shape:: (337252, 17)

data shape: (356255, 586)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 586 entries, SK_ID_CURR to REFUSED_CNT_PAYMENT_SUM
dtypes: bool(130), float64(415), int64(41)
memory usage: 1.3 GB

adj_table shape: (337252, 17)
<class 'pandas.core.frame.DataFrame'>
Index: 337252 entries, 100001 to 456255
Columns: 17 entries, POS_MONTHS_BALANCE_MAX to POS_COUNT
dtypes: float64(12), int64(5)
memory usage: 46.3 MB

updated_data shape: (356255, 603)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 603 entries, SK_ID_CURR to POS_COUNT
dtypes: bool(130), float64(432), int64(41)
memory usage: 1.3 GB
app+bur+prv+pcb


Index([], dtype='object')

Preprocess `credit_card_balance` - Version 1
elapsed time: 17s
`credit_card_balance` shape:: (103558, 136)

data shape: (356255, 603)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 603 entries, SK_ID_CURR to POS_COUNT
dtypes: bool(130), float64(432), int64(41)
memory usage: 1.3 GB

adj_table shape: (103558, 136)
<class 'pandas.core.frame.DataFrame'>
Index: 103558 entries, 100006 to 456250
Columns: 136 entries, CC_MONTHS_BALANCE_MIN to CC_COUNT
dtypes: bool(14), float64(99), int64(23)
memory usage: 98.6 MB

updated_data shape: (356255, 739)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 739 entries, SK_ID_CURR to CC_COUNT
dtypes: bool(130), float64(554), int64(41), object(14)
memory usage: 1.7+ GB
app+bur+prv+pcb+ccb


Index(['CC_NAME_CONTRACT_STATUS_Active_MIN',
       'CC_NAME_CONTRACT_STATUS_Active_MAX',
       'CC_NAME_CONTRACT_STATUS_Approved_MIN',
       'CC_NAME_CONTRACT_STATUS_Approved_MAX',
       'CC_NAME_CONTRACT_STATUS_Completed_MIN',
       'CC_NAME_CONTRACT_STATUS_Completed_MAX',
       'CC_NAME_CONTRACT_STATUS_Demand_MIN',
       'CC_NAME_CONTRACT_STATUS_Demand_MAX',
       'CC_NAME_CONTRACT_STATUS_Refused_MIN',
       'CC_NAME_CONTRACT_STATUS_Refused_MAX',
       'CC_NAME_CONTRACT_STATUS_Sent proposal_MIN',
       'CC_NAME_CONTRACT_STATUS_Sent proposal_MAX',
       'CC_NAME_CONTRACT_STATUS_Signed_MIN',
       'CC_NAME_CONTRACT_STATUS_Signed_MAX'],
      dtype='object')

Preprocess `installments_payments` - Version 1
elapsed time: 30s
`installments_payments` shape:: (339587, 26)

data shape: (356255, 739)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 739 entries, SK_ID_CURR to CC_COUNT
dtypes: bool(130), float64(554), int64(41), object(14)
memory usage: 1.7+ GB

adj_table shape: (339587, 26)
<class 'pandas.core.frame.DataFrame'>
Index: 339587 entries, 100001 to 456255
Columns: 26 entries, INSTAL_NUM_INSTALMENT_VERSION_NUNIQUE to INSTAL_COUNT
dtypes: float64(24), int64(2)
memory usage: 70.0 MB

updated_data shape: (356255, 765)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 765 entries, SK_ID_CURR to INSTAL_COUNT
dtypes: bool(130), float64(580), int64(41), object(14)
memory usage: 1.7+ GB
app+bur+prv+pcb+ccb+ip


Index(['CC_NAME_CONTRACT_STATUS_Active_MIN',
       'CC_NAME_CONTRACT_STATUS_Active_MAX',
       'CC_NAME_CONTRACT_STATUS_Approved_MIN',
       'CC_NAME_CONTRACT_STATUS_Approved_MAX',
       'CC_NAME_CONTRACT_STATUS_Completed_MIN',
       'CC_NAME_CONTRACT_STATUS_Completed_MAX',
       'CC_NAME_CONTRACT_STATUS_Demand_MIN',
       'CC_NAME_CONTRACT_STATUS_Demand_MAX',
       'CC_NAME_CONTRACT_STATUS_Refused_MIN',
       'CC_NAME_CONTRACT_STATUS_Refused_MAX',
       'CC_NAME_CONTRACT_STATUS_Sent proposal_MIN',
       'CC_NAME_CONTRACT_STATUS_Sent proposal_MAX',
       'CC_NAME_CONTRACT_STATUS_Signed_MIN',
       'CC_NAME_CONTRACT_STATUS_Signed_MAX'],
      dtype='object')

Re-hot-encoded cats:


['CC_NAME_CONTRACT_STATUS_Active_MIN_False',
 'CC_NAME_CONTRACT_STATUS_Active_MIN_True',
 'CC_NAME_CONTRACT_STATUS_Active_MIN_nan',
 'CC_NAME_CONTRACT_STATUS_Active_MAX_False',
 'CC_NAME_CONTRACT_STATUS_Active_MAX_True',
 'CC_NAME_CONTRACT_STATUS_Active_MAX_nan',
 'CC_NAME_CONTRACT_STATUS_Approved_MIN_False',
 'CC_NAME_CONTRACT_STATUS_Approved_MIN_nan',
 'CC_NAME_CONTRACT_STATUS_Approved_MAX_False',
 'CC_NAME_CONTRACT_STATUS_Approved_MAX_True',
 'CC_NAME_CONTRACT_STATUS_Approved_MAX_nan',
 'CC_NAME_CONTRACT_STATUS_Completed_MIN_False',
 'CC_NAME_CONTRACT_STATUS_Completed_MIN_True',
 'CC_NAME_CONTRACT_STATUS_Completed_MIN_nan',
 'CC_NAME_CONTRACT_STATUS_Completed_MAX_False',
 'CC_NAME_CONTRACT_STATUS_Completed_MAX_True',
 'CC_NAME_CONTRACT_STATUS_Completed_MAX_nan',
 'CC_NAME_CONTRACT_STATUS_Demand_MIN_False',
 'CC_NAME_CONTRACT_STATUS_Demand_MIN_True',
 'CC_NAME_CONTRACT_STATUS_Demand_MIN_nan',
 'CC_NAME_CONTRACT_STATUS_Demand_MAX_False',
 'CC_NAME_CONTRACT_STATUS_Demand_MAX_True',
 'C

Run LightGBM with kfold - Version 1
Starting LightGBM. Train shape: (307511, 790), test shape: (48744, 790)




[200]	training's auc: 0.796129	training's binary_logloss: 0.235405	valid_1's auc: 0.776406	valid_1's binary_logloss: 0.244073
[400]	training's auc: 0.819367	training's binary_logloss: 0.225688	valid_1's auc: 0.787153	valid_1's binary_logloss: 0.239587
[600]	training's auc: 0.834669	training's binary_logloss: 0.21945	valid_1's auc: 0.790221	valid_1's binary_logloss: 0.238259
[800]	training's auc: 0.846851	training's binary_logloss: 0.214426	valid_1's auc: 0.790854	valid_1's binary_logloss: 0.237926
[1000]	training's auc: 0.857431	training's binary_logloss: 0.209955	valid_1's auc: 0.790987	valid_1's binary_logloss: 0.23781
Fold  1AUC : 0.791202




[200]	training's auc: 0.796982	training's binary_logloss: 0.235832	valid_1's auc: 0.771713	valid_1's binary_logloss: 0.239503
[400]	training's auc: 0.819513	training's binary_logloss: 0.226194	valid_1's auc: 0.782979	valid_1's binary_logloss: 0.235243
[600]	training's auc: 0.834717	training's binary_logloss: 0.219979	valid_1's auc: 0.786855	valid_1's binary_logloss: 0.233773
[800]	training's auc: 0.846506	training's binary_logloss: 0.215063	valid_1's auc: 0.788798	valid_1's binary_logloss: 0.23299
[1000]	training's auc: 0.857088	training's binary_logloss: 0.210641	valid_1's auc: 0.789691	valid_1's binary_logloss: 0.232622
[1200]	training's auc: 0.866377	training's binary_logloss: 0.206599	valid_1's auc: 0.790329	valid_1's binary_logloss: 0.232376
[1400]	training's auc: 0.87469	training's binary_logloss: 0.202865	valid_1's auc: 0.790831	valid_1's binary_logloss: 0.232168
[1600]	training's auc: 0.882675	training's binary_logloss: 0.199154	valid_1's auc: 0.791229	valid_1's binary_logloss:



[200]	training's auc: 0.796872	training's binary_logloss: 0.235592	valid_1's auc: 0.773494	valid_1's binary_logloss: 0.241988
[400]	training's auc: 0.819731	training's binary_logloss: 0.225886	valid_1's auc: 0.782809	valid_1's binary_logloss: 0.238029
[600]	training's auc: 0.835234	training's binary_logloss: 0.219564	valid_1's auc: 0.786189	valid_1's binary_logloss: 0.236851
[800]	training's auc: 0.847328	training's binary_logloss: 0.214601	valid_1's auc: 0.787304	valid_1's binary_logloss: 0.23651
[1000]	training's auc: 0.857806	training's binary_logloss: 0.210183	valid_1's auc: 0.787638	valid_1's binary_logloss: 0.236353
[1200]	training's auc: 0.867206	training's binary_logloss: 0.206031	valid_1's auc: 0.787876	valid_1's binary_logloss: 0.236255
Fold  3AUC : 0.787974




[200]	training's auc: 0.796517	training's binary_logloss: 0.235758	valid_1's auc: 0.778316	valid_1's binary_logloss: 0.240596
[400]	training's auc: 0.819335	training's binary_logloss: 0.226075	valid_1's auc: 0.789318	valid_1's binary_logloss: 0.236225
[600]	training's auc: 0.834334	training's binary_logloss: 0.219895	valid_1's auc: 0.794046	valid_1's binary_logloss: 0.234581
[800]	training's auc: 0.846701	training's binary_logloss: 0.214833	valid_1's auc: 0.795875	valid_1's binary_logloss: 0.233912
[1000]	training's auc: 0.857143	training's binary_logloss: 0.210452	valid_1's auc: 0.79671	valid_1's binary_logloss: 0.23359
[1200]	training's auc: 0.866817	training's binary_logloss: 0.206236	valid_1's auc: 0.797256	valid_1's binary_logloss: 0.233404
[1400]	training's auc: 0.875263	training's binary_logloss: 0.202422	valid_1's auc: 0.797511	valid_1's binary_logloss: 0.23331
[1600]	training's auc: 0.883205	training's binary_logloss: 0.198708	valid_1's auc: 0.797869	valid_1's binary_logloss: 



[200]	training's auc: 0.796854	training's binary_logloss: 0.235138	valid_1's auc: 0.777641	valid_1's binary_logloss: 0.245884
[400]	training's auc: 0.819683	training's binary_logloss: 0.22545	valid_1's auc: 0.787108	valid_1's binary_logloss: 0.241673
[600]	training's auc: 0.834723	training's binary_logloss: 0.219305	valid_1's auc: 0.790719	valid_1's binary_logloss: 0.240242
[800]	training's auc: 0.846646	training's binary_logloss: 0.214376	valid_1's auc: 0.792381	valid_1's binary_logloss: 0.239642
[1000]	training's auc: 0.856803	training's binary_logloss: 0.210048	valid_1's auc: 0.793177	valid_1's binary_logloss: 0.239363
[1200]	training's auc: 0.866384	training's binary_logloss: 0.205906	valid_1's auc: 0.793968	valid_1's binary_logloss: 0.239113
[1400]	training's auc: 0.874895	training's binary_logloss: 0.202052	valid_1's auc: 0.794514	valid_1's binary_logloss: 0.238958
[1600]	training's auc: 0.882992	training's binary_logloss: 0.198289	valid_1's auc: 0.794744	valid_1's binary_logloss



[200]	training's auc: 0.797674	training's binary_logloss: 0.234889	valid_1's auc: 0.771176	valid_1's binary_logloss: 0.246954
[400]	training's auc: 0.820257	training's binary_logloss: 0.225192	valid_1's auc: 0.7808	valid_1's binary_logloss: 0.243275
[600]	training's auc: 0.835372	training's binary_logloss: 0.218949	valid_1's auc: 0.784102	valid_1's binary_logloss: 0.242142
[800]	training's auc: 0.847436	training's binary_logloss: 0.213959	valid_1's auc: 0.785199	valid_1's binary_logloss: 0.241762
[1000]	training's auc: 0.857558	training's binary_logloss: 0.209607	valid_1's auc: 0.785809	valid_1's binary_logloss: 0.241565
[1200]	training's auc: 0.866773	training's binary_logloss: 0.205584	valid_1's auc: 0.786213	valid_1's binary_logloss: 0.241378
[1400]	training's auc: 0.875369	training's binary_logloss: 0.201714	valid_1's auc: 0.786594	valid_1's binary_logloss: 0.241279
[1600]	training's auc: 0.883347	training's binary_logloss: 0.197969	valid_1's auc: 0.786964	valid_1's binary_logloss:



[200]	training's auc: 0.797388	training's binary_logloss: 0.235814	valid_1's auc: 0.771471	valid_1's binary_logloss: 0.239275
[400]	training's auc: 0.819896	training's binary_logloss: 0.226182	valid_1's auc: 0.7815	valid_1's binary_logloss: 0.235332
[600]	training's auc: 0.834412	training's binary_logloss: 0.220177	valid_1's auc: 0.784986	valid_1's binary_logloss: 0.234076
[800]	training's auc: 0.846488	training's binary_logloss: 0.215162	valid_1's auc: 0.786989	valid_1's binary_logloss: 0.233423
[1000]	training's auc: 0.856975	training's binary_logloss: 0.210705	valid_1's auc: 0.788064	valid_1's binary_logloss: 0.233109
[1200]	training's auc: 0.866071	training's binary_logloss: 0.206712	valid_1's auc: 0.788755	valid_1's binary_logloss: 0.232891
[1400]	training's auc: 0.875238	training's binary_logloss: 0.202628	valid_1's auc: 0.788887	valid_1's binary_logloss: 0.2328
[1600]	training's auc: 0.883077	training's binary_logloss: 0.198931	valid_1's auc: 0.788988	valid_1's binary_logloss: 0



[200]	training's auc: 0.797024	training's binary_logloss: 0.23557	valid_1's auc: 0.773027	valid_1's binary_logloss: 0.242238
[400]	training's auc: 0.819732	training's binary_logloss: 0.225939	valid_1's auc: 0.783602	valid_1's binary_logloss: 0.2379
[600]	training's auc: 0.83462	training's binary_logloss: 0.219808	valid_1's auc: 0.787808	valid_1's binary_logloss: 0.236302
[800]	training's auc: 0.846449	training's binary_logloss: 0.214896	valid_1's auc: 0.789358	valid_1's binary_logloss: 0.235639
[1000]	training's auc: 0.856827	training's binary_logloss: 0.210485	valid_1's auc: 0.790015	valid_1's binary_logloss: 0.235355
[1200]	training's auc: 0.866044	training's binary_logloss: 0.206411	valid_1's auc: 0.790379	valid_1's binary_logloss: 0.235192
[1400]	training's auc: 0.874435	training's binary_logloss: 0.202608	valid_1's auc: 0.790842	valid_1's binary_logloss: 0.235084
[1600]	training's auc: 0.882184	training's binary_logloss: 0.199023	valid_1's auc: 0.79094	valid_1's binary_logloss: 0.



[200]	training's auc: 0.79684	training's binary_logloss: 0.235216	valid_1's auc: 0.775433	valid_1's binary_logloss: 0.245385
[400]	training's auc: 0.819465	training's binary_logloss: 0.225635	valid_1's auc: 0.785872	valid_1's binary_logloss: 0.241102
[600]	training's auc: 0.83428	training's binary_logloss: 0.219459	valid_1's auc: 0.790024	valid_1's binary_logloss: 0.239685
[800]	training's auc: 0.846372	training's binary_logloss: 0.214496	valid_1's auc: 0.791795	valid_1's binary_logloss: 0.239022
[1000]	training's auc: 0.856687	training's binary_logloss: 0.210121	valid_1's auc: 0.792456	valid_1's binary_logloss: 0.238785
[1200]	training's auc: 0.866074	training's binary_logloss: 0.206004	valid_1's auc: 0.792755	valid_1's binary_logloss: 0.238628
[1400]	training's auc: 0.874823	training's binary_logloss: 0.202125	valid_1's auc: 0.792789	valid_1's binary_logloss: 0.238582
Fold  9AUC : 0.793078




[200]	training's auc: 0.796642	training's binary_logloss: 0.236064	valid_1's auc: 0.774787	valid_1's binary_logloss: 0.238156
[400]	training's auc: 0.819497	training's binary_logloss: 0.2264	valid_1's auc: 0.784789	valid_1's binary_logloss: 0.234051
[600]	training's auc: 0.834813	training's binary_logloss: 0.220114	valid_1's auc: 0.78812	valid_1's binary_logloss: 0.232658
[800]	training's auc: 0.847119	training's binary_logloss: 0.215081	valid_1's auc: 0.789574	valid_1's binary_logloss: 0.232125
[1000]	training's auc: 0.857615	training's binary_logloss: 0.210632	valid_1's auc: 0.790117	valid_1's binary_logloss: 0.231924
[1200]	training's auc: 0.866916	training's binary_logloss: 0.206513	valid_1's auc: 0.790417	valid_1's binary_logloss: 0.231808
[1400]	training's auc: 0.87556	training's binary_logloss: 0.202574	valid_1's auc: 0.790696	valid_1's binary_logloss: 0.231718
[1600]	training's auc: 0.883663	training's binary_logloss: 0.198768	valid_1's auc: 0.791212	valid_1's binary_logloss: 0

Traceback (most recent call last):
  File "C:\Users\franc\AppData\Roaming\Python\Python311\site-packages\IPython\core\interactiveshell.py", line 3505, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "C:\Users\franc\AppData\Local\Temp\ipykernel_6872\3426908498.py", line 1, in <module>
    main_v2(version=1, verbosity=2)
  File "C:\Users\franc\Projects\pepper_credit_scoring_tool\modules\home_credit\lightgbm_kernel_v2.py", line 1613, in main
  File "C:\Users\franc\Projects\pepper_credit_scoring_tool\modules\home_credit\lightgbm_kernel_v2.py", line 1484, in kfold_lightgbm
    "Do not support special JSON characters in feature name"
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\franc\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\metrics\_ranking.py", line 572, in roc_auc_score
    return _average_binary_score(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\franc\AppData\Local\Programs\Python\Python311\Lib\site
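In the log of this run, the 14 `bool` columns of the aggregated `credit_card_balance` table show up as `object` columns once merged into the main table, and are then re-encoded into `_False`, `_True` and `_nan` indicators. The toy example below reproduces this pandas behavior, assuming a left join on `SK_ID_CURR`: clients without a match receive `NaN`, which a NumPy `bool` column cannot hold, so the column is upcast to `object`. The table contents and the column name `CC_STATUS_Active_MIN` are made up for illustration.

```python
import pandas as pd

# Main table: three clients
data = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

# Aggregated table: a bool column, but client 2 has no credit card history
cc_agg = pd.DataFrame(
    {"CC_STATUS_Active_MIN": [True, False]},
    index=pd.Index([1, 3], name="SK_ID_CURR"),
)

merged = data.join(cc_agg, how="left", on="SK_ID_CURR")
print(merged.dtypes)   # the bool column is now object (True / NaN / False)

# Re-encoding the object column yields one indicator per value, plus one for NaN
reencoded = pd.get_dummies(merged, columns=["CC_STATUS_Active_MIN"], dummy_na=True)
print(reencoded.columns.tolist())
```

Casting such columns to a nullable dtype or to `float` before the merge would be an alternative to re-encoding them. That is a design choice this notebook does not discuss.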

# Appendices

The aim is to understand the reference version and improve it technically.

We therefore proceed in small steps, starting with the loading of the table.

## Comparing the versions

A demonstration (which also served for testing and debugging) of the operations used to compare the original and modified versions.

### Alignment

The dataframes being compared may be equal up to a permutation of their rows or columns.
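What such a realignment involves can be sketched as follows: reorder the rows of the second frame to follow the key order of the first, then reorder its columns and reuse the first frame's row labels. `align_on_key_sketch` is a hypothetical helper written for this illustration. It is not the `pepper` implementation, and it assumes the key is unique and that both frames hold the same keys and column names.

```python
import pandas as pd


def align_on_key_sketch(key: str, df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Reorder df2's rows and columns to match df1 (illustrative only)."""
    # A left merge keeps the key order of the left frame
    aligned = df1[[key]].merge(df2, on=key, how="left", validate="one_to_one")
    aligned = aligned[list(df1.columns)]   # same column order as df1
    aligned.index = df1.index              # same row labels as df1
    return aligned


df1 = pd.DataFrame({"SK_ID_CURR": [3, 1, 2], "x": [30.0, 10.0, 20.0]})
# Same content, but rows sorted by key and columns swapped
df2 = pd.DataFrame({"x": [10.0, 20.0, 30.0], "SK_ID_CURR": [1, 2, 3]})

aligned = align_on_key_sketch("SK_ID_CURR", df1, df2)
print(aligned)
```

Once aligned, the two frames can be compared cell by cell without the row or column order producing false differences.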

L'utilitaire suivante permet de réaligner nos dataframes souvents triés et réindexés avec les versions brutes telles qu'elles sont chargées par le kernel d'origine.
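
As an illustration, here is a minimal sketch of what such an alignment could look like, assuming both frames hold the same set of unique keys. `align_on_key` is a hypothetical stand-in and not the actual `pepper.pd_utils.align_df2_on_df1` implementation; only the argument order mirrors the call used in the next cell.

```python
import pandas as pd


def align_on_key(key, df1, df2):
    """Hypothetical helper: reorder the rows and columns of df2 to match df1.

    Assumes `key` is a column of both frames and holds the same unique values.
    """
    aligned = (
        df2.set_index(key)
        .loc[df1[key].to_numpy()]  # same row order as df1
        .rename_axis(key)
        .reset_index()
    )
    aligned.index = df1.index  # same index labels as df1
    return aligned[list(df1.columns)]  # same column order as df1


# Toy data: df2 holds the same rows as df1, sorted differently and reindexed
df1 = pd.DataFrame({"id": [3, 1, 2], "x": [30, 10, 20]})
df2 = pd.DataFrame({"x": [10, 20, 30], "id": [1, 2, 3]}, index=[10, 11, 12])

aligned = align_on_key("id", df1, df2)
```

Once aligned this way, the two frames can be compared cell by cell.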

In [None]:
from pepper.pd_utils import align_df2_on_df1
app_v1 = load_application_v1()
app_v2 = align_df2_on_df1("SK_ID_CURR", app_v1, load_application_v2())
display(app_v1)
display(app_v2)

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\application_train.pqt
load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\application_test.pqt


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1.0,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0.0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0.0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0.0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0.0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
356250,456221,,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
356251,456222,,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,...,0,0,0,0,,,,,,
356252,456223,,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,...,0,0,0,0,0.0,0.0,0.0,0.0,3.0,1.0
356253,456224,,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0


application,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1.0,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0.0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0.0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0.0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0.0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
356250,456221,,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
356251,456222,,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,...,0,0,0,0,,,,,,
356252,456223,,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,...,0,0,0,0,0.0,0.0,0.0,0.0,3.0,1.0
356253,456224,,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0


### Comparison mask

`pepper.pd_utils` provides two functions, `df_eq` and `df_neq`, that avoid the `x == x` pitfall, which returns `False` when `x` is NA. Their results are then reduced with `all` and `any`.
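
To make the pitfall concrete, here is a small sketch on made-up data. `na_safe_neq` is a hypothetical stand-in for `df_neq` (its real implementation is not shown in this notebook); it assumes two identically labelled frames.

```python
import numpy as np
import pandas as pd


def na_safe_neq(df1, df2):
    """Hypothetical stand-in for `df_neq`: True where cells differ, NA vs NA counted as equal."""
    both_na = df1.isna() & df2.isna()
    return (df1 != df2) & ~both_na


left = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": pd.Series(["x", None, "z"], dtype=object),
})
right = pd.DataFrame({
    "a": [1.0, np.nan, 4.0],
    "b": pd.Series(["x", None, "z"], dtype=object),
})

naive = left != right            # flags the NaN/NaN pair in column "a" as a difference
safe = na_safe_neq(left, right)  # only flags the genuine 3.0 vs 4.0 difference
n_diffs = int(safe.sum().sum())
```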

In [None]:
from pepper.pd_utils import df_neq
is_diff = df_neq(app_v1, app_v2)
print("n_diffs:", is_diff.sum().sum())
print("n_diffs by cols:\n", is_diff.sum(), sep="")
print("n_diffs by rows:\n", is_diff.sum(axis=1), sep="")
display(is_diff.any())
display(is_diff.any(axis=1))

n_diffs: 0
n_diffs by cols:
SK_ID_CURR                    0
TARGET                        0
NAME_CONTRACT_TYPE            0
CODE_GENDER                   0
FLAG_OWN_CAR                  0
                             ..
AMT_REQ_CREDIT_BUREAU_DAY     0
AMT_REQ_CREDIT_BUREAU_WEEK    0
AMT_REQ_CREDIT_BUREAU_MON     0
AMT_REQ_CREDIT_BUREAU_QRT     0
AMT_REQ_CREDIT_BUREAU_YEAR    0
Length: 122, dtype: int64
n_diffs by rows:
0         0
1         0
2         0
3         0
4         0
         ..
356250    0
356251    0
356252    0
356253    0
356254    0
Length: 356255, dtype: int64


SK_ID_CURR                    False
TARGET                        False
NAME_CONTRACT_TYPE            False
CODE_GENDER                   False
FLAG_OWN_CAR                  False
                              ...  
AMT_REQ_CREDIT_BUREAU_DAY     False
AMT_REQ_CREDIT_BUREAU_WEEK    False
AMT_REQ_CREDIT_BUREAU_MON     False
AMT_REQ_CREDIT_BUREAU_QRT     False
AMT_REQ_CREDIT_BUREAU_YEAR    False
Length: 122, dtype: bool

0         False
1         False
2         False
3         False
4         False
          ...  
356250    False
356251    False
356252    False
356253    False
356254    False
Length: 356255, dtype: bool

### Detecting local `dtype` variations

In [None]:
from pepper.pd_utils import check_dtypes_alignment
check_dtypes_alignment(app_v1, app_v2)

dtypes are aligned


### Computing the difference (the distance) between coefficients

In [None]:
from pepper.pd_utils import safe_diff_series
display(safe_diff_series(app_v1.SK_ID_CURR, app_v2.SK_ID_CURR).sum())
display(safe_diff_series(app_v1.TARGET, app_v2.TARGET).sum())

0

0.0

In [None]:
display(app_v1[is_diff.TARGET].TARGET)
display(app_v2[is_diff.TARGET].TARGET)

Series([], Name: TARGET, dtype: float64)

Series([], Name: TARGET, dtype: float64)

In [None]:
from pepper.pd_utils import safe_diff_dataframe
diff = safe_diff_dataframe(app_v1, app_v2)

Finishing touch: when differences exist, this makes it easy to filter ranges of the difference matrix.

In [None]:
display(diff.loc[is_diff.any(axis=1), is_diff.any()])

Unnamed: 0,TARGET,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,OWN_CAR_AGE,OCCUPATION_TYPE,CNT_FAM_MEMBERS,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,...,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,0.0,0.0,0.0,,,,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,,,,0.0,0.0,0.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,,0.0,,0.0,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,,,,0.0,,0.0,,...,0.0,0.0,0.0,0.0,,,,,,
4,0.0,0.0,0.0,,,,0.0,,0.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
356250,,0.0,0.0,,,,0.0,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
356251,,0.0,0.0,,,,0.0,,0.0,,...,0.0,0.0,0.0,0.0,,,,,,
356252,,0.0,0.0,,0.0,,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
356253,,0.0,0.0,,,,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## One-hot encoding

The `application_{train|test}` tables typically contain many categorical variables.

The criterion used by the reference version, `dtype == object`, is debatable.

It would be worthwhile to one-hot encode, for example, the binary variables whose `dtype` is `int`.

Many integer variables with a small number of modalities would also benefit from being treated as categories, including when they represent cardinals (number of elevators, number of children, for example), and not only when they are ordinals (hour of the day, for example).

Our improvement therefore focuses on selecting the variables treated as categorical, with a default selection that is functionally equivalent to the reference kernel.
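
A sketch of such a selector follows, under stated assumptions. `select_categorical_vars` is a hypothetical function: it borrows the `dtype` and `max_modalities` parameter names from the `get_categorical_vars` calls shown further down, but it is not that function's actual code. By default it reproduces the kernel's `dtype == 'object'` criterion.

```python
import pandas as pd


def select_categorical_vars(df, dtype="object", max_modalities=None):
    """Hypothetical selector of the columns to treat as categorical.

    dtype: keep only columns of this dtype (None disables the filter).
    max_modalities: keep only columns with at most this many distinct
        non-NA values (None disables the filter).
    """
    selected = []
    for col in df.columns:
        if dtype is not None and df[col].dtype != dtype:
            continue
        if max_modalities is not None and df[col].nunique() > max_modalities:
            continue
        selected.append(col)
    return selected


# Made-up toy table
toy = pd.DataFrame({
    "GENDER": pd.Series(["M", "F", "XNA"], dtype=object),
    "FLAG": [0, 1, 0],
    "OWN_CAR": pd.Series(["Y", "N", "Y"], dtype=object),
    "AMOUNT": [1.5, 2.5, 3.5],
})

kernel_like = select_categorical_vars(toy)                              # object columns only
binary_only = select_categorical_vars(toy, dtype=None, max_modalities=2)  # any dtype, at most 2 values
```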

In [None]:
import pandas as pd

# One-hot encoding for categorical columns with get_dummies
def one_hot_encoder(df, nan_as_category=True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns

### Selecting the categorical variables

In [None]:
from home_credit.load import get_application_train
df = get_application_train()
display(df.head(3))

application_train,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
original_columns = list(df.columns)
categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
display(categorical_columns)

['NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'NAME_TYPE_SUITE',
 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'OCCUPATION_TYPE',
 'WEEKDAY_APPR_PROCESS_START',
 'ORGANIZATION_TYPE',
 'FONDKAPREMONT_MODE',
 'HOUSETYPE_MODE',
 'WALLSMATERIAL_MODE',
 'EMERGENCYSTATE_MODE']

In [None]:
from home_credit.lightgbm_kernel_v2 import get_categorical_vars
display(get_categorical_vars(df))
display(get_categorical_vars(df, dtype=None, max_modalities=2))

['NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'NAME_TYPE_SUITE',
 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'OCCUPATION_TYPE',
 'WEEKDAY_APPR_PROCESS_START',
 'ORGANIZATION_TYPE',
 'FONDKAPREMONT_MODE',
 'HOUSETYPE_MODE',
 'WALLSMATERIAL_MODE',
 'EMERGENCYSTATE_MODE']

['TARGET',
 'NAME_CONTRACT_TYPE',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'FLAG_MOBIL',
 'FLAG_EMP_PHONE',
 'FLAG_WORK_PHONE',
 'FLAG_CONT_MOBILE',
 'FLAG_PHONE',
 'FLAG_EMAIL',
 'REG_REGION_NOT_LIVE_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_CITY_NOT_LIVE_CITY',
 'REG_CITY_NOT_WORK_CITY',
 'LIVE_CITY_NOT_WORK_CITY',
 'EMERGENCYSTATE_MODE',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_8',
 'FLAG_DOCUMENT_9',
 'FLAG_DOCUMENT_10',
 'FLAG_DOCUMENT_11',
 'FLAG_DOCUMENT_12',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_21']

### One-hot encoding with `get_dummies`

See the Pandas 2.0 user guide: https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-dummies

It notably illustrates the combined use with `cut`.

`get_dummies` can produce a dense (default) or sparse array (see https://pandas.pydata.org/docs/reference/api/pandas.arrays.SparseArray.html).

#### How does it work?

In [None]:
import pandas as pd
a = df.CODE_GENDER
b = pd.get_dummies(a)
c = a.str.get_dummies()
display(pd.concat([a, b, c], axis=1).head(3))

Unnamed: 0,CODE_GENDER,F,M,XNA,F.1,M.1,XNA.1
0,M,False,True,False,0,1,0
1,F,True,False,False,1,0,0
2,M,False,True,False,0,1,0


In [None]:
x = df[["NAME_CONTRACT_TYPE", "CODE_GENDER"]]
y = pd.get_dummies(x, dummy_na=True)
display(pd.concat([x, y], axis=1).head(3))

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,NAME_CONTRACT_TYPE_Cash loans,NAME_CONTRACT_TYPE_Revolving loans,NAME_CONTRACT_TYPE_nan,CODE_GENDER_F,CODE_GENDER_M,CODE_GENDER_XNA,CODE_GENDER_nan
0,Cash loans,M,True,False,False,False,True,False,False
1,Cash loans,F,True,False,False,True,False,False,False
2,Revolving loans,M,False,True,False,False,True,False,False


#### The case of NAs

In [None]:
# Two problems with NAs:
# 1/ their special encoding (here `XNA`) => convert to real NAs
# 2/ if dummy_na=True but there are none => a useless column
# The solution: generate the dummies, then drop the constant columns
display(a.value_counts(dropna=False))
display(y.CODE_GENDER_XNA.value_counts(dropna=False))
display(y.CODE_GENDER_nan.value_counts(dropna=False))

CODE_GENDER
F      202448
M      105059
XNA         4
Name: count, dtype: int64

CODE_GENDER_XNA
False    307507
True          4
Name: count, dtype: int64

CODE_GENDER_nan
False    307511
Name: count, dtype: int64

#### Dropping constant (non-NA) columns

In [None]:
z = y.apply(pd.Series.nunique)
# This general version is safer,
# but the one above is sufficient in this context
# z = y.apply(lambda s: pd.Series.nunique(s, dropna=False))
print(list(z[z == 1].index))
truc = y.drop(columns=z[z == 1].index)
display(truc)

['NAME_CONTRACT_TYPE_nan', 'CODE_GENDER_nan']


Unnamed: 0,NAME_CONTRACT_TYPE_Cash loans,NAME_CONTRACT_TYPE_Revolving loans,CODE_GENDER_F,CODE_GENDER_M,CODE_GENDER_XNA
0,True,False,False,True,False
1,True,False,True,False,False
2,False,True,False,True,False
3,True,False,True,False,False
4,True,False,False,True,False
...,...,...,...,...,...
307506,True,False,False,True,False
307507,True,False,True,False,False
307508,True,False,True,False,False
307509,True,False,True,False,False


#### Benefit of `drop_first`

Let us not forget that nearly 20% of the variables are binary.

Do we want to turn them into 40 columns or 80?

In [None]:
# drop_first: relevant, for example, so that a binary variable does not yield two columns
# these two columns would be perfectly anti-correlated, hence correlated
# logically, this is one of the first things a dimensionality reduction eliminates
# to be verified, keeping it as an option defaulting to True (the opposite of the pandas default)
x = df[["NAME_CONTRACT_TYPE", "CODE_GENDER"]]
y = pd.get_dummies(x, dummy_na=True, drop_first=True)
display(pd.concat([x, y], axis=1).head(3))
m = y.memory_usage(deep=True)
print(m)
print(m.sum())

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,NAME_CONTRACT_TYPE_Revolving loans,NAME_CONTRACT_TYPE_nan,CODE_GENDER_M,CODE_GENDER_XNA,CODE_GENDER_nan
0,Cash loans,M,False,False,True,False,False
1,Cash loans,F,False,False,False,False,False
2,Revolving loans,M,True,False,True,False,False


Index                                    128
NAME_CONTRACT_TYPE_Revolving loans    307511
NAME_CONTRACT_TYPE_nan                307511
CODE_GENDER_M                         307511
CODE_GENDER_XNA                       307511
CODE_GENDER_nan                       307511
dtype: int64
1537683


#### The memory footprint question

Should we use bools, int8 integers, a sparse matrix, ...?
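
The sparse figures reported further down in this section are consistent with a cost of 5 bytes per non-zero `int8` entry (for instance 105059 × 5 = 525295 for `CODE_GENDER_M`), presumably one byte for the value plus a 4-byte position, against 1 byte per row in dense form. Under that reading, a sparse column only pays off when fewer than roughly one row in five is non-zero. The sketch below, on made-up data, illustrates the trade-off with one very rare and one very frequent modality.

```python
import numpy as np
import pandas as pd

# Made-up column: one dominant modality and one very rare modality
labels = pd.Series(["F"] * 9_990 + ["XNA"] * 10, dtype=object, name="CODE_GENDER")

dense = pd.get_dummies(labels, dtype=np.int8)
sparse = pd.get_dummies(labels, dtype=np.int8, sparse=True)

# Per-column footprint in bytes, index excluded
dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)

rare_is_smaller_when_sparse = sparse_bytes["XNA"] < dense_bytes["XNA"]
frequent_is_larger_when_sparse = sparse_bytes["F"] > dense_bytes["F"]
```

A per-column choice (sparse for rare modalities, dense otherwise) might therefore be worth testing, rather than a global `sparse=True`.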

In [None]:
import numpy as np
x = df[["NAME_CONTRACT_TYPE", "CODE_GENDER"]]
y = pd.get_dummies(x, dummy_na=True, drop_first=True, dtype=np.int8)
display(pd.concat([x, y], axis=1).head(3))
m = y.memory_usage(deep=True)
print(m)
print(m.sum())

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,NAME_CONTRACT_TYPE_Revolving loans,NAME_CONTRACT_TYPE_nan,CODE_GENDER_M,CODE_GENDER_XNA,CODE_GENDER_nan
0,Cash loans,M,0,0,1,0,0
1,Cash loans,F,0,0,0,0,0
2,Revolving loans,M,1,0,1,0,0


Index                                    128
NAME_CONTRACT_TYPE_Revolving loans    307511
NAME_CONTRACT_TYPE_nan                307511
CODE_GENDER_M                         307511
CODE_GENDER_XNA                       307511
CODE_GENDER_nan                       307511
dtype: int64
1537683


In [None]:
import numpy as np
x = df[["NAME_CONTRACT_TYPE", "CODE_GENDER"]]
y = pd.get_dummies(x, dummy_na=True, drop_first=True, dtype=np.int8, sparse=True)
display(pd.concat([x, y], axis=1).head(3))
m = y.memory_usage(deep=True)
print(m)
print(m.sum())

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,NAME_CONTRACT_TYPE_Revolving loans,NAME_CONTRACT_TYPE_nan,CODE_GENDER_M,CODE_GENDER_XNA,CODE_GENDER_nan
0,Cash loans,M,0,0,1,0,0
1,Cash loans,F,0,0,0,0,0
2,Revolving loans,M,1,0,1,0,0


Index                                    128
NAME_CONTRACT_TYPE_Revolving loans    146395
NAME_CONTRACT_TYPE_nan                     0
CODE_GENDER_M                         525295
CODE_GENDER_XNA                           20
CODE_GENDER_nan                            0
dtype: int64
671838


#### Why the ill-fitting name `get_dummies`?

Exchange with ChatGPT:

*I was wondering why the Pandas function get_dummies, which performs one-hot encoding, is named the way it is. In other words, what history justifies a name so far removed from what it does?*

Indeed, the name "get_dummies" of the Pandas function may seem odd and far removed from the one-hot encoding it performs. The reason for this name goes back to the origin of the feature, which was first implemented in the R programming language. In R, the equivalent function is called "dummy variable", hence the term "dummies" in the name of the Pandas function.

The term "dummy variable" was introduced in linear regression theory to represent categorical variables that cannot be fed directly into a linear regression model. Instead, these categorical variables are replaced by a set of binary variables indicating the presence or absence of each category. These binary variables are often called "dummy variables", hence the name of the function in R and, by extension, of the Pandas function.

In summary, the name "get_dummies" of the Pandas function is inherited from the name of the equivalent function in R, which itself refers to the binary variables that represent categories in a linear regression model.

#### Integrated version of `one_hot_encode_all_cats`:

**NB** It remains, upstream in the first preprocessing steps, to identify the NAs that HomeCredit encodes in its own way and to convert them into real NAs.
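
A minimal sketch of that upstream step follows, assuming that the string `"XNA"` (seen in `CODE_GENDER` above) is the only home-made NA marker; other columns may well use other markers, which remain to be identified. The toy data is made up.

```python
import numpy as np
import pandas as pd

# Made-up sample with a home-made NA marker
toy = pd.DataFrame({
    "CODE_GENDER": pd.Series(["M", "F", "XNA", "F"], dtype=object),
})

# Assumption: "XNA" is the marker to convert into a real NA
cleaned = toy.replace("XNA", np.nan)

# With real NAs, `dummy_na=True` now carries the information,
# and no separate `CODE_GENDER_XNA` column is produced
dummies = pd.get_dummies(cleaned, dummy_na=True, dtype=np.int8)
dummy_columns = list(dummies.columns)
```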

In [None]:
def one_hot_encode_all_cats(
    df,
    columns=None,
    dummy_na=True,
    drop_first=True,
    dtype=np.int8,
    sparse=True
):
    ohe_df = pd.get_dummies(
        df, columns=columns, dummy_na=dummy_na,
        drop_first=drop_first, dtype=dtype, sparse=sparse
    )
    # Drop the constant columns that `dummy_na` may have produced
    const_cols = ohe_df.apply(pd.Series.nunique)
    const_cols = const_cols[const_cols == 1]
    ohe_df.drop(columns=const_cols.index, inplace=True)
    return ohe_df


ohe_df = one_hot_encode_all_cats(df[["NAME_CONTRACT_TYPE", "CODE_GENDER"]])
display(ohe_df)


Unnamed: 0,NAME_CONTRACT_TYPE_Revolving loans,CODE_GENDER_M,CODE_GENDER_XNA
0,0,1,0
1,0,0,0
2,1,1,0
3,0,0,0
4,0,1,0
...,...,...,...
307506,0,1,0
307507,0,0,0
307508,0,0,0
307509,0,0,0


## Derived features: the value of `eval`

One goal of feature engineering is to produce derived features.

In this context, `eval` offers a gain in code readability and in execution performance. On a single expression, the timings below show no speed-up; the gain appears in the combined version further down.

[**User documentation on using `pandas eval` to improve performance**](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#enhancingperf-eval).

### How does it work?

```python
    ins['PAYMENT_PERC'] = ins['AMT_PAYMENT'] / ins['AMT_INSTALMENT']
    ins['PAYMENT_DIFF'] = ins['AMT_INSTALMENT'] - ins['AMT_PAYMENT']
```

In [None]:
from home_credit.load import get_installments_payments
df = get_installments_payments()
display(df)

In [None]:
import time

#amt_payment = df.AMT_PAYMENT
#amt_installment = df.AMT_INSTALMENT
t = -time.time()
df['PAYMENT_PERC'] = df['AMT_PAYMENT'] / df['AMT_INSTALMENT']
t += time.time()
print(t)

0.056481361389160156


In [None]:
import time
import pandas as pd

t = -time.time()
#df['PAYMENT_PERC_2'] = df['AMT_PAYMENT'] / df['AMT_INSTALMENT']
df.eval("PAYMENT_PERC_2 = AMT_PAYMENT / AMT_INSTALMENT", inplace=True, engine="python")
t += time.time()
print(t)

0.16107678413391113


In [None]:
import time
import pandas as pd

t = -time.time()
#df['PAYMENT_PERC_2'] = df['AMT_PAYMENT'] / df['AMT_INSTALMENT']
df.eval("PAYMENT_PERC_2 = AMT_PAYMENT / AMT_INSTALMENT", inplace=True, engine="numexpr")
t += time.time()
print(t)

0.13232922554016113


### Pandas `assign`: not a good alternative

In [None]:
import time
import pandas as pd

t = -time.time()
#df['PAYMENT_PERC_2'] = df['AMT_PAYMENT'] / df['AMT_INSTALMENT']
df.assign(PAYMENT_PERC_3=df.AMT_PAYMENT / df.AMT_INSTALMENT)
t += time.time()
print(t)

0.8159232139587402


### All at once

#### As in the kernel

In [None]:
from home_credit.load import _load_installments_payments
import time

df_1 = _load_installments_payments()

t = -time.time()
# Percentage and difference paid in each installment (amount paid and installment value)
df_1['PAYMENT_PERC'] = df_1['AMT_PAYMENT'] / df_1['AMT_INSTALMENT']
df_1['PAYMENT_DIFF'] = df_1['AMT_INSTALMENT'] - df_1['AMT_PAYMENT']
# Days past due and days before due (no negative values)
df_1['DPD'] = df_1['DAYS_ENTRY_PAYMENT'] - df_1['DAYS_INSTALMENT']
df_1['DBD'] = df_1['DAYS_INSTALMENT'] - df_1['DAYS_ENTRY_PAYMENT']
df_1['DPD'] = df_1['DPD'].apply(lambda x: x if x > 0 else 0)
df_1['DBD'] = df_1['DBD'].apply(lambda x: x if x > 0 else 0)
t += time.time()
print(t)

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\installments_payments.pqt
7.973264455795288


#### With `eval`

In [None]:
from home_credit.load import _load_installments_payments
from numpy import where
import time

df_2 = _load_installments_payments()

t = -time.time()
# ✔ Gain in readability
# ✔ Gain in performance: one order of magnitude
df_2.eval(
    """
    PAYMENT_PERC = AMT_PAYMENT / AMT_INSTALMENT
    PAYMENT_DIFF = AMT_INSTALMENT - AMT_PAYMENT
    DPD = DAYS_ENTRY_PAYMENT - DAYS_INSTALMENT
    DBD = DAYS_INSTALMENT - DAYS_ENTRY_PAYMENT
    DPD = @where(DPD > 0, DPD, 0)
    DBD = @where(DBD > 0, DBD, 0)
    """,
    inplace=True, engine="numexpr"
)
t += time.time()
print(t)

load C:\Users\franc\Projects\pepper_credit_scoring_tool\dataset\pqt\installments_payments.pqt
0.6242649555206299
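
Part of the speed-up measured above may come from replacing the row-wise `apply(lambda ...)` with a vectorised `where`, independently of `eval`. The sketch below, on made-up values, checks which vectorised forms reproduce the kernel's clamp, including on a missing value. It does not measure timings.

```python
import numpy as np
import pandas as pd

# Made-up differences (e.g. DAYS_ENTRY_PAYMENT - DAYS_INSTALMENT), including a missing one
late = pd.Series([-3.0, 0.0, 5.0, np.nan])

# Kernel form: row-wise Python lambda (NaN > 0 is False, so NaN becomes 0)
dpd_apply = late.apply(lambda x: x if x > 0 else 0)

# Vectorised form used with `eval` above: same treatment of NaN
dpd_where = pd.Series(np.where(late > 0, late, 0), index=late.index)

# `clip` is also vectorised, but it keeps NaN instead of mapping it to 0
dpd_clip = late.clip(lower=0)
```

`np.where` is thus a drop-in replacement for the lambda, whereas `clip` would change the result wherever the difference is missing.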


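#### Comparing the dataframes

In [None]:
# display(df_1)
# display(df_2)
# NB: the built-in `all(df_1 == df_2)` iterates over the column labels and is therefore always True;
#     `DataFrame.equals` compares the values and treats NAs at the same positions as equal
print("same result:", df_1.equals(df_2))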

same result: True
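
A note on this kind of check: Python's built-in `all` applied to a DataFrame iterates over its column labels, not its values, so it cannot detect a difference. The sketch below, on made-up frames, contrasts it with an element-wise reduction and with `DataFrame.equals`.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 4.0]})
right = pd.DataFrame({"a": [1.0, 99.0], "b": [np.nan, 4.0]})

# Built-in all: iterates over the labels "a" and "b", both truthy
builtin_all = all(left == right)

# Element-wise reduction: detects the 2.0 vs 99.0 difference
elementwise = bool((left == right).all().all())

# NaN != NaN, so a frame is not even element-wise equal to its own copy...
self_elementwise = bool((left == left.copy()).all().all())

# ...whereas `equals` treats NaNs at the same positions as equal
self_equals = left.equals(left.copy())
differs = left.equals(right)
```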


## `del` and `gc.collect`

**TODO** give an airtight demonstration of the claims below:

In most functions, the working dataframe that is loaded and then modified is explicitly removed from memory with `del`, followed by an explicit call to the *garbage collector* via `gc.collect()`.

One can see here the mark of a Java programmer turned Python developer.

However, these calls are unnecessary:
1. when the function returns, the local variable is automatically released (implicit `del`).
2. in CPython, an object is freed as soon as its last reference disappears (reference counting); the garbage collector only handles reference cycles and runs automatically, with no need for an explicit call.

We therefore decide not to keep these statements.
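
As a first step towards that demonstration, here is a small sketch that is specific to CPython (other interpreters may behave differently). `Payload` is a made-up stand-in for a large working dataframe. A weak reference is used to observe whether the object is still alive after the function returns, with the cyclic garbage collector switched off.

```python
import gc
import weakref


class Payload:
    """Made-up stand-in for a large working dataframe."""

    def __init__(self, n_bytes):
        self.data = bytearray(n_bytes)


observers = []


def preprocess():
    work = Payload(10_000_000)
    # A weak reference does not keep `work` alive; it only lets us observe it
    observers.append(weakref.ref(work))
    return len(work.data)  # no `del work`, no `gc.collect()`


gc.disable()  # rule out any help from the cyclic garbage collector
try:
    size = preprocess()
    # The weak reference returns None once the object has been freed
    freed_on_return = observers[0]() is None
finally:
    gc.enable()
```

This only covers the case where nothing else references the dataframe and no reference cycle is involved.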

## Groupby



### xxx Old, to recycle: reverse engineering the aggregation of `bureau`

In [None]:
from home_credit.load import get_bureau, get_bureau_balance
# Named `bureau` (not `b`) because the following cells use `bureau`
bureau = get_bureau()
bb = get_bureau_balance()

display(bureau.head(3))
display(bb.head(3))

bureau,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.00,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.00,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.50,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.00,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.00,,,0.0,Consumer credit,-21,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1716423,259355,5057750,Active,currency 1,-44,0,-30.0,,0.0,0,11250.00,11250.0,0.0,0.0,Microloan,-19,
1716424,100044,5057754,Closed,currency 1,-2648,0,-2433.0,-2493.0,5476.5,0,38130.84,0.0,0.0,0.0,Consumer credit,-2493,
1716425,100044,5057762,Closed,currency 1,-1809,0,-1628.0,-970.0,,0,15570.00,,,0.0,Consumer credit,-967,
1716426,246829,5057770,Closed,currency 1,-1878,0,-1513.0,-1513.0,,0,36000.00,0.0,0.0,0.0,Consumer credit,-1508,


bureau_balance,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C
...,...,...,...
27299920,5041336,-47,X
27299921,5041336,-48,X
27299922,5041336,-49,X
27299923,5041336,-50,X


In [None]:
from home_credit.lightgbm_kernel import one_hot_encoder
bb, bb_cat = one_hot_encoder(bb, True)
bureau, bureau_cat = one_hot_encoder(bureau, True)

In [None]:
display(bb_cat)
display(bureau_cat)

['STATUS_0',
 'STATUS_1',
 'STATUS_2',
 'STATUS_3',
 'STATUS_4',
 'STATUS_5',
 'STATUS_C',
 'STATUS_X',
 'STATUS_nan']

['CREDIT_ACTIVE_Active',
 'CREDIT_ACTIVE_Bad debt',
 'CREDIT_ACTIVE_Closed',
 'CREDIT_ACTIVE_Sold',
 'CREDIT_ACTIVE_nan',
 'CREDIT_CURRENCY_currency 1',
 'CREDIT_CURRENCY_currency 2',
 'CREDIT_CURRENCY_currency 3',
 'CREDIT_CURRENCY_currency 4',
 'CREDIT_CURRENCY_nan',
 'CREDIT_TYPE_Another type of loan',
 'CREDIT_TYPE_Car loan',
 'CREDIT_TYPE_Cash loan (non-earmarked)',
 'CREDIT_TYPE_Consumer credit',
 'CREDIT_TYPE_Credit card',
 'CREDIT_TYPE_Interbank credit',
 'CREDIT_TYPE_Loan for business development',
 'CREDIT_TYPE_Loan for purchase of shares (margin lending)',
 'CREDIT_TYPE_Loan for the purchase of equipment',
 'CREDIT_TYPE_Loan for working capital replenishment',
 'CREDIT_TYPE_Microloan',
 'CREDIT_TYPE_Mobile operator loan',
 'CREDIT_TYPE_Mortgage',
 'CREDIT_TYPE_Real estate loan',
 'CREDIT_TYPE_Unknown type of loan',
 'CREDIT_TYPE_nan']

In [None]:
# Bureau balance: Perform aggregations and merge with bureau.csv
bb_aggregations = {'MONTHS_BALANCE': ['min', 'max', 'size']}
for col in bb_cat:
    bb_aggregations[col] = ['mean']

The aggregation of the `bureau_balance` rows:
* mean of each STATUS category
* min, max, size of MONTHS_BALANCE

I think we can do better, cf. my pivot...

Keep both in order to compare 1/ the correlations 2/ the final performance.
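
To see what these aggregates mean, here is a sketch on a made-up miniature of `bureau_balance` (two credits, six monthly rows). It follows the same steps as the kernel cells around it: the mean of a one-hot column is simply the share of months spent in that status.

```python
import pandas as pd

# Made-up miniature of bureau_balance
toy_bb = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, -3, 0, -1],
    "STATUS": pd.Series(["C", "C", "0", "X", "0", "0"], dtype=object),
})

encoded = pd.get_dummies(toy_bb, columns=["STATUS"])
status_cols = [c for c in encoded.columns if c.startswith("STATUS_")]

aggregations = {"MONTHS_BALANCE": ["min", "max", "size"]}
aggregations.update({col: ["mean"] for col in status_cols})

agg = encoded.groupby("SK_ID_BUREAU").agg(aggregations)
# Flatten the two-level column index, e.g. ("STATUS_C", "mean") -> "STATUS_C_MEAN"
agg.columns = pd.Index([f"{name}_{func.upper()}" for name, func in agg.columns])

share_closed_credit_1 = agg.loc[1, "STATUS_C_MEAN"]  # 2 months out of 4
```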

In [None]:
display(bb_aggregations)

{'MONTHS_BALANCE': ['min', 'max', 'size'],
 'STATUS_0': ['mean'],
 'STATUS_1': ['mean'],
 'STATUS_2': ['mean'],
 'STATUS_3': ['mean'],
 'STATUS_4': ['mean'],
 'STATUS_5': ['mean'],
 'STATUS_C': ['mean'],
 'STATUS_X': ['mean'],
 'STATUS_nan': ['mean']}

In [None]:
import pandas as pd
bb_agg = bb.groupby('SK_ID_BUREAU').agg(bb_aggregations)
# Flatten the column multi-index produced by the groupby aggregation
bb_agg.columns = pd.Index([e[0] + "_" + e[1].upper() for e in bb_agg.columns.tolist()])

In [None]:
display(bb_agg)

Unnamed: 0_level_0,MONTHS_BALANCE_MIN,MONTHS_BALANCE_MAX,MONTHS_BALANCE_SIZE,STATUS_0_MEAN,STATUS_1_MEAN,STATUS_2_MEAN,STATUS_3_MEAN,STATUS_4_MEAN,STATUS_5_MEAN,STATUS_C_MEAN,STATUS_X_MEAN,STATUS_nan_MEAN
SK_ID_BUREAU,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
5001709,-96,0,97,0.000000,0.000000,0.0,0.0,0.0,0.0,0.886598,0.113402,0.0
5001710,-82,0,83,0.060241,0.000000,0.0,0.0,0.0,0.0,0.578313,0.361446,0.0
5001711,-3,0,4,0.750000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.250000,0.0
5001712,-18,0,19,0.526316,0.000000,0.0,0.0,0.0,0.0,0.473684,0.000000,0.0
5001713,-21,0,22,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,1.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
6842884,-47,0,48,0.187500,0.000000,0.0,0.0,0.0,0.0,0.416667,0.395833,0.0
6842885,-23,0,24,0.500000,0.000000,0.0,0.0,0.0,0.5,0.000000,0.000000,0.0
6842886,-32,0,33,0.242424,0.000000,0.0,0.0,0.0,0.0,0.757576,0.000000,0.0
6842887,-36,0,37,0.162162,0.000000,0.0,0.0,0.0,0.0,0.837838,0.000000,0.0


Wouldn't a simple concat be more suitable here? To be tested.

In [None]:
bureau = bureau.join(bb_agg, how='left', on='SK_ID_BUREAU')
bureau.drop(['SK_ID_BUREAU'], axis=1, inplace= True)

In [None]:
display(bureau.head(3))

Unnamed: 0,SK_ID_CURR,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY,CREDIT_ACTIVE_Active,CREDIT_ACTIVE_Bad debt,CREDIT_ACTIVE_Closed,CREDIT_ACTIVE_Sold,CREDIT_ACTIVE_nan,CREDIT_CURRENCY_currency 1,CREDIT_CURRENCY_currency 2,CREDIT_CURRENCY_currency 3,CREDIT_CURRENCY_currency 4,CREDIT_CURRENCY_nan,CREDIT_TYPE_Another type of loan,CREDIT_TYPE_Car loan,CREDIT_TYPE_Cash loan (non-earmarked),CREDIT_TYPE_Consumer credit,CREDIT_TYPE_Credit card,CREDIT_TYPE_Interbank credit,CREDIT_TYPE_Loan for business development,CREDIT_TYPE_Loan for purchase of shares (margin lending),CREDIT_TYPE_Loan for the purchase of equipment,CREDIT_TYPE_Loan for working capital replenishment,CREDIT_TYPE_Microloan,CREDIT_TYPE_Mobile operator loan,CREDIT_TYPE_Mortgage,CREDIT_TYPE_Real estate loan,CREDIT_TYPE_Unknown type of loan,CREDIT_TYPE_nan,MONTHS_BALANCE_MIN,MONTHS_BALANCE_MAX,MONTHS_BALANCE_SIZE,STATUS_0_MEAN,STATUS_1_MEAN,STATUS_2_MEAN,STATUS_3_MEAN,STATUS_4_MEAN,STATUS_5_MEAN,STATUS_C_MEAN,STATUS_X_MEAN,STATUS_nan_MEAN
0,215354,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,-131,,False,False,True,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,,,,,,,,,,,,
1,215354,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,-20,,True,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,,,,,,,,,,,,
2,215354,-203,0,528.0,,,0,464323.5,,,0.0,-16,,True,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,,,,,,,,,,,,


Why aggregate again on `bureau`: because `SK_ID_CURR` is not a primary key in `bureau`, so there are several rows per application (11 rows on average per application).

There is a lot going on here: the prerequisite is therefore to have completed my exploratory analysis of all the `bureau` columns, in order to understand what each piece of data represents.

Depending on the case, the author picks one or more of the aggregation functions min, max, mean, sum, var: why? Understand the logic and go beyond it if possible.

In [None]:
# Bureau and bureau_balance numeric features
num_aggregations = {
    'DAYS_CREDIT': ['min', 'max', 'mean', 'var'],
    'DAYS_CREDIT_ENDDATE': ['min', 'max', 'mean'],
    'DAYS_CREDIT_UPDATE': ['mean'],
    'CREDIT_DAY_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_MAX_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM_LIMIT': ['mean', 'sum'],
    'AMT_ANNUITY': ['max', 'mean'],
    'CNT_CREDIT_PROLONG': ['sum'],
    'MONTHS_BALANCE_MIN': ['min'],
    'MONTHS_BALANCE_MAX': ['max'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum']
}
# Bureau and bureau_balance categorical features
cat_aggregations = {}
for cat in bureau_cat:
    cat_aggregations[cat] = ['mean']
for cat in bb_cat:
    cat_aggregations[cat + "_MEAN"] = ['mean']

In [None]:
display(cat_aggregations)

{'CREDIT_ACTIVE_Active': ['mean'],
 'CREDIT_ACTIVE_Bad debt': ['mean'],
 'CREDIT_ACTIVE_Closed': ['mean'],
 'CREDIT_ACTIVE_Sold': ['mean'],
 'CREDIT_ACTIVE_nan': ['mean'],
 'CREDIT_CURRENCY_currency 1': ['mean'],
 'CREDIT_CURRENCY_currency 2': ['mean'],
 'CREDIT_CURRENCY_currency 3': ['mean'],
 'CREDIT_CURRENCY_currency 4': ['mean'],
 'CREDIT_CURRENCY_nan': ['mean'],
 'CREDIT_TYPE_Another type of loan': ['mean'],
 'CREDIT_TYPE_Car loan': ['mean'],
 'CREDIT_TYPE_Cash loan (non-earmarked)': ['mean'],
 'CREDIT_TYPE_Consumer credit': ['mean'],
 'CREDIT_TYPE_Credit card': ['mean'],
 'CREDIT_TYPE_Interbank credit': ['mean'],
 'CREDIT_TYPE_Loan for business development': ['mean'],
 'CREDIT_TYPE_Loan for purchase of shares (margin lending)': ['mean'],
 'CREDIT_TYPE_Loan for the purchase of equipment': ['mean'],
 'CREDIT_TYPE_Loan for working capital replenishment': ['mean'],
 'CREDIT_TYPE_Microloan': ['mean'],
 'CREDIT_TYPE_Mobile operator loan': ['mean'],
 'CREDIT_TYPE_Mortgage': ['mean'],
 '

Second level of aggregation, over both the numeric and the categorical variables:

In [None]:
bureau_agg = bureau.groupby('SK_ID_CURR').agg({**num_aggregations, **cat_aggregations})
bureau_agg.columns = pd.Index(['BURO_' + e[0] + "_" + e[1].upper() for e in bureau_agg.columns.tolist()])

In [None]:
display(bureau_agg.head(3))

SK_ID_CURR,BURO_DAYS_CREDIT_MIN,BURO_DAYS_CREDIT_MAX,BURO_DAYS_CREDIT_MEAN,BURO_DAYS_CREDIT_VAR,BURO_DAYS_CREDIT_ENDDATE_MIN,BURO_DAYS_CREDIT_ENDDATE_MAX,BURO_DAYS_CREDIT_ENDDATE_MEAN,BURO_DAYS_CREDIT_UPDATE_MEAN,BURO_CREDIT_DAY_OVERDUE_MAX,BURO_CREDIT_DAY_OVERDUE_MEAN,BURO_AMT_CREDIT_MAX_OVERDUE_MEAN,BURO_AMT_CREDIT_SUM_MAX,BURO_AMT_CREDIT_SUM_MEAN,BURO_AMT_CREDIT_SUM_SUM,BURO_AMT_CREDIT_SUM_DEBT_MAX,BURO_AMT_CREDIT_SUM_DEBT_MEAN,BURO_AMT_CREDIT_SUM_DEBT_SUM,BURO_AMT_CREDIT_SUM_OVERDUE_MEAN,BURO_AMT_CREDIT_SUM_LIMIT_MEAN,BURO_AMT_CREDIT_SUM_LIMIT_SUM,BURO_AMT_ANNUITY_MAX,BURO_AMT_ANNUITY_MEAN,BURO_CNT_CREDIT_PROLONG_SUM,BURO_MONTHS_BALANCE_MIN_MIN,BURO_MONTHS_BALANCE_MAX_MAX,BURO_MONTHS_BALANCE_SIZE_MEAN,BURO_MONTHS_BALANCE_SIZE_SUM,BURO_CREDIT_ACTIVE_Active_MEAN,BURO_CREDIT_ACTIVE_Bad debt_MEAN,BURO_CREDIT_ACTIVE_Closed_MEAN,BURO_CREDIT_ACTIVE_Sold_MEAN,BURO_CREDIT_ACTIVE_nan_MEAN,BURO_CREDIT_CURRENCY_currency 1_MEAN,BURO_CREDIT_CURRENCY_currency 2_MEAN,BURO_CREDIT_CURRENCY_currency 3_MEAN,BURO_CREDIT_CURRENCY_currency 4_MEAN,BURO_CREDIT_CURRENCY_nan_MEAN,BURO_CREDIT_TYPE_Another type of loan_MEAN,BURO_CREDIT_TYPE_Car loan_MEAN,BURO_CREDIT_TYPE_Cash loan (non-earmarked)_MEAN,BURO_CREDIT_TYPE_Consumer credit_MEAN,BURO_CREDIT_TYPE_Credit card_MEAN,BURO_CREDIT_TYPE_Interbank credit_MEAN,BURO_CREDIT_TYPE_Loan for business development_MEAN,BURO_CREDIT_TYPE_Loan for purchase of shares (margin lending)_MEAN,BURO_CREDIT_TYPE_Loan for the purchase of equipment_MEAN,BURO_CREDIT_TYPE_Loan for working capital replenishment_MEAN,BURO_CREDIT_TYPE_Microloan_MEAN,BURO_CREDIT_TYPE_Mobile operator loan_MEAN,BURO_CREDIT_TYPE_Mortgage_MEAN,BURO_CREDIT_TYPE_Real estate loan_MEAN,BURO_CREDIT_TYPE_Unknown type of loan_MEAN,BURO_CREDIT_TYPE_nan_MEAN,BURO_STATUS_0_MEAN_MEAN,BURO_STATUS_1_MEAN_MEAN,BURO_STATUS_2_MEAN_MEAN,BURO_STATUS_3_MEAN_MEAN,BURO_STATUS_4_MEAN_MEAN,BURO_STATUS_5_MEAN_MEAN,BURO_STATUS_C_MEAN_MEAN,BURO_STATUS_X_MEAN_MEAN,BURO_STATUS_nan_MEAN_MEAN
100001,-1572,-49,-735.0,240043.666667,-1329.0,1778.0,82.428571,-93.142857,0,0.0,,378000.0,207623.571429,1453365.0,373239.0,85240.928571,596686.5,0.0,0.0,0.0,10822.5,3545.357143,0,-51.0,0.0,24.571429,172.0,0.428571,0.0,0.571429,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.336651,0.007519,0.0,0.0,0.0,0.0,0.44124,0.21459,0.0
100002,-1437,-103,-874.0,186150.0,-1072.0,780.0,-349.0,-499.875,0,0.0,1681.029,450000.0,108131.945625,865055.565,245781.0,49156.2,245781.0,0.0,7997.14125,31988.565,0.0,0.0,0,-47.0,0.0,13.75,110.0,0.25,0.0,0.75,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.40696,0.255682,0.0,0.0,0.0,0.0,0.175426,0.161932,0.0
100003,-2586,-606,-1400.75,827783.583333,-2434.0,1216.0,-544.5,-816.0,0,0.0,0.0,810000.0,254350.125,1017400.5,0.0,0.0,0.0,0.0,202500.0,810000.0,,,0,,,,0.0,0.25,0.0,0.75,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,
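
To make the column naming concrete, here is a minimal sketch on toy data (the `toy` frame and its values are hypothetical). Aggregating with a dict of lists yields two-level columns `(column, function)`, and the list comprehension flattens them into `BURO_<column>_<FUNCTION>` names, as in the cell above.

```python
import pandas as pd

# Toy frame with one numeric column and one dummy column (made-up values)
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "DAYS_CREDIT": [-100, -300, -50],
    "CREDIT_ACTIVE_Active": [1, 0, 1],
})
aggs = {"DAYS_CREDIT": ["min", "mean"], "CREDIT_ACTIVE_Active": ["mean"]}

toy_agg = toy.groupby("SK_ID_CURR").agg(aggs)
# At this point toy_agg.columns is a MultiIndex of (column, function) tuples
multiindex_columns = toy_agg.columns.tolist()

# Flatten the two levels into a single prefixed name
toy_agg.columns = pd.Index(
    [f"BURO_{col}_{func.upper()}" for col, func in multiindex_columns]
)
flat_columns = toy_agg.columns.tolist()
print(flat_columns)
```

The mean of a dummy column is the share of the application's credits falling in that category. The doubled suffix in names such as `BURO_STATUS_0_MEAN_MEAN` follows from the same rule: the `bureau_balance` columns already carry a `_MEAN` suffix from the first aggregation level, and the second level appends another one.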


But it does not stop there: the kernel adds a specific aggregation, restricted to the numeric variables and to the rows belonging to the two categories Active and Closed of the variable CREDIT_ACTIVE.

At this point, I do not fully grasp what he is doing.

As far as I can tell, this amounts to aggregating along the two axes SK_ID_CURR and CREDIT_ACTIVE, then pivoting CREDIT_ACTIVE into columns, while keeping only the two main categories (Sold and Bad debt are dropped). Yet these two rarer categories look informative to me: Bad debt seems to clearly flag a client who defaulted (there are only 21 such rows), and Sold... the motivation is hard to guess, but in a world of securitization, bad loans tend to be the ones that get sold off.
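
The reading above can be tested on toy data. The sketch below assumes that the kernel filters the rows of each retained status and then groups by SK_ID_CURR (my interpretation, not something shown in the cells above), and checks that this gives the same numbers as grouping on both keys and pivoting. The `toy` frame and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `bureau` (hypothetical values)
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 2, 2],
    "CREDIT_ACTIVE": ["Active", "Closed", "Closed", "Active", "Sold", "Active"],
    "AMT_CREDIT_SUM": [100.0, 200.0, 400.0, 50.0, 70.0, 150.0],
})

# Route A (assumed kernel logic): one filter + groupby per retained status
per_status = {}
for status in ["Active", "Closed"]:
    subset = toy[toy["CREDIT_ACTIVE"] == status]
    per_status[status] = subset.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"].mean()
route_a = pd.DataFrame(per_status)

# Route B: group on both keys, pivot the status into columns, keep two of them
route_b = (
    toy.groupby(["SK_ID_CURR", "CREDIT_ACTIVE"])["AMT_CREDIT_SUM"]
    .mean()
    .unstack("CREDIT_ACTIVE")[["Active", "Closed"]]
)

# Same values, including the NaN for an application with no Closed credit
same = np.allclose(route_a.to_numpy(), route_b.to_numpy(), equal_nan=True)
print(same)
```

With route B, keeping Sold and Bad debt would only mean not restricting the column selection at the end, which makes it easy to test whether these rarer categories carry signal.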

In [None]:
b = get_bureau()
display(b.head(3))
display(b.CREDIT_ACTIVE.value_counts())

bureau,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,


CREDIT_ACTIVE
Closed      1079273
Active       630607
Sold           6527
Bad debt         21
Name: count, dtype: int64