# **Nettoyage des tables**

- Conversion de modalités ou de valeurs pour mise en cohérence (par exemple `{"Y": 1, "N": 0}`).
    - En particulier, conversion des NA codés (et donc techniquement cachés) en NA natifs
- Correction des valeurs aberrantes bien identifiées (par exemple la valeur `365243`)
- Réduction d'empreinte mémoire en castant dans le type le plus adapté.
- Regroupement et permutation de certaines colonnes (par exemple )


L'essentiel de ce travail est appuyé sur et justifié par les résultats de l'analyse exploratoire (voir les cahiers du dossier **`notebooks/eda/`**)

# Nettoyage de **`installments_payments`**

Les opérations de nettoyage ci-après nous permettent de diminuer l'empreinte mémoire de 843.4 MB à 363.2 MB.

Les `SK_ID` (clés primaires) sont dans les quantités suivantes :
- 356 255 pour `SK_ID_CURR` (intervalle [100 001, 456 255])
- 1 670 214 pour `SK_ID_PREV` (intervalle [1 000 001, 2 845 382])
- 1 716 428 pour `SK_ID_BUREAU` (intervalle [5 000 000, 6 843 457])

Remarquons que ces domaines de définition ne se chevauchent pas.

Pour encoder ces entiers, il faudrait 3 octets, et donc une combinaison d'un `int16` et d'un `int8`.

Le rapport entre les enjeux d'optimisation de l'empreinte mémoire et celui de la complexité et de la maintenabilité du code nous conduit à choisir le type `np.uint32`.

Les `NUM` sont des ordinaux :
- `NUM_INSTALMENT_NUMBER` est un sous-index longitudinal, qui renvoie donc à la temporalité (les mensualités).
    - Ce sont tous les entiers de l'intervalle [1, 277].
- `NUM_INSTALMENT_VERSION` est un numéro de version, qui marque une étape de renégociation des conditions du contrat.
    - Ils sont distribués sur l'intervalle [0, 178], avec 65 valeurs uniques.

Les `DAYS` sont des ordinaux qui représentent un nombre de jour dans le passé en référence à la demande actuelle. Ces entiers sont négatifs et compris entre -1 et -2922 pour `DAYS_INSTALMENT` et -4921 pour `DAYS_ENTRY_PAYMENT`. Comme partout ailleurs, nous inversons le signe pour travailler avec des grandeurs positives : c'est moins confus.

Les `AMT` sont des montants monétaires (flottants) positifs compris entre 0 et 3 771 487,845.

Le `float32` (ou [**simple précision**](https://en.wikipedia.org/wiki/Single-precision_floating-point_format)) utilise utilise 32 bits pour représenter un nombre à virgule flottante. La plage de valeurs va d'environ -3.4 x 10^38 à 3.4 x 10^38. La précision relative est d'environ 1/2^24, ce qui signifie qu'il peut représenter des nombres avec environ 7 chiffres significatifs. L'intervalle de valeurs est suffisant pour couvrir notre intervalle de valeurs (ce qui ne serait pas le cas avec `float16`) et notre application n'attend pas une précision fine, au centime près, sur les montants. L'enjeu est d'abord l'état de défaillance ou non d'un client, et ensuite l'ordre de grandeur des montants en jeu ((dés)espérance de perte).

Comme chaque table, et dans le but de pouvoir évaluer immédiatement la corrélation entre chaque variable et la cible, nous _targetisons_ la table, c'est-à-dire lui ajoutons la colonne `TARGET`.

|Variable|Type|Groupe|Sous-groupe|Transformations|
|-|-|-|-|-|
|**`SK_ID_PREV`**|`np.uint32`|`SK_ID`|`_`|`astype(np.uint32)`|
|**`SK_ID_CURR`**|`np.uint32`|`SK_ID`|`_`|`astype(np.uint32)`|
|**`NUM_INSTALMENT_VERSION`**|`np.uint8`|`NUM`|`_`|`astype(np.uint8)`|
|**`NUM_INSTALMENT_NUMBER`**|`np.uint16`|`NUM`|`_`|`astype(np.uint16)`|
|**`DAYS_INSTALMENT`**|`np.uint16`|`DAYS`|`_`|`negate`, `astype(np.uint16)`|
|**`DAYS_ENTRY_PAYMENT`**|`np.uint16`|`DAYS`|`_`|`negate`, `astype(np.uint16)`|
|**`AMT_INSTALMENT`**|`np.float16`|`AMT`|`_`|`astype(np.float16)`|
|**`AMT_PAYMENT`**|`np.float16`|`AMT`|`_`|`astype(np.float16)`|

Les NA : il existe une fraction marginale de NA pour 2 905 enregistrements (couples `DAYS_ENTRY_PAYMENT`, `AMT_PAYMENT`). Mais nous observons (voir ci-dessous) que le taux de défaillance est 3 fois plus élevé pour ces cas. Le

Par conséquent, nous nous permettons l'option d'éliminer ces cas (`dropna`), notamment pour pouvoir opérer nos casts (la présence de Na force le maintien en `float64`), mais prévoyons **TODO** de les explorer plus avant (voir si d'autres variables d'autres tables nous informent mieux sur ces cas et nous donne les moyens de mieux les traiter).

Ordre des colonnes :
- nous regroupons les couples 'DAYS_INSTALLMENT', `AMT_INSTALMENT` et `DAYS_ENTRY_PAYMENT`, `AMT_PAYMENT`

## Avant transformation

In [1]:
from home_credit.load import load_raw_table
from home_credit.merge import targetize
from home_credit.utils import display_frame_basic_infos

data = load_raw_table("installments_payments").copy()
targetize(data)

display_frame_basic_infos(data)
data.info()

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\installments_payments.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_train.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_test.pqt
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 9 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   TARGET                  int8   
 1   SK_ID_PREV              int64  
 2   SK_ID_CURR              int64  
 3   NUM_INSTALMENT_VERSION  float64
 4   NUM_INSTALMENT_NUMBER   int64  
 5   DAYS_INSTALMENT         float64
 6   DAYS_ENTRY_PAYMENT      float64
 7   AMT_INSTALMENT          float64
 8   AMT_PAYMENT             float64
dtypes: float64(5), int64(3), int8(1)
memory usage: 843.4 MB


## Après transformation

In [4]:
from pepper.db_utils import cast_columns
from home_credit.feat_eng import negate_numerical_data
import numpy as np

data.dropna(inplace=True)
cast_columns(data, ["SK_ID_PREV", "SK_ID_CURR"], np.uint32)
cast_columns(data, ["NUM_INSTALMENT_VERSION"], np.uint8)
negate_numerical_data(data.DAYS_INSTALMENT)
negate_numerical_data(data.DAYS_ENTRY_PAYMENT)
cast_columns(data, ["NUM_INSTALMENT_NUMBER", "DAYS_INSTALMENT", "DAYS_ENTRY_PAYMENT"], np.uint16)
cast_columns(data, ["AMT_INSTALMENT", "AMT_PAYMENT"], np.float16)
data = data[list(data.columns[:6]) + ["AMT_INSTALMENT", "DAYS_ENTRY_PAYMENT", "AMT_PAYMENT"]]
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13602496 entries, 0 to 13605348
Data columns (total 9 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   TARGET                  int8   
 1   SK_ID_PREV              uint32 
 2   SK_ID_CURR              uint32 
 3   NUM_INSTALMENT_VERSION  uint8  
 4   NUM_INSTALMENT_NUMBER   uint16 
 5   DAYS_INSTALMENT         uint16 
 6   AMT_INSTALMENT          float16
 7   DAYS_ENTRY_PAYMENT      uint16 
 8   AMT_PAYMENT             float16
dtypes: float16(2), int8(1), uint16(3), uint32(2), uint8(1)
memory usage: 363.2 MB


## Fonctions intégrées

### **`get_clean_installments_payments`**

Extraction de la table nettoyée.

In [1]:
from home_credit.clean_up import get_clean_installments_payments

display(get_clean_installments_payments())

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\installments_payments.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_train.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_test.pqt


RAW_INSTALLMENTS_PAYMENTS,TARGET,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,AMT_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_PAYMENT
0,0,1054186,161674,1,6,1180,6948.0,1187,6948.0
1,0,1330831,151639,0,34,2156,1717.0,2156,1717.0
2,0,2085231,193053,2,1,63,25424.0,63,25424.0
3,0,2452527,199697,1,3,2418,24352.0,2426,24352.0
4,0,2714724,167756,1,2,1383,2166.0,1366,2160.0
...,...,...,...,...,...,...,...,...,...
13605344,1,2006721,442291,1,3,1311,2934.0,1318,2934.0
13605345,1,1126000,428449,0,12,301,6792.0,302,6752.0
13605346,-1,1519070,444122,1,5,399,4364.0,407,4364.0
13605347,0,2784672,444977,0,4,157,373.0,157,373.0


### **`get_clean_installments_payments_without_entry`**

Extraction de la sous-table des cas où `DAYS_ENTRY_PAYMENT` et `AMT_PAYMENT` sont NA.

In [2]:
from home_credit.clean_up import get_clean_installments_payments_without_entry

display(get_clean_installments_payments_without_entry())

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\installments_payments.pqt


RAW_INSTALLMENTS_PAYMENTS,TARGET,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,AMT_INSTALMENT
3764207,1,1531600,103793,1,7,668,49728.0
3764208,1,1947105,159974,1,24,36,22848.0
3764209,1,1843773,167270,1,22,20,48096.0
3764210,0,1691592,192536,1,5,2561,7676.0
3764211,0,1531299,157088,0,11,1847,67.5
...,...,...,...,...,...,...,...
13605396,0,2186857,428057,0,66,1624,67.5
13605397,0,1310347,414406,0,47,1539,67.5
13605398,0,1308766,402199,0,43,7,43744.0
13605399,-1,1062206,409297,0,43,1986,67.5


## Annexes

### Optimisation des types

Quel est le meilleur type pour encoder telle ou telle variable numérique ?

In [8]:
import math

print(1, math.log(117_000_000, 2))
print(2, math.log(1_716_428, 2))
print(3, math.log(1_670_214, 2))
print(4, math.log(356_255, 2))
print(5, math.log(43_000, 2))
print(6, math.log(25_229, 2))
print(7, math.log(17_912, 2))
print(8, math.log(4_921, 2))
print(9, math.log(2_922, 2))
print(10, math.log(277, 2))
print(11, math.log(178, 2))

1 26.80193328890758
2 20.71097791032366
3 20.671601532478054
4 18.44255073681091
5 15.392049039364185
6 14.622795403000701
7 14.128638812852598
8 12.264735801130138
9 11.5127404628035
10 8.113742166049189
11 7.475733430966398


### Le cas des NA

L'analyse suivante montre que le taux de défaillance des cas avec NA est 3 fois supérieur à la tendance générale.

Par conséquent, nous n'avons pas intérêt à nous débarrasser de ces cas.

In [5]:
data_na = data[data.DAYS_ENTRY_PAYMENT.isna()]
display(data_na)
data_train = data[data.TARGET != -1]
data_na_train = data_na[data_na.TARGET != -1]
display(data.TARGET.value_counts(normalize=True))
display(data_na.TARGET.value_counts(normalize=True))
display(data_train.TARGET.value_counts(normalize=True))
display(data_na_train.TARGET.value_counts(normalize=True))

RAW_INSTALLMENTS_PAYMENTS,TARGET,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
3764207,1,1531600,103793,1.0,7,-668.0,,49741.020,
3764208,1,1947105,159974,1.0,24,-36.0,,22849.515,
3764209,1,1843773,167270,1.0,22,-20.0,,48092.355,
3764210,0,1691592,192536,1.0,5,-2561.0,,7675.425,
3764211,0,1531299,157088,0.0,11,-1847.0,,67.500,
...,...,...,...,...,...,...,...,...,...
13605396,0,2186857,428057,0.0,66,-1624.0,,67.500,
13605397,0,1310347,414406,0.0,47,-1539.0,,67.500,
13605398,0,1308766,402199,0.0,43,-7.0,,43737.435,
13605399,-1,1062206,409297,0.0,43,-1986.0,,67.500,


 0    0.787286
-1    0.148015
 1    0.064699
Name: TARGET, dtype: float64

 0    0.685370
 1    0.203787
-1    0.110843
Name: TARGET, dtype: float64

0    0.924061
1    0.075939
Name: TARGET, dtype: float64

0    0.770809
1    0.229191
Name: TARGET, dtype: float64

# Nettoyage de **`bureau_balance`**

Avec 3 variables (mais 27.3 millions d'entrées), **`bureau_balance`** est l'un des tables les plus simples à traiter.

Les opérations de nettoyage ci-après nous permettent de diminuer l'empreinte mémoire de 624.8 MB à 599.5 MB et le nombre d'enregistrements de 27 299 925 à 24 179 741.

La première opération à effectuer sur la table **`bureau_balance`** est de la _currentiser_ car elle ne dispose pas nativement d'une colonne `SK_ID_CURR`. Sur cette base, il est possible de supprimer environ 3 millions d'enregistrements qui ne peuvent être rattachés à aucun client.

On effectue cette opération avant les autres, pour maximiser l'économie sur les traitements.

Les `SK_ID_BUREAU` sont distribués en  817 395 uniques sur l'intervalle [5 001 709, 6 842 888].

La moitié seulement des `SK_ID_BUREAU` de la table **`bureau`** sont représentés.

`MONTHS_BALANCE` est un ordinal et un sous-index longitudinal, qui renvoie donc à la temporalité (les mensualités).
    - Ce sont tous les entiers de l'intervalle [0, 96].

La question d'un encodage précoce de `STATUS` se pose naturellement, car avec 8 valeurs possibles, un `np.int8` serait suffisant. Ce serait cependant un empiétement sur les opérations d'ingénierie des caractéristiques opérée en aval.

|Variable|Type|Groupe|Sous-groupe|Transformations|
|-|-|-|-|-|
|**`SK_ID_BUREAU`**|`np.uint32`|`SK_ID`|`_`|`astype(np.uint32)`|
|**`MONTHS_BALANCE`**|`np.uint8`|`NUM`|`_`|`negate`, `astype(np.uint8)`|
|**`STATUS`**|`object`|`STATUS`|`_`|`_`|

## Avant transformation

In [3]:
from home_credit.load import load_raw_table
from home_credit.merge import currentize, targetize
from home_credit.utils import display_frame_basic_infos

# L'objectif est de ne pas surcharger le cache avec les versions brutes intermédiaires
data = load_raw_table("bureau_balance")
data.info()
display_frame_basic_infos(data)

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\bureau_balance.pqt
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   SK_ID_BUREAU    int64 
 1   MONTHS_BALANCE  int64 
 2   STATUS          object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB


## Après transformation

In [5]:
from pepper.db_utils import cast_columns
from home_credit.feat_eng import negate_numerical_data
import numpy as np

data = load_raw_table("bureau_balance")
currentize(data)
data.dropna(inplace=True)
targetize(data)
cast_columns(data, ["SK_ID_BUREAU", "SK_ID_CURR"], np.uint32)
negate_numerical_data(data.MONTHS_BALANCE)
cast_columns(data, ["MONTHS_BALANCE"], np.uint8)
data.info()

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\bureau_balance.pqt
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24179741 entries, 0 to 27299924
Data columns (total 5 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   TARGET          int8  
 1   SK_ID_CURR      uint32
 2   SK_ID_BUREAU    uint32
 3   MONTHS_BALANCE  uint8 
 4   STATUS          object
dtypes: int8(1), object(1), uint32(2), uint8(1)
memory usage: 599.5+ MB


In [6]:
display(data.STATUS.value_counts(dropna=False))

C    11555429
0     7195282
X     5115090
1      229773
5       50334
2       20954
3        7833
4        5046
Name: STATUS, dtype: int64

## Fonctions intégrées

### **`get_clean_bureau_balance`**

Extraction de la table nettoyée.

In [7]:
from home_credit.clean_up import get_clean_bureau_balance

display(get_clean_bureau_balance())

RAW_BUREAU_BALANCE,TARGET,SK_ID_CURR,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,0,380361,5715448,0,C
1,0,380361,5715448,1,C
2,0,380361,5715448,2,C
3,0,380361,5715448,3,C
4,0,380361,5715448,4,C
...,...,...,...,...,...
27299920,1,101874,5041336,47,X
27299921,1,101874,5041336,48,X
27299922,1,101874,5041336,49,X
27299923,1,101874,5041336,50,X


### **`get_clean_bureau_balance_with_na_current`**

Extraction du sous-ensemble des enregistrements de `bureau_balance` qui ne sont rattachés à aucun client actuel.

Cela représente 3 120 184 cas.

In [8]:
from home_credit.clean_up import get_clean_bureau_balance_with_na_current

display(get_clean_bureau_balance_with_na_current())

RAW_BUREAU_BALANCE,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
1220,5717409,90,X
1221,5717409,91,0
1222,5717409,92,X
1223,5717409,93,0
1224,5717409,94,0
...,...,...,...
27299794,5041143,92,0
27299795,5041143,93,0
27299796,5041143,94,0
27299797,5041143,95,0


# Nettoyage de **`bureau`**

Caractéristiques :
* $1\,716\,428$ enregistrements.
* $17$ variables, dont $2$ **`SK`**, $4$ **`DAYS`**, $1$ **`CNT`**, $6$ **`AMT`**, $4$ **`CREDIT`**

Les opérations de nettoyage ci-après nous permettent de diminuer l'empreinte mémoire de 222.6 MB à 261.5 MB.

Les `SK_ID` (clés primaires) sont dans les quantités suivantes :
- 356 255 pour **`SK_ID_CURR`** (intervalle [100 001, 456 255])
- 1 716 428 pour **`SK_ID_BUREAU`** (intervalle [5 000 000, 6 843 457])

Remarquons que ces domaines de définition ne se chevauchent pas.

Comme précédemment, nous les convertissons en `np.uint32`.

Il y a 3 variables catégorielles comportant peu de modalités, et aucun NA :
- **`CREDIT_ACTIVE`** (4 classes) : [`Closed` (63 %), `Active` (37 %), `Sold`, `Bad debt`]
- **`CREDIT_CURRENCY`** (4 classes) : [`currency 1` (100 %), `currency 2`, `currency 3`, `currency 4`]
    - Il peut être pertinent de supprimer les rares enregistrements concernant les monnaies 2 à 4 qui peuvent induire du bruit voire des biais en raison de la différence d'échelle, alors que nous ne disposons pas des taux de change.
- **`CREDIT_TYPE`** (15 classes) : [`Consumer credit` (73 %), `Credit card` (23 %), `Car loan` (2 %), `Mortgage` (1 %), `Microloan` (1 %), ...]

Aucun NA, y compris codé à l'aide d'un code spécial.

On n'effectue pas de conversion : on laisse le one hot encoding faire ce travail en aval.

En revanche, on regroupe ces variables catégorielles qui représentent l'état actuel et les paramètres, après la rubrique des `SK_ID`.

**`CNT_CREDIT_PROLONG`** est le seul indicateur cardinal et indique le nombre de fois que le crédit a été prolongé. Il prend les valeurs 0 à 9 et peut donc être codé avec un np.int8, voire one hot encodé.


**Pour compléter ce nettoyage, et notamment imputer ou corriger certaines valeurs, il est primordial de revenir en arrière sur l'analyse exploratoire, qui a mis en évidence certaines relations et critères d'aberration.**

**Notamment, il y a trop de cas de NA qui ont une explication logique et qu'il faut retraiter ou supprimer**

**Next step, c'est donc d'actualiser mon analyse exploratoire de `bureau`**.

|Variable|Type|Groupe|Sous-groupe|Transformations|
|-|-|-|-|-|
|**`SK_ID_CURR`**|`np.uint32`|`SK_ID`|`_`|`astype(np.uint32)`|
|**`SK_ID_BUREAU`**|`np.uint32`|`SK_ID`|`_`|`astype(np.uint32)`|
|**`CREDIT_ACTIVE`**|`object`|`STATUS`|`_`|`_`|
|**`CREDIT_CURRENCY`**|`object`|`PARAM`|`_`|`_`|
|**`CREDIT_TYPE`**|`object`|`PARAM`|`_`|`_`|
|**`CNT_CREDIT_PROLONG`**|`int8`|`IND`|`_`|`_`|


In [1]:
from home_credit.load import load_raw_table
from home_credit.utils import display_frame_basic_infos
#from home_credit.merge import currentize, targetize

data = load_raw_table("bureau")
display_frame_basic_infos(data)
data.info()

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\bureau.pqt
[1mn_samples[0m: 1 716 428
[1mn_columns[0m: 17, [('SK', 2), ('DAYS', 4), ('CNT', 1), ('AMT', 6), ('CREDIT', 4)]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_CURR              int64  
 1   SK_ID_BUREAU            int64  
 2   CREDIT_ACTIVE           object 
 3   CREDIT_CURRENCY         object 
 4   DAYS_CREDIT             int64  
 5   CREDIT_DAY_OVERDUE      int64  
 6   DAYS_CREDIT_ENDDATE     float64
 7   DAYS_ENDDATE_FACT       float64
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int64  
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object 
 15  DAYS_CREDIT_UPDATE      int64  
 16  AMT_ANNUITY          

## Après transformation

In [2]:
from home_credit.tables import Bureau
from home_credit.cols_map import _reload_config

_reload_config()
display(Bureau.cols_group("ages"))

['DAYS_CREDIT_UPDATE',
 'DAYS_CREDIT',
 'DAYS_CREDIT_ENDDATE',
 'DAYS_ENDDATE_FACT',
 'CREDIT_DAY_OVERDUE']

In [10]:
from home_credit.merge import targetize
from pepper.db_utils import cast_columns
from pepper.feat_eng import nullify
from home_credit.feat_eng import negate_numerical_data
from home_credit.tables import Bureau
import numpy as np

data = load_raw_table("bureau")
targetize(data)

ages_cols = Bureau.cols_group("ages")[:-1]
data[ages_cols] = -data[ages_cols]

cast_columns(data, ["SK_ID_CURR", "SK_ID_BUREAU"], np.uint32)
cast_columns(data, "CNT_CREDIT_PROLONG", np.uint8)

cast_columns(data, "DAYS_CREDIT_UPDATE", np.int32)
# cast_columns(data, "DAYS_CREDIT_ENDDATE", np.int32) NAs
cast_columns(data, "DAYS_CREDIT", np.uint16)
# cast_columns(data, "DAYS_ENDDATE_FACT", np.uint16) NAs
cast_columns(data, "CREDIT_DAY_OVERDUE", np.uint16)

# Downcast float32 du groupe financial_statement impossible, tous des NA sauf 1

data.set_index(["SK_ID_CURR", "SK_ID_BUREAU"], inplace=True)

data.info()

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\bureau.pqt
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1716428 entries, (215354, 5714462) to (246829, 5057778)
Data columns (total 16 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   TARGET                  int8   
 1   CREDIT_ACTIVE           object 
 2   CREDIT_CURRENCY         object 
 3   DAYS_CREDIT             uint16 
 4   CREDIT_DAY_OVERDUE      uint16 
 5   DAYS_CREDIT_ENDDATE     float64
 6   DAYS_ENDDATE_FACT       float64
 7   AMT_CREDIT_MAX_OVERDUE  float64
 8   CNT_CREDIT_PROLONG      uint8  
 9   AMT_CREDIT_SUM          float64
 10  AMT_CREDIT_SUM_DEBT     float64
 11  AMT_CREDIT_SUM_LIMIT    float64
 12  AMT_CREDIT_SUM_OVERDUE  float64
 13  CREDIT_TYPE             object 
 14  DAYS_CREDIT_UPDATE      int32  
 15  AMT_ANNUITY             float64
dtypes: float64(8), int32(1), int8(1), object(3), uint16(2), uint8(1)
memory usage: 261.5+ MB


# Nettoyage de **`pos_cash_balance`**

Les opérations de nettoyage ci-après nous permettent de :
- de passer de 10 001 358 enregistrements à 9 975 174 après élimination de 26 184 échantillons avec valeur manquante.
- diminuer l'empreinte mémoire de 610.4 MB à 304.4 MB.

⚠ $37\,422$ ($2.2\,\%$ des) clients demandeurs actuels ont des détails dans `pos_cash_balance` alors qu'ils ne sont pas enregistrés dans `previous_application`.

`nullify(data.NAME_CONTRACT_STATUS, "XNA")` : les deux seuls cas NA le sont également pour les `CNT_`, ils n'apparaissent donc pas dans la sortie nettoyée.

Rappel des 8 classes :
* $91\,\%$ `Active`
* $7.4\,\%$ `Completed`
* $0.9\,\%$ `Signed`
* $0.1\,\%$ `Demand`
* $0.1\,\%$ `Returned to the store`
* $\varepsilon$ `Approved`
* $\varepsilon$ `Amortized debt`
* $\varepsilon$ `Canceled`

26 184 `CNT_*` en NA, soit moins de 3 pour 1000.

Les `CNT` :
- `CNT_INSTALMENT` : [1, 92]
- `CNT_INSTALMENT_FUTURE` : [0, 85]

Les faux `SK` et vrais `DAYS`
- `SK_DPD` : [0, 4 231]
- `SK_DPD_DEF` : [0, 3 595]

Transtypages :

|Variable|Type|Groupe|Sous-groupe|Transformations|
|-|-|-|-|-|
|**`SK_ID_CURR`**|`np.uint32`|`SK_ID`|`_`|`astype(np.uint32)`|
|**`SK_ID_BUREAU`**|`np.uint32`|`SK_ID`|`_`|`astype(np.uint32)`|
|**`MONTHS_BALANCE`**|`np.uint8`|`NUM`|`_`|`negate`, `astype(np.uint8)`|
|**`CNT_INSTALMENT`**|`np.uint8`|`CNT`|`_`|`astype(np.uint8)`|
|**`CNT_INSTALMENT_FUTURE`**|`np.uint8`|`CNT`|`_`|`astype(np.uint8)`|
|**`NAME_CONTRACT_STATUS`**|`object`|`STATUS`|`_`|`_`|
|**`SK_DPD`**|`np.uint16`|`DAYS`|`_`|`astype(np.uint16)`|
|**`SK_DPD_DEF`**|`np.uint16`|`DAYS`|`_`|`astype(np.uint16)`|


## Avant transformation

In [1]:
from home_credit.load import load_raw_table
from home_credit.utils import display_frame_basic_infos

data = load_raw_table("pos_cash_balance")
display_frame_basic_infos(data)
data.info()

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\pos_cash_balance.pqt
[1mn_samples[0m: 10 001 358
[1mn_columns[0m: 8, [('SK', 4), ('NAME', 1), ('CNT', 2), ('MONTHS', 1)]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SK_ID_PREV             int64  
 1   SK_ID_CURR             int64  
 2   MONTHS_BALANCE         int64  
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object 
 6   SK_DPD                 int64  
 7   SK_DPD_DEF             int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB


## Après transformation

In [5]:
from home_credit.merge import targetize
from pepper.db_utils import cast_columns
from pepper.feat_eng import nullify
from home_credit.feat_eng import negate_numerical_data
import numpy as np

data = load_raw_table("pos_cash_balance")
targetize(data)
nullify(data.NAME_CONTRACT_STATUS, "XNA")
cast_columns(data, ["SK_ID_PREV", "SK_ID_CURR"], np.uint32)
negate_numerical_data(data.MONTHS_BALANCE)
cast_columns(data, ["MONTHS_BALANCE"], np.uint8)
cast_columns(data, ["SK_DPD", "SK_DPD_DEF"], np.uint16)

data_na = data[data.isna().any(axis=1)]
data.dropna(inplace=True)
cast_columns(data, ["CNT_INSTALMENT", "CNT_INSTALMENT_FUTURE"], np.uint8)

data.info()

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\pos_cash_balance.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_train.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_test.pqt
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9975174 entries, 0 to 10001357
Data columns (total 9 columns):
 #   Column                 Dtype 
---  ------                 ----- 
 0   TARGET                 int8  
 1   SK_ID_PREV             uint32
 2   SK_ID_CURR             uint32
 3   MONTHS_BALANCE         uint8 
 4   CNT_INSTALMENT         uint8 
 5   CNT_INSTALMENT_FUTURE  uint8 
 6   NAME_CONTRACT_STATUS   object
 7   SK_DPD                 uint16
 8   SK_DPD_DEF             uint16
dtypes: int8(1), object(1), uint16(2), uint32(2), uint8(3)
memory usage: 304.4+ MB


## Fonctions intégrées

### **`get_clean_pos_cash_balance`**

Extraction de la table nettoyée.

In [2]:
from home_credit.clean_up import get_clean_pos_cash_balance

display(get_clean_pos_cash_balance())

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\pos_cash_balance.pqt


RAW_POS_CASH_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,0,1803195,182943,31,48,45,Active,0,0
1,0,1715348,367990,33,36,35,Active,0,0
2,0,1784872,397406,32,12,9,Active,0,0
3,0,1903291,269225,35,48,42,Active,0,0
4,0,2341044,334279,35,36,35,Active,0,0
...,...,...,...,...,...,...,...,...,...
10001353,0,2448283,226558,20,6,0,Active,843,0
10001354,0,1717234,141565,19,12,0,Active,602,0
10001355,0,1283126,315695,21,10,0,Active,609,0
10001356,0,1082516,450255,22,12,0,Active,614,0


### **`get_clean_pos_cash_balance_with_na`**

Extraction du sous-ensemble des enregistrements de `pos_cash_balance` qui comportent au moins un NA.

Cela représente 26 184 cas.

In [1]:
from home_credit.clean_up import get_clean_pos_cash_balance_with_na

display(get_clean_pos_cash_balance_with_na())

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\pos_cash_balance.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_train.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_test.pqt


RAW_POS_CASH_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
709,-1,1889585,403422,6,-1,-1,Signed,0,0
759,0,1618886,382448,2,-1,-1,Signed,0,0
1887,-1,2288203,429576,6,-1,-1,Signed,0,0
1899,-1,2110224,167171,6,-1,-1,Signed,0,0
1910,-1,2031967,235187,5,-1,-1,Signed,0,0
...,...,...,...,...,...,...,...,...,...
9998668,0,1770932,441177,10,-1,-1,Signed,0,0
9998696,0,1770932,441177,11,-1,-1,Signed,0,0
9999114,0,1770932,441177,8,-1,-1,Signed,0,0
9999116,0,1770932,441177,9,-1,-1,Signed,0,0


# Nettoyage de **`credit_card_balance`**

Les opérations de nettoyage ci-après nous permettent de diminuer l'empreinte mémoire de 673.9 MB à _ MB.

Les opérations de nettoyage ci-après nous permettent de :
- de passer de 3 840 312 enregistrements à _ après élimination de _ échantillons _.
- diminuer l'empreinte mémoire de 610.4 MB à 304.4 MB.

## Avant transformation

In [3]:
from home_credit.load import load_raw_table
from home_credit.utils import display_frame_basic_infos

data = load_raw_table("credit_card_balance")
data.info()
display_frame_basic_infos(data)

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\credit_card_balance.pqt
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   SK_ID_PREV                  int64  
 1   SK_ID_CURR                  int64  
 2   MONTHS_BALANCE              int64  
 3   AMT_BALANCE                 float64
 4   AMT_CREDIT_LIMIT_ACTUAL     int64  
 5   AMT_DRAWINGS_ATM_CURRENT    float64
 6   AMT_DRAWINGS_CURRENT        float64
 7   AMT_DRAWINGS_OTHER_CURRENT  float64
 8   AMT_DRAWINGS_POS_CURRENT    float64
 9   AMT_INST_MIN_REGULARITY     float64
 10  AMT_PAYMENT_CURRENT         float64
 11  AMT_PAYMENT_TOTAL_CURRENT   float64
 12  AMT_RECEIVABLE_PRINCIPAL    float64
 13  AMT_RECIVABLE               float64
 14  AMT_TOTAL_RECEIVABLE        float64
 15  CNT_DRAWINGS_ATM_CURRENT    float64
 16  CNT_DRAWINGS_CURRENT        int64  
 17  CNT_DRAWING

## Après transformation

In [30]:
from home_credit.merge import targetize
from pepper.db_utils import cast_columns
from pepper.feat_eng import nullify
from home_credit.feat_eng import negate_numerical_data
import numpy as np

data = load_raw_table("credit_card_balance")
targetize(data)
negate_numerical_data(data.MONTHS_BALANCE)

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\credit_card_balance.pqt


## Relations entre les NA

3 quantités synchronisées :
- 305 236 (8 %) sur `AMT_INST_MIN_REGULARITY` et `CNT_INSTALMENT_MATURE_CUM`
- 749 816 (19,5 %) sur `AMT_DRAWINGS_ATM_CURRENT`, `AMT_DRAWINGS_OTHER_CURRENT`, `AMT_DRAWINGS_POS_CURRENT`, `CNT_DRAWINGS_ATM_CURRENT`, `CNT_DRAWINGS_OTHER_CURRENT`, `CNT_DRAWINGS_POS_CURRENT`
- 767 988 (20 %) sur `AMT_PAYMENT_CURRENT`

In [31]:
display(data)

RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,...,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,-1,2562384,378907,6,56.970,135000,0.0,877.5,0.0,877.5,...,0.000,0.000,0.0,1,0.0,1.0,35.0,Active,0,0
1,1,2582071,363914,1,63975.555,45000,2250.0,2250.0,0.0,0.0,...,64875.555,64875.555,1.0,1,0.0,0.0,69.0,Active,0,0
2,-1,1740877,371185,7,31815.225,450000,0.0,0.0,0.0,0.0,...,31460.085,31460.085,0.0,0,0.0,0.0,30.0,Active,0,0
3,0,1389973,337855,4,236572.110,225000,2250.0,2250.0,0.0,0.0,...,233048.970,233048.970,1.0,1,0.0,0.0,10.0,Active,0,0
4,0,1891521,126868,1,453919.455,450000,0.0,11547.0,0.0,11547.0,...,453919.455,453919.455,0.0,1,0.0,1.0,101.0,Active,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3840307,0,1036507,328243,9,0.000,45000,,0.0,,,...,0.000,0.000,,0,,,0.0,Active,0,0
3840308,0,1714892,347207,9,0.000,45000,0.0,0.0,0.0,0.0,...,0.000,0.000,0.0,0,0.0,0.0,23.0,Active,0,0
3840309,0,1302323,215757,9,275784.975,585000,270000.0,270000.0,0.0,0.0,...,273093.975,273093.975,2.0,2,0.0,0.0,18.0,Active,0,0
3840310,0,1624872,430337,10,0.000,450000,,0.0,,,...,0.000,0.000,,0,,,0.0,Active,0,0


In [32]:
display(data.NAME_CONTRACT_STATUS.value_counts(dropna=False))

Active           3698436
Completed         128918
Signed             11058
Demand              1365
Sent proposal        513
Refused               17
Approved               5
Name: NAME_CONTRACT_STATUS, dtype: int64

Focus :

In [43]:
data_na = data[data.isna().any(axis=1)].copy()
data_na = data_na[[
    "TARGET", "SK_ID_PREV", "SK_ID_CURR", "MONTHS_BALANCE",
    "AMT_INST_MIN_REGULARITY", "CNT_INSTALMENT_MATURE_CUM",
    "AMT_DRAWINGS_ATM_CURRENT", "CNT_DRAWINGS_ATM_CURRENT",
    "AMT_DRAWINGS_OTHER_CURRENT", "CNT_DRAWINGS_OTHER_CURRENT",
    "AMT_DRAWINGS_POS_CURRENT", "CNT_DRAWINGS_POS_CURRENT", 
    "AMT_PAYMENT_CURRENT"
]]
display(data_na)

RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_INST_MIN_REGULARITY,CNT_INSTALMENT_MATURE_CUM,AMT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,CNT_DRAWINGS_POS_CURRENT,AMT_PAYMENT_CURRENT
45,0,2657726,399970,5,0.0,0.0,,,,,,,
47,-1,1517613,121258,6,0.0,0.0,,,,,,,
49,0,2408643,104761,4,0.0,0.0,,,,,,,
52,0,1322825,215709,5,0.0,0.0,,,,,,,
60,0,1217908,162464,5,0.0,0.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3840277,0,1260295,120447,7,0.0,0.0,101250.0,3.0,0.0,0.0,0.0,0.0,
3840303,0,1307188,385981,9,0.0,0.0,,,,,,,
3840306,0,1410474,255737,13,0.0,0.0,,,,,,,
3840307,0,1036507,328243,9,0.0,0.0,,,,,,,


826 036 en tout, il y a donc du recouvrement.

In [44]:
is_installment_na = data_na.AMT_INST_MIN_REGULARITY.isna()
is_drawings_na = data_na.AMT_DRAWINGS_ATM_CURRENT.isna()
is_payment_na = data_na.AMT_PAYMENT_CURRENT.isna()
print("       INA:", data_na[is_installment_na].shape[0])
print("       DNA:", data_na[is_drawings_na].shape[0])
print("       PNA:", data_na[is_payment_na].shape[0])
print("  INA & DNA & PNA:", data_na[is_installment_na & is_drawings_na & is_payment_na].shape[0])
print(" INA & DNA & !PNA:", data_na[is_installment_na & is_drawings_na & ~is_payment_na].shape[0])
print("INA & !DNA & PNA:", data_na[is_installment_na & ~is_drawings_na & is_payment_na].shape[0])
print("!INA & DNA & PNA:", data_na[~is_installment_na & is_drawings_na & is_payment_na].shape[0])
print(" INA & !DNA & !PNA:", data_na[is_installment_na & ~is_drawings_na & ~is_payment_na].shape[0])
print(" !INA & DNA & !PNA:", data_na[~is_installment_na & is_drawings_na & ~is_payment_na].shape[0])
print(" !INA & !DNA & PNA:", data_na[~is_installment_na & ~is_drawings_na & is_payment_na].shape[0])

       INA: 305236
       DNA: 749816
       PNA: 767988
  INA & DNA & PNA: 243091
 INA & DNA & !PNA: 19559
INA & !DNA & PNA: 12131
!INA & DNA & PNA: 479132
 INA & !DNA & !PNA: 30455
 !INA & DNA & !PNA: 8034
 !INA & !DNA & PNA: 33634


Constatons que l'assertion `drw == drw_atm + drw_pos + drw_other` est vérifiée dans $100\,\%$ des cas.

* **`CNT_DRAWINGS_CURRENT`** : nombre de retraits effectués au cours du dernier mois.
* **`CNT_DRAWINGS_ATM_CURRENT`** : nombre de retraits DAB effectués au cours du dernier mois.
* **`CNT_DRAWINGS_POS_CURRENT`** : nombre de retraits effectués au cours du dernier mois pour l'achat de biens.
* **`CNT_DRAWINGS_OTHER_CURRENT`** : nombre de retraits autres que DAB et POS effectués au cours du dernier mois.
* **`CNT_INSTALMENT_MATURE_CUM`** : nombre d'échéances honorées.

On doit pouvoir faire de l'interpolation. Pour un client donné, il y a un profil d'utilisation préférentielle de tel ou tel moyen de paiement. Connaissant le total mensuel, on doit pouvoir en déduire la ventilation par moyen de paiement.

Surprise sur un premier cas client : toute la série est en NA.

Est-ce toujours le cas ?

In [36]:
display(data[data.SK_ID_CURR == 399970])

RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_INST_MIN_REGULARITY,CNT_INSTALMENT_MATURE_CUM,AMT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,CNT_DRAWINGS_POS_CURRENT,AMT_PAYMENT_CURRENT
45,0,2657726,399970,5,0.0,0.0,,,,,,,
429048,0,2657726,399970,10,0.0,0.0,,,,,,,
1037995,0,2657726,399970,9,0.0,0.0,,,,,,,
1273111,0,2657726,399970,4,0.0,0.0,,,,,,,
1349724,0,2657726,399970,11,0.0,0.0,,,,,,,
1878187,0,2657726,399970,12,0.0,0.0,,,,,,,
2071042,0,2657726,399970,2,0.0,0.0,,,,,,,
2128735,0,2657726,399970,1,0.0,0.0,,,,,,,
2538544,0,2657726,399970,7,0.0,0.0,,,,,,,
2814715,0,2657726,399970,3,0.0,0.0,,,,,,,


Nombre de clients uniques concernés : 103 558

In [45]:
print(data_na.SK_ID_CURR.nunique())

89877


Prédicat : toute la série est homogène :

En fait, cela semble être effectivement vérifié pour un couple SK_ID_CURR, SK_ID_PREV

IL faut donc s'intéresser au SK_ID_PREV

In [64]:
def is_user_config_constant(sk_id_prev: int):
    user = data_na[data_na.SK_ID_PREV == sk_id_prev].copy()
    user.drop(columns="MONTHS_BALANCE", inplace=True)
    dropped = user.drop_duplicates()
    n = len(dropped)
    if n > 1:
        display(dropped)
    return n  == 1

print("prev 2657726:", is_user_config_constant(2657726))
print("prev 1517613:", is_user_config_constant(1517613))

for prev in data_na.SK_ID_PREV.unique()[:20]:
    if st := is_user_config_constant(prev):
        print(".", end="")
    else:
        print(f"prev {prev}")

prev 2657726: True
prev 1517613: True
..........

RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,AMT_INST_MIN_REGULARITY,CNT_INSTALMENT_MATURE_CUM,AMT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,CNT_DRAWINGS_POS_CURRENT,AMT_PAYMENT_CURRENT
93,0,1821360,302820,0.0,0.0,,,,,,,
22610,0,1821360,302820,,,,,,,,,


prev 1821360


RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,AMT_INST_MIN_REGULARITY,CNT_INSTALMENT_MATURE_CUM,AMT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,CNT_DRAWINGS_POS_CURRENT,AMT_PAYMENT_CURRENT
99,-1,1128942,281671,0.0,0.0,,,,,,,
83117,-1,1128942,281671,,,,,,,,,


prev 1128942
.

RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,AMT_INST_MIN_REGULARITY,CNT_INSTALMENT_MATURE_CUM,AMT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,CNT_DRAWINGS_POS_CURRENT,AMT_PAYMENT_CURRENT
115,1,1565409,216500,0.0,0.0,,,,,,,
2618491,1,1565409,216500,,,,,,,,,


prev 1565409
......

In [65]:
display(data_na[data_na.SK_ID_PREV == 1565409])

RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_INST_MIN_REGULARITY,CNT_INSTALMENT_MATURE_CUM,AMT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,CNT_DRAWINGS_POS_CURRENT,AMT_PAYMENT_CURRENT
115,1,1565409,216500,4,0.0,0.0,,,,,,,
20370,1,1565409,216500,5,0.0,0.0,,,,,,,
66443,1,1565409,216500,21,0.0,0.0,,,,,,,
151856,1,1565409,216500,19,0.0,0.0,,,,,,,
389987,1,1565409,216500,17,0.0,0.0,,,,,,,
437961,1,1565409,216500,3,0.0,0.0,,,,,,,
543751,1,1565409,216500,20,0.0,0.0,,,,,,,
576190,1,1565409,216500,22,0.0,0.0,,,,,,,
685986,1,1565409,216500,12,0.0,0.0,,,,,,,
732012,1,1565409,216500,18,0.0,0.0,,,,,,,


In [53]:
users = data_na.drop(columns=["TARGET", "MONTHS_BALANCE"])
display(users)
unique_users = users.drop_duplicates()
display(unique_users)
unique_users_sks = unique_users[["SK_ID_PREV", "SK_ID_CURR"]]
display(len(unique_users_sks.drop_duplicates()))

RAW_CREDIT_CARD_BALANCE,SK_ID_PREV,SK_ID_CURR,AMT_INST_MIN_REGULARITY,CNT_INSTALMENT_MATURE_CUM,AMT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,CNT_DRAWINGS_POS_CURRENT,AMT_PAYMENT_CURRENT
45,2657726,399970,0.0,0.0,,,,,,,
47,1517613,121258,0.0,0.0,,,,,,,
49,2408643,104761,0.0,0.0,,,,,,,
52,1322825,215709,0.0,0.0,,,,,,,
60,1217908,162464,0.0,0.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
3840277,1260295,120447,0.0,0.0,101250.0,3.0,0.0,0.0,0.0,0.0,
3840303,1307188,385981,0.0,0.0,,,,,,,
3840306,1410474,255737,0.0,0.0,,,,,,,
3840307,1036507,328243,0.0,0.0,,,,,,,


RAW_CREDIT_CARD_BALANCE,SK_ID_PREV,SK_ID_CURR,AMT_INST_MIN_REGULARITY,CNT_INSTALMENT_MATURE_CUM,AMT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,CNT_DRAWINGS_POS_CURRENT,AMT_PAYMENT_CURRENT
45,2657726,399970,0.0,0.0,,,,,,,
47,1517613,121258,0.0,0.0,,,,,,,
49,2408643,104761,0.0,0.0,,,,,,,
52,1322825,215709,0.0,0.0,,,,,,,
60,1217908,162464,0.0,0.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
3839877,1285739,386434,0.0,0.0,,,,,,,
3840078,1027516,345883,0.0,0.0,0.0,0.0,0.0,0.0,810.00,1.0,
3840187,1202157,269201,0.0,0.0,,,,,,,
3840235,2738440,342993,0.0,0.0,135000.0,3.0,0.0,0.0,5368.05,2.0,


90313

Il faudrait voir les sommes là où il y a des NA :

Je viens de démontrer que si on a des NA, cela correspond exclusivement à des `CNT_DRAWINGS_CURRENT = 0`, ce qui permet d'en déduire que les éléments valent 0 !

In [68]:
cnt_drawings = data[[
    "TARGET", "SK_ID_PREV", "SK_ID_CURR", "MONTHS_BALANCE",
    "CNT_DRAWINGS_CURRENT",
    "CNT_DRAWINGS_ATM_CURRENT", "CNT_DRAWINGS_POS_CURRENT", "CNT_DRAWINGS_OTHER_CURRENT"
]]

cnt_drawings_na = cnt_drawings[cnt_drawings.CNT_DRAWINGS_ATM_CURRENT.isna()]  #.copy()
"""data_na = data_na[[
    "TARGET", "SK_ID_PREV", "SK_ID_CURR", "MONTHS_BALANCE",
    "AMT_INST_MIN_REGULARITY", "CNT_INSTALMENT_MATURE_CUM",
    "AMT_DRAWINGS_ATM_CURRENT", "CNT_DRAWINGS_ATM_CURRENT",
    "AMT_DRAWINGS_OTHER_CURRENT", "CNT_DRAWINGS_OTHER_CURRENT",
    "AMT_DRAWINGS_POS_CURRENT", "CNT_DRAWINGS_POS_CURRENT", 
    "AMT_PAYMENT_CURRENT"
]]"""
display(cnt_drawings_na)
display(cnt_drawings_na[cnt_drawings_na.CNT_DRAWINGS_CURRENT > 0])


RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT
45,0,2657726,399970,5,0,,,
47,-1,1517613,121258,6,0,,,
49,0,2408643,104761,4,0,,,
52,0,1322825,215709,5,0,,,
60,0,1217908,162464,5,0,,,
...,...,...,...,...,...,...,...,...
3840272,0,2463643,315621,15,0,,,
3840303,0,1307188,385981,9,0,,,
3840306,0,1410474,255737,13,0,,,
3840307,0,1036507,328243,9,0,,,


RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT


Même démonstration avec les montants.

Ce serait vraiment pas mal de rendre cela élégant à l'aide d'asserts.

In [70]:
display(data)

RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,...,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,-1,2562384,378907,6,56.970,135000,0.0,877.5,0.0,877.5,...,0.000,0.000,0.0,1,0.0,1.0,35.0,Active,0,0
1,1,2582071,363914,1,63975.555,45000,2250.0,2250.0,0.0,0.0,...,64875.555,64875.555,1.0,1,0.0,0.0,69.0,Active,0,0
2,-1,1740877,371185,7,31815.225,450000,0.0,0.0,0.0,0.0,...,31460.085,31460.085,0.0,0,0.0,0.0,30.0,Active,0,0
3,0,1389973,337855,4,236572.110,225000,2250.0,2250.0,0.0,0.0,...,233048.970,233048.970,1.0,1,0.0,0.0,10.0,Active,0,0
4,0,1891521,126868,1,453919.455,450000,0.0,11547.0,0.0,11547.0,...,453919.455,453919.455,0.0,1,0.0,1.0,101.0,Active,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3840307,0,1036507,328243,9,0.000,45000,,0.0,,,...,0.000,0.000,,0,,,0.0,Active,0,0
3840308,0,1714892,347207,9,0.000,45000,0.0,0.0,0.0,0.0,...,0.000,0.000,0.0,0,0.0,0.0,23.0,Active,0,0
3840309,0,1302323,215757,9,275784.975,585000,270000.0,270000.0,0.0,0.0,...,273093.975,273093.975,2.0,2,0.0,0.0,18.0,Active,0,0
3840310,0,1624872,430337,10,0.000,450000,,0.0,,,...,0.000,0.000,,0,,,0.0,Active,0,0


In [78]:
amt_drawings = data[[
    "TARGET", "SK_ID_PREV", "SK_ID_CURR", "MONTHS_BALANCE",
    "AMT_DRAWINGS_CURRENT",
    "AMT_DRAWINGS_ATM_CURRENT", "AMT_DRAWINGS_POS_CURRENT", "AMT_DRAWINGS_OTHER_CURRENT"
]]

amt_drawings_na = amt_drawings[amt_drawings.AMT_DRAWINGS_ATM_CURRENT.isna()]  #.copy()
display(amt_drawings_na)
display(amt_drawings_na[amt_drawings_na.AMT_DRAWINGS_CURRENT > 0])

RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT
45,0,2657726,399970,5,0.0,,,
47,-1,1517613,121258,6,0.0,,,
49,0,2408643,104761,4,0.0,,,
52,0,1322825,215709,5,0.0,,,
60,0,1217908,162464,5,0.0,,,
...,...,...,...,...,...,...,...,...
3840272,0,2463643,315621,15,0.0,,,
3840303,0,1307188,385981,9,0.0,,,
3840306,0,1410474,255737,13,0.0,,,
3840307,0,1036507,328243,9,0.0,,,


RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT


Pour être rigoureux : sur les `CNT`, il n'y a que des entiers positifs, donc on peut conclure que les éléments ventilés sont à 0.

En revanche, on constate que certains `AMT_DRAWINGS_ATM_CURRENT` peuvent être négatifs, ce qui permettrait d'avoir une somme nulle dont les éléments ne le sont pas.

Mais il n'y a qu'une seul valeur de ce type, manifestement aberrante.

En outre, `AMT_DRAWINGS_CURRENT` n'est pas ici la somme de ses parties.

Un retour sur l'analyse exploratoire montre que cette vérification n'a pas été faite.

La vérification de la relation montre qu'elle est vraie dans 99,9 % des cas (7 150 violent la relation).

In [72]:
display(amt_drawings[amt_drawings.AMT_DRAWINGS_ATM_CURRENT < 0])

RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT
2047409,0,1793522,317928,20,0.0,-6827.31,0.0,0.0


In [80]:
# display(amt_drawings)
amt_diff = amt_drawings.copy()

amt_diff["SUM"] = (
    amt_diff.AMT_DRAWINGS_ATM_CURRENT
    + amt_diff.AMT_DRAWINGS_POS_CURRENT
    + amt_diff.AMT_DRAWINGS_OTHER_CURRENT
)

amt_diff["DIFF"] = (amt_diff.AMT_DRAWINGS_CURRENT - amt_diff.SUM).round(2)

irreg_amt_diff = amt_diff[(amt_diff.DIFF != 0) & amt_diff.AMT_DRAWINGS_ATM_CURRENT.notna()]

display(irreg_amt_diff)

RAW_CREDIT_CARD_BALANCE,TARGET,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,SUM,DIFF
53,0,1894367,113120,1,2145.780,0.0,0.0,0.0,0.0,2145.78
650,0,2155865,344228,5,27003.645,0.0,0.0,0.0,0.0,27003.64
847,-1,1176096,202613,6,12933.315,0.0,0.0,0.0,0.0,12933.32
1595,-1,2339778,249116,7,2414.970,0.0,0.0,0.0,0.0,2414.97
1964,0,1791022,241638,1,1.350,0.0,0.0,0.0,0.0,1.35
...,...,...,...,...,...,...,...,...,...,...
3836041,0,2552357,180541,32,5008.500,0.0,0.0,0.0,0.0,5008.50
3837265,0,1569731,109881,63,1118.835,0.0,0.0,0.0,0.0,1118.84
3837553,0,2580443,132203,69,17330.085,0.0,0.0,0.0,0.0,17330.08
3839884,0,1944847,114069,11,130.500,0.0,0.0,0.0,0.0,130.50


# Nettoyage de **`previous_application`**

Caractéristiques :
- $1\,670\,214$ enregistrements.
- $37$ variables, dont $2$ **`SK`**, $1$ **`FLAG`**, $2$ **`NFLAG`**, $11$ **`NAME`**, $6$ **`DAYS`**, $1$ **`CNT`**, $5$ **`AMT`**, $1$ **`CODE`**, $1$ **`HOUR`**, $1$ **`PRODUCT`**, $3$ **`CODE`**, $1$ **`SELLERPLACE`**, $1$ **`WEEKDAY`**.

Les opérations de nettoyage ci-après nous permettent de diminuer l'empreinte mémoire de 471.5 MB à 433.5 MB et le nombre d'enregistrements est inchangé.

## Avant transformation

In [16]:
from home_credit.load import load_raw_table
from home_credit.utils import display_frame_basic_infos

data = load_raw_table("previous_application")
display_frame_basic_infos(data)
data.info()

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\previous_application.pqt
[1mn_samples[0m: 1 670 214
[1mn_columns[0m: 37, [('SK', 2), ('FLAG', 1), ('NFLAG', 2), ('NAME', 11), ('DAYS', 6), ('CNT', 1), ('AMT', 5), ('CHANNEL', 1), ('CODE', 1), ('HOUR', 1), ('PRODUCT', 1), ('RATE', 3), ('SELLERPLACE', 1), ('WEEKDAY', 1)]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   

In [19]:
from pepper.univar import print_value_counts_dict
from home_credit.tables import PreviousApplication

for col in PreviousApplication.cols_group("all_cats"):
    print_value_counts_dict(data, col)

NAME_CONTRACT_TYPE (4): {'Cash loans': 747553, 'Consumer loans': 729151, 'Revolving loans': 193164, 'XNA': 346}
NAME_CONTRACT_STATUS (4): {'Approved': 1036781, 'Canceled': 316319, 'Refused': 290678, 'Unused offer': 26436}
CODE_REJECT_REASON (9): {'XAP': 1353093, 'HC': 175231, 'LIMIT': 55680, 'SCO': 37467, 'CLIENT': 26436, 'SCOFR': 12811, 'XNA': 5244, 'VERIF': 3535, 'SYSTEM': 717}
NAME_CASH_LOAN_PURPOSE (25): {'XAP': 922661, 'XNA': 677918, 'Repairs': 23765, 'Other': 15608, 'Urgent needs': 8412, 'Buying a used car': 2888, 'Building a house or an annex': 2693, 'Everyday expenses': 2416, 'Medicine': 2174, 'Payments on other loans': 1931, 'Education': 1573, 'Journey': 1239, 'Purchase of electronic equipment': 1061, 'Buying a new car': 1012, 'Wedding / gift / holiday': 962, 'Buying a home': 865, 'Car repairs': 797, 'Furniture': 749, 'Buying a holiday home / land': 533, 'Business development': 426, 'Gasification / water supply': 300, 'Buying a garage': 136, 'Hobby': 55, 'Money for a third per

In [15]:
from pepper.univar import print_value_counts_dict
print_value_counts_dict(data, "FLAG_LAST_APPL_PER_CONTRACT")

FLAG_LAST_APPL_PER_CONTRACT (2): {'Y': 1661739, 'N': 8475}


## Après transformation

In [21]:
from home_credit.merge import targetize
from pepper.db_utils import cast_columns
from pepper.feat_eng import nullify
from home_credit.feat_eng import negate_numerical_data
from home_credit.tables import PreviousApplication
from home_credit.cols_map import get_group
from home_credit.cols_map import _reload_config
import numpy as np


_reload_config()


data = load_raw_table("previous_application")
targetize(data)


# Impute
# Correct the ages to be 0 instead of 365243
cols = PreviousApplication.cols_group("ages")
data[cols] = data[cols].replace(365_243, 0)
data[cols] = -data[cols]
# display(data[cols].min())
# display(data[cols].max())
# cast_columns(data, cols, np.uint16) présence de NA

# Encode
# FLAGS : ('Y', 'N') -> (0, 1)
cols = "FLAG_LAST_APPL_PER_CONTRACT"
to_replace = {"Y": 1, "N": 0}
data[cols] = data[cols].replace(to_replace)

# Downcasts
cast_columns(data, ["SK_ID_PREV", "SK_ID_CURR"], np.uint16)
flags_cols = ["FLAG_LAST_APPL_PER_CONTRACT", "NFLAG_LAST_APPL_IN_DAY"]
cast_columns(data, flags_cols, np.uint8)
cast_columns(data, "SELLERPLACE_AREA", np.uint16)

data.set_index(["SK_ID_CURR", "SK_ID_PREV"], inplace=True)
data.columns.name = "CLEAN_PREVIOUS_APPLICATION"


"""
cast_columns(data, ["SK_ID_PREV", "SK_ID_CURR"], np.uint32)
negate_numerical_data(data.MONTHS_BALANCE)
cast_columns(data, ["MONTHS_BALANCE"], np.uint8)
cast_columns(data, ["SK_DPD", "SK_DPD_DEF"], np.uint16)

data_na = data[data.isna().any(axis=1)]
data.dropna(inplace=True)
cast_columns(data, ["CNT_INSTALMENT", "CNT_INSTALMENT_FUTURE"], np.uint8)
"""

data.info()

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\previous_application.pqt
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1670214 entries, (9733, 64415) to (64604, 59466)
Data columns (total 36 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   TARGET                       1670214 non-null  int8   
 1   NAME_CONTRACT_TYPE           1670214 non-null  object 
 2   AMT_ANNUITY                  1297979 non-null  float64
 3   AMT_APPLICATION              1670214 non-null  float64
 4   AMT_CREDIT                   1670213 non-null  float64
 5   AMT_DOWN_PAYMENT             774370 non-null   float64
 6   AMT_GOODS_PRICE              1284699 non-null  float64
 7   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 8   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 9   FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  uint8  
 10  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  uin

# Nettoyage de **`application`**

Caractéristiques :
- $356\,255$ enregistrements.
- $122$ variables, dont ...

Les opérations de nettoyage ci-après nous permettent de diminuer l'empreinte mémoire de 331.6 MB à 203.3 MB et le nombre d'enregistrements de 356 255 à _.


Groupe **`AMT`** : 4 variables flottantes donnant les montants principaux dimensionnants : revenus, prêt, annuité, actifs. Taux de remplissage de 100 % sauf pour `AMT_ANNUITY` (36 NA) et `AMT_GOODS_PRICE` (278 NA).

Groupe **`FLAGS`** : 9 variables booléennes en y incluant `CODE_GENDER` (qui a un NA caché sous le code `XNA`, imputable). Taux de remplissage de 100 %.

Groupe **`NAME_TYPE`** : 8 variables catégorielles socio-économiques. Un maximum de 58 modalités pour `ORGANIZATION_TYPE`, et au plus 18 pour les autres. Taux de remplissage de 100 % sauf pour `NAME_TYPE_SUITE` (99,4 %, 2 203 NA) et `OCCUPATION_TYPE` (68,6 %, 111 996 NA).

Groupe **`FAMILY_SIZE`** : 2 variables `CNT` entières positives $\le 21$. Taux de remplissage de 100 % sauf pour `CNT_FAM_MEMBERS` qui a **2** NA, mais imputable partant de `CNT_CHILDREN`, `NAME_TYPE_SUITE` et `NAME_FAMILY_STATUS`.

Groupe **`DAYS`** : 5 variables `DAYS` à traiter comme les autres (inversion du signe, valeur 365 243). Taux de remplissage de 100 % sauf pour `DAYS_LAST_PHONE_CHANGE` qui a **1** NA!

Groupe **`OWN_CAR_AGE`** : une variable entière positive $\le 91$ qui forme un groupe à elle seule, car les valeurs sont aberrantes (décomptes parfois en mois, parfois en années). Taux de remplissage de 34 %. Lié à `FLAG_OWN_CAR`.

Groupe **`APPR_PROCESS_START`**, de même nature que **`AMT_REQ_CREDIT_BUREAU`** : 2 variables entières positives $\lt 24$. Taux de remplissage de 100 %, mais `WEEKDAY_APPR_PROCESS_START` ressort en `object`.

Groupe **`REGION`** : 3 variables, une variable mal calibrée ([0,000253; 0,072508]) qui donne un indice de population relative, et deux notes entiers, l'une dans [1, 3], l'autre dans [-1, 3]. Taux de remplissage de 100 %.

Groupe **`LIVE_WORK_FLAGS`** : 6 booléens indiquant si le demandeur travaille et loge dans la même région, la même ville. Taux de remplissage de 100%.

Groupe **`EXT_SOURCE`** : 3 scores des clients, sans information sur la source ni la méthode de construction. Ces scores font partie des variables les plus déterminantes de l'évaluation du client. Taux de remplissage respectifs 45,6 %, 99,8 %, 80,5 %.

Groupe **`CNT_CIRCLE`** : 47 indices (moyenne, médiane, modes) de tendance centrale concernant le lieu d'habitation du demandeur, et concernant 19 indicateurs. Taux de remplissage variables, allant de 30 à 50 %.

Groupe **`CNT_CIRCLE`** : 4 variables entières positives, $\le 34$ pour les `DEF`, et $\le 354$ pour les `OBS`. Taux de remplissage conjoint de 99,7 %, avec 1 050 NA.

Groupe **`FLAG_DOCUMENT`** : 20 variables booléennes, taux de remplissage de 100%.

Groupe **`AMT_REQ_CREDIT_BUREAU`** : 6 variables entières positives, dont 5 sont $\le 27$ et une dont le maximum est 261. Taux de remplissage conjoint de 86,6 % (47 568 NA).

**TODO** Pour le tableau des conversions, faire une fonction qui le produit automatiquement.


## Avant transformation

In [64]:
from home_credit.load import get_table #, load_raw_table
from home_credit.utils import display_frame_basic_infos

# TODO : faire un dossier de persistance dans tmp qui complète
# dataset et qui est alimenté depuis le cache
# data = load_raw_table("application")
data = get_table("application")
display_frame_basic_infos(data)
data.info()

[1mn_samples[0m: 356 255
[1mn_columns[0m: 122, [('SK', 1), ('FLAG', 28), ('NAME', 6), ('DAYS', 5), ('CNT', 2), ('AMT', 10), ('APARTMENTS', 3), ('BASEMENTAREA', 3), ('CODE', 1), ('COMMONAREA', 3), ('DEF', 2), ('ELEVATORS', 3), ('EMERGENCYSTATE', 1), ('ENTRANCES', 3), ('EXT', 3), ('FLOORSMAX', 3), ('FLOORSMIN', 3), ('FONDKAPREMONT', 1), ('HOUR', 1), ('HOUSETYPE', 1), ('LANDAREA', 3), ('LIVE', 2), ('LIVINGAPARTMENTS', 3), ('LIVINGAREA', 3), ('NONLIVINGAPARTMENTS', 3), ('NONLIVINGAREA', 3), ('OBS', 2), ('OCCUPATION', 1), ('ORGANIZATION', 1), ('OWN', 1), ('REG', 4), ('REGION', 3), ('TARGET', 1), ('TOTALAREA', 1), ('WALLSMATERIAL', 1), ('WEEKDAY', 1), ('YEARS', 6)]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 331.6+ MB


In [27]:
name_cols = list(data.columns[data.columns.str.startswith("NAME")])
name_cols.extend(["CODE_GENDER", "OCCUPATION_TYPE", "ORGANIZATION_TYPE"])
for col in name_cols:
    vc_dict = data[col].value_counts(dropna=False).to_dict()
    print(f"{col} ({len(vc_dict)}): {vc_dict}")

NAME_CONTRACT_TYPE (2): {'Cash loans': 326537, 'Revolving loans': 29718}
NAME_TYPE_SUITE (8): {'Unaccompanied': 288253, 'Family': 46030, 'Spouse, partner': 12818, 'Children': 3675, None: 2203, 'Other_B': 1981, 'Other_A': 975, 'Group of people': 320}
NAME_INCOME_TYPE (8): {'Working': 183307, 'Commercial associate': 83019, 'Pensioner': 64635, 'State servant': 25235, 'Unemployed': 23, 'Student': 20, 'Businessman': 11, 'Maternity leave': 5}
NAME_EDUCATION_TYPE (5): {'Secondary / secondary special': 252379, 'Higher education': 87379, 'Incomplete higher': 12001, 'Lower secondary': 4291, 'Academic degree': 205}
NAME_FAMILY_STATUS (6): {'Married': 228715, 'Single / not married': 52480, 'Civil marriage': 34036, 'Separated': 22725, 'Widow': 18297, 'Unknown': 2}
NAME_HOUSING_TYPE (6): {'House / apartment': 316513, 'With parents': 17074, 'Municipal apartment': 12800, 'Rented apartment': 5599, 'Office apartment': 3024, 'Co-op apartment': 1245}
CODE_GENDER (3): {'F': 235126, 'M': 121125, 'XNA': 4}
O

## Après transformation

Première partie : traitement des NA et autres aberrations

In [74]:
from home_credit.tables import Application

data = get_table("application").copy()

# la table `application` est nativement targetisée et currentisée.

# Imputation des NA et d'autres valeurs aberrantes

# Un cas unique (408583) non entier
cols = "DAYS_REGISTRATION"
data[cols] = data[cols].round()

# Un cas unique (118330)
cols = "DAYS_LAST_PHONE_CHANGE"
data[cols] = data[cols].fillna(0)

# Deux cas (148605 et 317181)
cols = "CNT_FAM_MEMBERS"
data[cols] = data[cols].fillna(1)

# La durée d'emploi des retraités tombe à 0 et non à 365243
cols = "DAYS_EMPLOYED"
data[cols] = data[cols].replace(365_243, 0)

# L'unique client (224393) de la plus petite région par la population
# est le cas unique de REGION_RATING_CLIENT_W_CITY à -1
# Ce score peut être rectifié à 2 (voir la justification dans assert.ipynb)
# **Note** la discussion sur ce cas apporte des résultats intéressants à propos des régions
cols = "REGION_RATING_CLIENT_W_CITY"
data[cols] = data[cols].replace(-1, 2)

# SOCIAL_CIRCLE, on remplace les 4 NA par leur équivalent, 4 zéros.
cols = Application.cols_group("social_circle_counts")
data[cols] = data[cols].fillna(0)

# Codage à -1 des NA de AMT_REQ_CREDIT_BUREAU le temps des pré-traitements,
# pour éviter du `float64`, mais il ne faudra pas oublier de
# retraiter avant de passer à l'entraînement des modèles.
cols = Application.cols_group("credit_bureau_request_counts")
data[cols] = data[cols].fillna(-1)



# Suppression des NA qui rendent les données inexploitables :
# - absence du montant de l'annuité (`AMT_ANNUITY`, 36 cas)
# - absence du montant de l'actif (`AMT_GOODS_PRICE`, 278 cas)
cols = ["AMT_ANNUITY", "AMT_GOODS_PRICE"]
data.dropna(subset=cols, inplace=True)
# data_na = data[data.isna().any(axis=1)]
# data.dropna(inplace=True)

# TODO Suppression à étudier des données SOCIAL_CIRCLE qui semblent plutôt introduire du bruit
# TODO Idem avec HOUSING_STATS

# Repasser les XNA à np.nan n'a pas de grand intérêt pour la suite
# sinon uniformiser la représentation des NA
# nullify(data.NAME_CONTRACT_STATUS, "XNA")

display(data.isna().sum().sum())  # Pourquoi 10256892 : principalement ceux des EXT_SRC et des STATS
display(data.isna().sum(axis=0))
display(data.isna().sum(axis=1))
display(data)

10256892

application
SK_ID_CURR                    0
TARGET                        0
NAME_CONTRACT_TYPE            0
CODE_GENDER                   0
FLAG_OWN_CAR                  0
                             ..
AMT_REQ_CREDIT_BUREAU_DAY     0
AMT_REQ_CREDIT_BUREAU_WEEK    0
AMT_REQ_CREDIT_BUREAU_MON     0
AMT_REQ_CREDIT_BUREAU_QRT     0
AMT_REQ_CREDIT_BUREAU_YEAR    0
Length: 122, dtype: int64

0          1
1          2
2         48
3         50
4         50
          ..
356250    50
356251    50
356252    20
356253    20
356254    48
Length: 355941, dtype: int64

application,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,...,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,...,0.0205,0.0193,0.0000,0.0000,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,...,0.0787,0.0558,0.0039,0.0100,reg oper account,block of flats,0.0714,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,...,,,,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,...,,,,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.028663,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,Core staff,1.0,2,2,THURSDAY,11,0,0,0,0,1,1,...,,,,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
356250,456221,-1,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,270000.0,Unaccompanied,Working,Secondary / secondary special,Widow,House / apartment,0.002042,-19970,-5169,-9094.0,-3399,,1,1,1,1,1,0,,1.0,3,3,WEDNESDAY,16,0,0,0,0,0,0,...,,,,,,,,,,1.0,0.0,1.0,0.0,-684.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
356251,456222,-1,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,495000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Married,House / apartment,0.035792,-11186,-1149,-3015.0,-3003,,1,1,0,1,0,0,Sales staff,4.0,2,2,MONDAY,11,0,0,0,0,1,1,...,,,,,,,,,,2.0,0.0,2.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
356252,456223,-1,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,315000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Married,House / apartment,0.026392,-15922,-3037,-2681.0,-1504,4.0,1,1,0,1,1,0,,3.0,2,2,WEDNESDAY,12,0,0,0,0,0,0,...,,0.1408,,0.0554,,block of flats,0.1663,"Stone, brick",No,0.0,0.0,0.0,0.0,-838.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,3.0,1.0
356253,456224,-1,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,450000.0,Family,Commercial associate,Higher education,Married,House / apartment,0.018850,-13968,-2731,-1461.0,-1364,,1,1,1,1,1,0,Managers,2.0,2,2,MONDAY,10,0,1,1,0,1,1,...,,0.1591,,0.1521,,block of flats,0.1974,Panel,No,0.0,0.0,0.0,0.0,-2308.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0


Retraitements autres que les NA

In [72]:
from home_credit.cols_map import _current_directory, _load_config_from_json
import home_credit.cols_map
import os
from home_credit.tables import Application

home_credit.cols_map._cols_map_config = _load_config_from_json(
    os.path.join(_current_directory, "cols_map.json")
)

print(Application.cols_group("ages"))


['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'DAYS_LAST_PHONE_CHANGE']


In [76]:
from pepper.utils import get_weekdays
from pepper.db_utils import cast_columns
# from home_credit.feat_eng import negate_numerical_data
from home_credit.tables import Application

# On inverse le signe des 5 variables `DAYS`
cols = Application.cols_group("ages")
data[cols] = -data[cols]

# Encodage du jour de la semaine en tant qu'entier indice partant de MONDAY=0
cols = "WEEKDAY_APPR_PROCESS_START"
to_replace = {d: i for i, d in enumerate(get_weekdays())}
data[cols] = data[cols].replace(to_replace)

# On bascule tous les FLAGS (y compris CODE_GENDER) en numérique
# Pour le CODE GENDER, M = 0, F = 1, XNA = 2
cols = "CODE_GENDER"
to_replace = {"M": 0, "F": 1, "XNA": 2}
data[cols] = data[cols].replace(to_replace)

# Pour les FLAGS, Y vaut 1 et N vaut 0
cols = Application.cols_group("ownership_flags")
to_replace = {"Y": 1, "N": 0}
data[cols] = data[cols].replace(to_replace)

Casts :

In [77]:
import numpy as np
from home_credit.tables import Application

# On attaque les casts, normalement rendus possibles en l'absence de NA
def cast_cols_group(data, group_name, dtype):
    cast_columns(data, Application.cols_group(group_name), dtype)

cast_cols_group(data, "target", np.uint8)
cast_cols_group(data, "keys", np.uint32)
cast_cols_group(data, "gender", np.uint8)
cast_cols_group(data, "financial_statement", np.float32)
cast_cols_group(data, "contact_flags", np.uint8)
cast_cols_group(data, "commute_flags", np.uint8)
cast_cols_group(data, "ownership_flags", np.uint8)

# TODO Permet d'expliciter ce type de groupe abstrait dans col_maps
flag_doc_cols = list(data.columns[data.columns.str.startswith("FLAG_DOCUMENT")])
cast_columns(data, flag_doc_cols, np.uint8)

cast_cols_group(data, "family_counts", np.uint8)
cast_cols_group(data, "process_start", np.uint8)
cast_cols_group(data, "region_ratings", np.uint8)

# days_cols = list(data.columns[data.columns.str.startswith("DAYS")])
cast_cols_group(data, "ages", np.uint16)

# cast_columns(data, ["OWN_CAR_AGE"], np.uint8)

# cnt_req_cols = list(data.columns[data.columns.str.startswith("AMT_REQ")])
cnt_req_cols = Application.cols_group("credit_bureau_request_counts")
cnt_req_cols.remove("AMT_REQ_CREDIT_BUREAU_QRT")
cast_columns(data, cnt_req_cols, np.int8)
cast_columns(data, "AMT_REQ_CREDIT_BUREAU_QRT", np.int16)

data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 355941 entries, 0 to 356254
Data columns (total 122 columns):
 #    Column                        Dtype  
---   ------                        -----  
 0    SK_ID_CURR                    uint32 
 1    TARGET                        uint8  
 2    NAME_CONTRACT_TYPE            object 
 3    CODE_GENDER                   uint8  
 4    FLAG_OWN_CAR                  uint8  
 5    FLAG_OWN_REALTY               uint8  
 6    CNT_CHILDREN                  uint8  
 7    AMT_INCOME_TOTAL              float32
 8    AMT_CREDIT                    float32
 9    AMT_ANNUITY                   float32
 10   AMT_GOODS_PRICE               float32
 11   NAME_TYPE_SUITE               object 
 12   NAME_INCOME_TYPE              object 
 13   NAME_EDUCATION_TYPE           object 
 14   NAME_FAMILY_STATUS            object 
 15   NAME_HOUSING_TYPE             object 
 16   REGION_POPULATION_RELATIVE    float64
 17   DAYS_BIRTH                    uint16 
 18   DA

### Fonctions intégrées

**C'est la version préliminaire, car on tend vers du tout object + toutes les règles de transformation chargées depuis des fichiers JSON.**

### **`get_clean_application`**

Extraction de la table nettoyée.

In [2]:
from home_credit.clean_up import get_clean_application

data = get_clean_application()
data.info()
display(data)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float32(4), float64(52), int16(1), int8(5), object(12), uint16(5), uint32(1), uint8(42)
memory usage: 200.8+ MB


application,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,0,0,1,0,202500.0,406597.5,24700.5,...,0,0,0,0,0,0,0,0,0,1
1,100003,0,Cash loans,1,0,0,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0,0,0,0,0,0
2,100004,0,Revolving loans,0,1,1,0,67500.0,135000.0,6750.0,...,0,0,0,0,0,0,0,0,0,0
3,100006,0,Cash loans,1,0,1,0,135000.0,312682.5,29686.5,...,0,0,0,0,-1,-1,-1,-1,-1,-1
4,100007,0,Cash loans,0,0,1,0,121500.0,513000.0,21865.5,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
356250,456221,255,Cash loans,1,0,1,0,121500.0,412560.0,17473.5,...,0,0,0,0,0,0,0,0,0,1
356251,456222,255,Cash loans,1,0,0,2,157500.0,622413.0,31909.5,...,0,0,0,0,-1,-1,-1,-1,-1,-1
356252,456223,255,Cash loans,1,1,1,1,202500.0,315000.0,33205.5,...,0,0,0,0,0,0,0,0,3,1
356253,456224,255,Cash loans,0,0,0,0,225000.0,450000.0,25128.0,...,0,0,0,0,0,0,0,0,0,2
