# **Module `home_credit.merge`**

✔ Review of **type hints** and **docstrings**.

This module provides a set of functions for merging and aggregating Home Credit data tables.
It enables the creation of consolidated datasets based on different criteria, including SK_ID_CURR,
MONTHS_BALANCE, and NUM_INSTALMENT_NUMBER. The module offers flexibility in defining aggregation
methods and supports the inclusion or exclusion of unique rows in the aggregated data.

Functions:
- `map_bur_to_curr(sk_id_bur: pd.Series) -> pd.Series`:
- `currentize(sk_id_bur: pd.Series) -> pd.Series`:
    - Map `SK_ID_BUREAU` values to their corresponding `SK_ID_CURR` values,
    currentizing the bureau table.
- `map_curr_to_target(sk_id_curr: pd.Series) -> pd.Series`:
- `targetize(sk_id_curr: pd.Series) -> pd.Series`:
    - Map `SK_ID_CURR` values to their corresponding `TARGET` values in the main table.
- `get_unique_and_multi_index(table_name: str, prev_sk: str, curr_sk: str) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]`:
    - Return four dataframes containing unique `SK_ID_PREV` and `SK_ID_CURR`
    combinations, split between those that appear only once and those that
    appear multiple times in the original data.
- `curr_prev_uniqueness_report(unique_prev_idx: pd.DataFrame, multi_prev_idx: pd.DataFrame, unique_curr_idx: pd.DataFrame, multi_curr_idx: pd.DataFrame) -> None`:
    - Display a report on the uniqueness of `SK_ID_CURR` and `SK_ID_PREV` in the data.
- `ip_months_balance_builder(data: pd.DataFrame) -> pd.Series`:
    - Build the `MONTHS_BALANCE` column based on the provided data.
- `_combine_grouped_data(data_uniques, grouped_multis, pivot_col)`:
    - Combine grouped dataframes of unique and multi-PREV rows based on a specified pivot column.
- `_groupby_curr_pivot(table_name: str, pivot_col: str, months_balance_builder: callable = None, agg_dict: dict = None, include_uniques: bool = False) -> pd.DataFrame`:
    - Group rows by a combination of `SK_ID_CURR` and the specified pivot column
    and aggregate data based on the provided table.
- `groupby_curr_months(table_name: str, months_balance_builder: callable = None, agg_dict: dict = None, include_uniques: bool = False) -> pd.DataFrame`:
    - Group rows by a combination of `SK_ID_CURR` and `MONTHS_BALANCE`
    and aggregate data based on the provided table.
- `groupby_curr_num(table_name: str, months_balance_builder: callable = None, agg_dict: dict = None, include_uniques: bool = False) -> pd.DataFrame`:
    - Group rows by a combination of `SK_ID_CURR` and `NUM_INSTALMENT_NUMBER`
    and aggregate data based on the provided table.

This module simplifies the process of merging and aggregating Home Credit data
for analysis, providing flexibility and options for creating consolidated
datasets for various analytical needs.
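As a rough illustration of the kind of aggregation described above, here is a minimal pandas sketch that groups rows by `SK_ID_CURR` and `MONTHS_BALANCE` and applies one aggregation method per column. The toy data, the `AMT` column and the contents of `agg_dict` are assumptions made for this example; this is not the module's actual implementation.

```python
import pandas as pd

# Toy stand-in for a monthly table that already carries SK_ID_CURR
# (all values are made up for illustration).
data = pd.DataFrame({
    "SK_ID_CURR":     [100001, 100001, 100001, 100002],
    "MONTHS_BALANCE": [-1, -1, -2, -1],
    "AMT":            [10.0, 30.0, 5.0, 7.0],   # hypothetical numeric column
    "STATUS":         ["C", "X", "C", "0"],
})

# Hypothetical aggregation dictionary: one aggregation method per column.
agg_dict = {"AMT": "sum", "STATUS": "first"}

# One output row per (SK_ID_CURR, MONTHS_BALANCE) pair.
grouped = data.groupby(["SK_ID_CURR", "MONTHS_BALANCE"]).agg(agg_dict)
print(grouped)
```

Grouping on `NUM_INSTALMENT_NUMBER` instead of `MONTHS_BALANCE` would follow the same pattern with a different second key.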

# **`map_bur_to_curr`**`(sk_id_bur)`

In [None]:
from home_credit.load import get_table
from home_credit.merge import map_bur_to_curr
data = get_table("bureau_balance").copy()
display(map_bur_to_curr(data.SK_ID_BUREAU))

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\bureau_balance.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\bureau.pqt


0           380361
1           380361
2           380361
3           380361
4           380361
             ...  
27299920    101874
27299921    101874
27299922    101874
27299923    101874
27299924    101874
Name: SK_ID_BUREAU, Length: 27299925, dtype: object

# **`currentize`**`(data)`

Used in:
- **`ea_bureau_balance.ipynb`**: distribution with respect to the target

This is just a necessary prerequisite: to targetize a table, it must first be currentized, because the two mappings are composed (`SK_ID_BUREAU` → `SK_ID_CURR` → `TARGET`).
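The composition of the two mappings can be sketched with plain pandas as follows. The lookup tables here are tiny made-up stand-ins for `bureau` and the application table, and the use of `Series.map` is an assumption for illustration; the library's own implementation may differ.

```python
import pandas as pd

# Toy lookup tables (made-up values).
# bureau maps SK_ID_BUREAU -> SK_ID_CURR; application maps SK_ID_CURR -> TARGET.
bureau = pd.DataFrame({"SK_ID_BUREAU": [11, 12, 13], "SK_ID_CURR": [1, 1, 2]})
application = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})

# Turn each lookup table into a Series indexed by its key.
bur_to_curr = bureau.set_index("SK_ID_BUREAU")["SK_ID_CURR"]
curr_to_target = application.set_index("SK_ID_CURR")["TARGET"]

# Toy stand-in for bureau_balance, which only knows SK_ID_BUREAU.
bureau_balance = pd.DataFrame({"SK_ID_BUREAU": [11, 11, 13, 12]})

# Step 1 ("currentize"): SK_ID_BUREAU -> SK_ID_CURR.
sk_id_curr = bureau_balance["SK_ID_BUREAU"].map(bur_to_curr)
# Step 2 ("targetize"): SK_ID_CURR -> TARGET.
target = sk_id_curr.map(curr_to_target)

print(pd.DataFrame({"SK_ID_CURR": sk_id_curr, "TARGET": target}))
```

The second step cannot run without the first, since `bureau_balance` has no `SK_ID_CURR` column of its own.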

In [None]:
from home_credit.load import get_table
from home_credit.merge import currentize
data = get_table("bureau_balance").copy()
currentize(data)
display(data)

RAW_BUREAU_BALANCE,SK_ID_CURR,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,380361,5715448,0,C
1,380361,5715448,-1,C
2,380361,5715448,-2,C
3,380361,5715448,-3,C
4,380361,5715448,-4,C
...,...,...,...,...
27299920,101874,5041336,-47,X
27299921,101874,5041336,-48,X
27299922,101874,5041336,-49,X
27299923,101874,5041336,-50,X


# **`map_curr_to_target`**`(sk_id_curr)`

In [6]:
from home_credit.load import get_table
from home_credit.merge import map_curr_to_target
data = get_table("bureau").copy()
display(map_curr_to_target(data.SK_ID_CURR))

0          0
1          0
2          0
3          0
4          0
          ..
1716423    1
1716424    0
1716425    0
1716426    0
1716427    0
Name: SK_ID_CURR, Length: 1716428, dtype: int64

# **`targetize`**`(data)`

Where is it used?
- `ea_bureau_balance.ipynb`: distribution with respect to the target
- `ea_bureau.ipynb`: distribution with respect to the target
- `ea_previous_application.ipynb`: correlation with the target
- `nb_macros.py`: in `get_labeled_datablock`, a complementary version of the `get_datablock` utility which ensures that any set of variables, from any table, can be obtained together with its class label.

The most complete description, however, is in `ea_bureau_balance.ipynb` > "Adjonction de `TARGET` à chacune des tables périphériques" (adding `TARGET` to each of the peripheral tables). **Note:** there is one additional function that has not been moved into the library: `adj_target_and_report(table_name)`, which is a macro for the notebooks. A small two-line `adj_target` function, on the other hand, would be a useful convenience.

It also appears in `_benchmark.py`, because it is typically a critical function that I tried to optimize, and for which I benchmarked several alternative implementations against each other.

In [1]:
from home_credit.load import get_table
from home_credit.merge import currentize, targetize
data = get_table("bureau_balance").copy()
currentize(data)
targetize(data)
display(data)

load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\bureau_balance.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\bureau.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_train.pqt
load C:/Users/franc/Projects/pepper_credit_scoring_tool\dataset\pqt\application_test.pqt


RAW_BUREAU_BALANCE,TARGET,SK_ID_CURR,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,0,380361,5715448,0,C
1,0,380361,5715448,-1,C
2,0,380361,5715448,-2,C
3,0,380361,5715448,-3,C
4,0,380361,5715448,-4,C
...,...,...,...,...,...
27299920,1,101874,5041336,-47,X
27299921,1,101874,5041336,-48,X
27299922,1,101874,5041336,-49,X
27299923,1,101874,5041336,-50,X


# **`_get_unique_and_multi_index`**`(table, subs_sk, main_sk)`

# **`get_unique_and_multi_index`**`(table_name, prev_sk, curr_sk)`
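The function list at the top describes this function as splitting `SK_ID_PREV` and `SK_ID_CURR` keys between those that appear only once and those that appear several times. The sketch below shows one possible reading of that split on a toy table. The helper `split_unique_multi` is hypothetical, and the real function takes a table name and returns index dataframes whose exact shape is not shown here.

```python
import pandas as pd

# Toy table with made-up keys.
table = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 11, 12],
    "SK_ID_CURR": [1, 1, 1, 2],
})

def split_unique_multi(df: pd.DataFrame, key: str):
    """Hypothetical helper: split rows by whether their `key` value
    occurs exactly once or several times in `df`."""
    # duplicated(keep=False) flags every occurrence of a repeated value.
    is_multi = df[key].duplicated(keep=False)
    return df[~is_multi], df[is_multi]

unique_prev, multi_prev = split_unique_multi(table, "SK_ID_PREV")
unique_curr, multi_curr = split_unique_multi(table, "SK_ID_CURR")

print(len(unique_prev), len(multi_prev), len(unique_curr), len(multi_curr))
```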

# **`curr_prev_uniqueness_report`**`(unique_prev_idx, multi_prev_idx, unique_curr_idx, multi_curr_idx)`

# **`_groupby_curr_pivot`**`(table_name, pivot_col, months_balance_builder, agg_dict, include_uniques)`


# **`groupby_curr_months`**`(table_name, months_balance_builder, agg_dict, include_uniques)`

# **`groupby_curr_num`**`(table_name, months_balance_builder, agg_dict, include_uniques)`