# Initial EDA and Aggregation process

This notebook runs an initial EDA and aggregates the following files to the case_id level:
- train_debitcard
- train_deposit
- train_other
- train_person_1
- train_person_2

**Columns**

Special columns:

- case_id - This is the unique identifier for each credit case. You'll need this ID to join relevant tables to the base table.
- date_decision - This refers to the date when a decision was made regarding the approval of the loan.
- WEEK_NUM - This is the week number used for aggregation. In the test sample, WEEK_NUM continues sequentially from the last training value of WEEK_NUM.
- MONTH - This column represents the month and is intended for aggregation purposes.
- target - This is the target value, determined after a certain period based on whether or not the client defaulted on the specific credit case (loan).
- num_group1 - This is an indexing column used for the historical records of case_id in both depth=1 and depth=2 tables.
- num_group2 - This is the second indexing column for depth=2 tables' historical records of case_id. The order of num_group1 and num_group2 is important and will be clarified in feature definitions.

All other raw columns in the tables serve as predictors. Their definitions can be found in the file feature_definitions.csv. For depth=0 tables, predictors can be used directly as features. For tables with depth>0, however, you may need aggregation functions that condense the historical records associated with each case_id into a single feature. When num_group1 or num_group2 stands for a person index (the predictor definitions make this clear), the zero index has a special meaning: num_groupN=0 is the applicant (the person who applied for the loan).

**Table depth**

depth=1 - Each case_id has an associated historical record, indexed by num_group1.

depth=2 - Each case_id has an associated historical record, indexed by both num_group1 and num_group2.
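
The difference between the two depths can be checked on a toy frame. The sketch below mirrors the index pattern of case_id 6 in train_person_2 and assumes that the index columns uniquely identify a record, which the description above implies but does not state outright.

```python
import pandas as pd

# Toy depth=2 table (hypothetical): one case with two persons and two records each.
depth2 = pd.DataFrame({
    "case_id":    [6, 6, 6, 6],
    "num_group1": [0, 0, 1, 1],
    "num_group2": [0, 1, 0, 1],
})

# The full key (case_id, num_group1, num_group2) should not repeat at depth=2 ...
full_key_unique = not depth2.duplicated(["case_id", "num_group1", "num_group2"]).any()
# ... while (case_id, num_group1) alone does repeat.
partial_key_unique = not depth2.duplicated(["case_id", "num_group1"]).any()
print(full_key_unique, partial_key_unique)
```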

In [2]:
import polars as pl
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
#from src.utils.merge import group_file_data

## Load the data

In [3]:
# Replace this definition with the import from src.utils.merge

def group_file_data(
    df: pl.DataFrame, 
    num_cols: list[str] = [], 
    date_cols: list[str] = [], 
    cat_cols: list[str] = []
) -> pl.DataFrame:
    '''
    Function to group numerical, date, and categorical columns

    Parameters:
    -----------
    df : Polars DataFrame
    num_cols : List of numerical column names (remember to drop num_group columns)
    date_cols : List of date column names
    cat_cols : List of categorical column names (becomes dummies)
    '''
    
    # Convert date columns
    df_date = df[['case_id'] + date_cols].with_columns([ pl.col(col).str.to_date() for col in date_cols ])

    # One-hot categories
    df_dummies = df[['case_id'] + cat_cols].to_dummies(cat_cols)

    # Num DataFrame
    df_num = df[['case_id'] + num_cols]

    # Date aggs
    date_aggs = [ pl.min(col).name.suffix('_min') for col in date_cols ] +\
                [ pl.max(col).name.suffix('_max') for col in date_cols ] +\
                [ pl.n_unique(col).name.suffix('_distinct') for col in date_cols]
    df_date_grouped = df_date.group_by('case_id').agg(date_aggs)

    # One-hot aggs
    dummy_cols = [ col for col in df_dummies.columns if col != 'case_id']
    dummies_aggs = [ pl.sum(col).name.suffix('_sum') for col in dummy_cols ]
    df_dummies_grouped = df_dummies.group_by('case_id').agg(dummies_aggs)

    # Numerical aggs
    num_aggs = [ pl.min(col).name.suffix('_min') for col in num_cols ] +\
            [ pl.max(col).name.suffix('_max') for col in num_cols ] +\
            [ pl.mean(col).name.suffix('_mean') for col in num_cols ] +\
            [ pl.median(col).name.suffix('_median') for col in num_cols ] +\
            [ pl.sum(col).name.suffix('_sum') for col in num_cols ]
    df_num_grouped = df_num.group_by('case_id').agg(num_aggs)

    # Join DataFrames
    df_joined = df_num_grouped.join(df_date_grouped, on='case_id')
    df_joined = df_joined.join(df_dummies_grouped, on='case_id')

    return df_joined

In [4]:
dataPath = 'C:/Users/laura/OneDrive/Documentos/Personal Documents/Universidad/DSE CCNY/Courses Semester 2/Applied ML/Project_final/home-credit-credit-risk-model-stability/'

In [5]:
def set_table_dtypes(df: pl.DataFrame) -> pl.DataFrame:
    # implement here all desired dtypes for tables
    # the following is just an example
    for col in df.columns:
        # last letter of column name will help you determine the type
        if col[-1] in ("P", "A"):
            df = df.with_columns(pl.col(col).cast(pl.Float64).alias(col))

    return df

def convert_strings(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.columns:  
        if df[col].dtype.name in ['object', 'string']:
            df[col] = df[col].astype("string").astype('category')
            current_categories = df[col].cat.categories
            new_categories = current_categories.to_list() + ["Unknown"]
            new_dtype = pd.CategoricalDtype(categories=new_categories, ordered=True)
            df[col] = df[col].astype(new_dtype)
    return df

In [8]:
train_basetable = pl.read_csv(dataPath + "csv_files/train/train_base.csv")

In [6]:
#  depth=1
train_debitcard = pl.read_csv(dataPath + "csv_files/train/train_debitcard_1.csv").pipe(set_table_dtypes)
#  depth=1
train_deposit = pl.read_csv(dataPath + "csv_files/train/train_deposit_1.csv").pipe(set_table_dtypes)
#  depth=1
train_other = pl.read_csv(dataPath + "csv_files/train/train_other_1.csv").pipe(set_table_dtypes)
#  depth=1
train_person_1 = pl.read_csv(dataPath + "csv_files/train/train_person_1.csv").pipe(set_table_dtypes)
#  depth=2
train_person_2 = pl.read_csv(dataPath + "csv_files/train/train_person_2.csv").pipe(set_table_dtypes)

In [88]:
# Convert Polars DataFrame to pandas DataFrame for the initial EDA checks only
train_debitcard_pd = train_debitcard.to_pandas()
train_deposit_pd = train_deposit.to_pandas()
train_other_pd = train_other.to_pandas()
train_person_1_pd = train_person_1.to_pandas()
train_person_2_pd = train_person_2.to_pandas()

## Debit Card

In [111]:
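def initial_check(pandas_df):
  # Quick overview of a table: first rows, shape, column names, dtypes, and summary statistics
  display(pandas_df.head())
  print('shape: ', pandas_df.shape)
  print('columns: ', pandas_df.columns)
  # info() prints its report directly and returns None, so this line ends with "info:  None"
  print('info: ', pandas_df.info())
  print('describe: ', pandas_df.describe())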

In [72]:
initial_check(train_debitcard_pd)

Unnamed: 0,case_id,last180dayaveragebalance_704A,last180dayturnover_1134A,last30dayturnover_651A,num_group1,openingdate_857D
0,225,,,,0,2016-08-16
1,331,,,,0,2015-03-19
2,358,,,,0,2014-09-02
3,390,,,,0,2014-07-23
4,390,,,,2,2016-06-08


shape:  (157302, 6)
columns:  Index(['case_id', 'last180dayaveragebalance_704A', 'last180dayturnover_1134A',
       'last30dayturnover_651A', 'num_group1', 'openingdate_857D'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157302 entries, 0 to 157301
Data columns (total 6 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   case_id                        157302 non-null  int64  
 1   last180dayaveragebalance_704A  12216 non-null   float64
 2   last180dayturnover_1134A       11081 non-null   float64
 3   last30dayturnover_651A         11081 non-null   float64
 4   num_group1                     157302 non-null  int64  
 5   openingdate_857D               144591 non-null  object 
dtypes: float64(3), int64(2), object(1)
memory usage: 7.2+ MB
info:  None
describe:              case_id  last180dayaveragebalance_704A  last180dayturnover_1134A  \
count  1.573020e+05                   

last180dayaveragebalance_704A: Average balance on debit card in the last 180 days.

last180dayturnover_1134A: Debit card's turnover within the last 180 days.

last30dayturnover_651A: Debit card turnover for the last 30 days.

openingdate_857D: Debit card opening date.

Check whether a single case_id can have multiple num_group1 values.

In [102]:
# Check if for the same case_id we can have multiple num_group1 
# Group by 'case_id' and count unique 'num_group1' values
unique_counts = train_debitcard_pd.groupby('case_id')['num_group1'].nunique()

# Check if any case_id has more than one unique 'num_group1' value
multiple_num_group1 = unique_counts[unique_counts > 1]

if not multiple_num_group1.empty:
    print("Some case_ids have multiple num_group1 values:")
    print(multiple_num_group1)
else:
    print("Each case_id has a unique num_group1 value.")

Some case_ids have multiple num_group1 values:
case_id
390        3
445        5
453        3
731        3
739        2
          ..
2703401    2
2703407    2
2703416    2
2703430    9
2703453    2
Name: num_group1, Length: 29309, dtype: int64


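We checked the null values in the debit card table. Only about 8% of the rows (12,216 of 157,302) have a value for last180dayaveragebalance_704A and about 7% (11,081) for the two turnover columns, while openingdate_857D is filled in about 92% of the rows (144,591).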
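
The fill rates can be computed directly instead of being read off the info() output. A minimal sketch on a toy frame with hypothetical values standing in for train_debitcard_pd:

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for train_debitcard_pd.
toy = pd.DataFrame({
    "case_id": [225, 331, 358, 390],
    "last30dayturnover_651A": [None, None, None, 100.0],
    "openingdate_857D": ["2016-08-16", "2015-03-19", None, "2014-07-23"],
})

# Share of rows that hold a value, per column.
filled_share = toy.notna().mean()
print(filled_share)
```

Applied to the real table, the same expression gives the share of populated rows per column in one step.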

In [106]:
initial_check(train_deposit_pd)

Unnamed: 0,case_id,amount_416A,contractenddate_991D,num_group1,openingdate_313D
0,225,0.0,,0,2016-08-16
1,331,260.374,2018-03-18,0,2015-03-19
2,358,0.0,,0,2014-09-02
3,390,211748.53,2017-07-22,0,2014-07-23
4,390,223.68001,,2,2016-06-08


shape:  (145086, 5)
columns:  Index(['case_id', 'amount_416A', 'contractenddate_991D', 'num_group1',
       'openingdate_313D'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145086 entries, 0 to 145085
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   case_id               145086 non-null  int64  
 1   amount_416A           145086 non-null  float64
 2   contractenddate_991D  65404 non-null   object 
 3   num_group1            145086 non-null  int64  
 4   openingdate_313D      145086 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 5.5+ MB
info:  None
describe:              case_id   amount_416A     num_group1
count  1.450860e+05  1.450860e+05  145086.000000
mean   1.466214e+06  8.422304e+03       0.522531
std    8.865290e+05  8.623212e+04       1.620954
min    2.250000e+02 -4.000000e+04       0.000000
25%    6.600410e+05  0.000000e+00       0.000

In [107]:
initial_check(train_other_pd)

Unnamed: 0,case_id,amtdebitincoming_4809443A,amtdebitoutgoing_4809440A,amtdepositbalance_4809441A,amtdepositincoming_4809444A,amtdepositoutgoing_4809442A,num_group1
0,43801,12466.601,12291.2,914.2,0.0,304.80002,0
1,43991,3333.4001,3273.4001,0.0,0.0,0.0,0
2,44001,10000.0,10000.0,0.0,0.0,0.0,0
3,44053,0.0,0.0,2586.4001,0.0,88.8,0
4,44130,63.8,60.8,0.0,0.0,0.0,0


shape:  (51109, 7)
columns:  Index(['case_id', 'amtdebitincoming_4809443A', 'amtdebitoutgoing_4809440A',
       'amtdepositbalance_4809441A', 'amtdepositincoming_4809444A',
       'amtdepositoutgoing_4809442A', 'num_group1'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51109 entries, 0 to 51108
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   case_id                      51109 non-null  int64  
 1   amtdebitincoming_4809443A    51109 non-null  float64
 2   amtdebitoutgoing_4809440A    51109 non-null  float64
 3   amtdepositbalance_4809441A   51109 non-null  float64
 4   amtdepositincoming_4809444A  51109 non-null  float64
 5   amtdepositoutgoing_4809442A  51109 non-null  float64
 6   num_group1                   51109 non-null  int64  
dtypes: float64(5), int64(2)
memory usage: 2.7 MB
info:  None
describe:              case_id  amtdebitincoming_4809443A  a

In [108]:
initial_check(train_person_1_pd)

Unnamed: 0,case_id,birth_259D,birthdate_87D,childnum_185L,contaddr_district_15M,contaddr_matchlist_1032L,contaddr_smempladdr_334L,contaddr_zipcode_807M,education_927M,empl_employedfrom_271D,...,registaddr_district_1083M,registaddr_zipcode_184M,relationshiptoclient_415T,relationshiptoclient_642T,remitter_829L,role_1084L,role_993L,safeguarantyflag_411L,sex_738L,type_25L
0,0,1986-07-01,,,P88_18_84,False,False,P167_100_165,P97_36_170,2017-09-15,...,P88_18_84,P167_100_165,,,,CL,,True,F,PRIMARY_MOBILE
1,0,,,,a55475b1,,,a55475b1,a55475b1,,...,a55475b1,a55475b1,SPOUSE,,False,EM,,,,PHONE
2,0,,,,a55475b1,,,a55475b1,a55475b1,,...,a55475b1,a55475b1,COLLEAGUE,SPOUSE,False,PE,,,,PHONE
3,0,,,,a55475b1,,,a55475b1,a55475b1,,...,a55475b1,a55475b1,,COLLEAGUE,,PE,,,,PHONE
4,1,1957-08-01,,,P103_93_94,False,False,P176_37_166,P97_36_170,2008-10-29,...,P103_93_94,P176_37_166,,,,CL,,True,M,PRIMARY_MOBILE


shape:  (2973991, 37)
columns:  Index(['case_id', 'birth_259D', 'birthdate_87D', 'childnum_185L',
       'contaddr_district_15M', 'contaddr_matchlist_1032L',
       'contaddr_smempladdr_334L', 'contaddr_zipcode_807M', 'education_927M',
       'empl_employedfrom_271D', 'empl_employedtotal_800L',
       'empl_industry_691L', 'empladdr_district_926M', 'empladdr_zipcode_114M',
       'familystate_447L', 'gender_992L', 'housetype_905L', 'housingtype_772L',
       'incometype_1044T', 'isreference_387L', 'language1_981M',
       'mainoccupationinc_384A', 'maritalst_703L', 'num_group1',
       'personindex_1023L', 'persontype_1072L', 'persontype_792L',
       'registaddr_district_1083M', 'registaddr_zipcode_184M',
       'relationshiptoclient_415T', 'relationshiptoclient_642T',
       'remitter_829L', 'role_1084L', 'role_993L', 'safeguarantyflag_411L',
       'sex_738L', 'type_25L'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2973991 entries, 0 to 2973990
Data colu

In [109]:
initial_check(train_person_2_pd)

Unnamed: 0,case_id,addres_district_368M,addres_role_871L,addres_zip_823M,conts_role_79M,empls_economicalst_849M,empls_employedfrom_796D,empls_employer_name_740M,num_group1,num_group2,relatedpersons_role_762T
0,5,a55475b1,,a55475b1,a55475b1,a55475b1,,a55475b1,0,0,
1,6,P55_110_32,CONTACT,P10_68_40,P38_92_157,P164_110_33,,a55475b1,0,0,
2,6,P55_110_32,PERMANENT,P10_68_40,a55475b1,a55475b1,,a55475b1,0,1,
3,6,P204_92_178,CONTACT,P65_136_169,P38_92_157,P164_110_33,,a55475b1,1,0,OTHER_RELATIVE
4,6,P191_109_75,CONTACT,P10_68_40,P7_147_157,a55475b1,,a55475b1,1,1,OTHER_RELATIVE


shape:  (1643410, 11)
columns:  Index(['case_id', 'addres_district_368M', 'addres_role_871L',
       'addres_zip_823M', 'conts_role_79M', 'empls_economicalst_849M',
       'empls_employedfrom_796D', 'empls_employer_name_740M', 'num_group1',
       'num_group2', 'relatedpersons_role_762T'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1643410 entries, 0 to 1643409
Data columns (total 11 columns):
 #   Column                    Non-Null Count    Dtype 
---  ------                    --------------    ----- 
 0   case_id                   1643410 non-null  int64 
 1   addres_district_368M      1643410 non-null  object
 2   addres_role_871L          67674 non-null    object
 3   addres_zip_823M           1643410 non-null  object
 4   conts_role_79M            1643410 non-null  object
 5   empls_economicalst_849M   1643410 non-null  object
 6   empls_employedfrom_796D   5757 non-null     object
 7   empls_employer_name_740M  1643410 non-null  object
 8   num_group

## Aggregation and Merge to the Base table

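The dictionaries below summarize the aggregations planned for each column type. They are written in pandas style as a reference; the aggregation itself is performed by group_file_data, which sums the one-hot encoded categories and does not compute the mode.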

In [None]:
# Aggregation functions for numerical columns
numerical_agg_funcs = {
    'min': 'min',
    'max': 'max',
    'mean': 'mean',
    'median': 'median',
    'sum': 'sum'
}

# Aggregation functions for categorical columns
categorical_agg_funcs = {
    'mode': lambda x: x.mode().iloc[0] if not x.mode().empty else None,  # Mode (None when the group is all null)
    'one_hot_encoding': lambda x: x.sum()  # One-hot encoding with sum of counts
}

# Aggregation functions for date columns
date_agg_funcs = {
    'min': 'min',
    'max': 'max',
    'distinct_count': 'nunique'  # Count of distinct values
}
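
One way such a dictionary could be applied in pandas is through named aggregation, where each key becomes an output column. This is a sketch on hypothetical values, not the path the notebook takes; the notebook aggregates with polars.

```python
import pandas as pd

numerical_agg_funcs = {
    "min": "min", "max": "max", "mean": "mean", "median": "median", "sum": "sum",
}

# Toy deposit-like table (hypothetical values).
toy = pd.DataFrame({
    "case_id": [1, 1, 2],
    "amount_416A": [0.0, 268.0, 276.0],
})

# Named aggregation: keyword = output column name, value = aggregation function.
agg = toy.groupby("case_id")["amount_416A"].agg(**numerical_agg_funcs)
# Prefix with the source column so names match the <column>_<agg> pattern.
agg = agg.add_prefix("amount_416A_")
print(agg)
```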

The aggregation uses group_file_data from src/utils/merge.py, redefined at the top of this notebook.

### Debitcard Aggregation

In [7]:
df = train_debitcard
# Date columns
date_cols = [ col for col, dtype in zip(df.columns, df.dtypes) if 'dat' in col and dtype == pl.String ]

# Categorical columns
cat_cols = [ df.columns[i] for i in range(len(df.columns)) if (df.columns[i] not in date_cols) and (df.dtypes[i] == pl.String) ]

# Numerical columns
ignore_cols = ['case_id', 'num_group1', 'num_group2']
num_cols = [ 
    df.columns[i] for i in range(len(df.columns)) 
    if (df.columns[i] not in date_cols) and (df.columns[i] not in cat_cols) and (df.columns[i] not in ignore_cols)
]

# Group data
df_debitcard_agg = group_file_data(df, num_cols, date_cols, cat_cols)

In [117]:
df_debitcard_agg.head()

df_debitcard_agg.write_parquet('../data/train_debitcard_grouped.parquet')

case_id,last180dayaveragebalance_704A_min,last180dayturnover_1134A_min,last30dayturnover_651A_min,last180dayaveragebalance_704A_max,last180dayturnover_1134A_max,last30dayturnover_651A_max,last180dayaveragebalance_704A_mean,last180dayturnover_1134A_mean,last30dayturnover_651A_mean,last180dayaveragebalance_704A_median,last180dayturnover_1134A_median,last30dayturnover_651A_median,last180dayaveragebalance_704A_sum,last180dayturnover_1134A_sum,last30dayturnover_651A_sum,openingdate_857D_min,openingdate_857D_max,openingdate_857D_distinct
i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,date,date,u32
1935805,,,,,,,,,,,,,0.0,0.0,0.0,2016-04-22,2016-06-24,2
1377171,,,,,,,,,,,,,0.0,0.0,0.0,2016-10-19,2016-10-19,1
1697247,,,,,,,,,,,,,0.0,0.0,0.0,2014-11-20,2015-09-23,2
1773019,,,,,,,,,,,,,0.0,0.0,0.0,2016-01-02,2016-01-02,1
2556608,0.0,100000.0,100000.0,0.0,100000.0,100000.0,0.0,100000.0,100000.0,0.0,100000.0,100000.0,0.0,100000.0,100000.0,,,1


### Deposit Aggregation

In [8]:
df = train_deposit
# Date columns
date_cols = [ col for col, dtype in zip(df.columns, df.dtypes) if 'dat' in col and dtype == pl.String ]

# Categorical columns
cat_cols = [ df.columns[i] for i in range(len(df.columns)) if (df.columns[i] not in date_cols) and (df.dtypes[i] == pl.String) ]

# Numerical columns
ignore_cols = ['case_id', 'num_group1', 'num_group2']
num_cols = [ 
    df.columns[i] for i in range(len(df.columns)) 
    if (df.columns[i] not in date_cols) and (df.columns[i] not in cat_cols) and (df.columns[i] not in ignore_cols)
]

# Group data
df_deposit_agg = group_file_data(df, num_cols, date_cols, cat_cols)

In [119]:
df_deposit_agg.head()

df_deposit_agg.write_parquet('../data/train_deposit_grouped.parquet')

case_id,amount_416A_min,amount_416A_max,amount_416A_mean,amount_416A_median,amount_416A_sum,contractenddate_991D_min,openingdate_313D_min,contractenddate_991D_max,openingdate_313D_max,contractenddate_991D_distinct,openingdate_313D_distinct
i64,f64,f64,f64,f64,f64,date,date,date,date,u32,u32
1931637,0.0,0.0,0.0,0.0,0.0,,2015-05-01,,2015-05-01,1,1
2647189,202.73601,202.73601,202.73601,202.73601,202.73601,,2017-06-23,,2017-06-23,1,1
2538009,0.0,268.676,134.338,134.338,268.676,2017-07-16,2014-07-17,2017-07-16,2016-03-07,2,2
1308071,276.402,276.402,276.402,276.402,276.402,2017-12-04,2013-12-05,2017-12-04,2013-12-05,1,1
1349610,271.78403,271.78403,271.78403,271.78403,271.78403,2018-12-22,2014-06-23,2018-12-22,2014-06-23,1,1


### Other Aggregation

In [9]:
df = train_other
# Date columns
date_cols = [ col for col, dtype in zip(df.columns, df.dtypes) if 'dat' in col and dtype == pl.String ]

# Categorical columns
cat_cols = [ df.columns[i] for i in range(len(df.columns)) if (df.columns[i] not in date_cols) and (df.dtypes[i] == pl.String) ]

# Numerical columns
ignore_cols = ['case_id', 'num_group1', 'num_group2']
num_cols = [ 
    df.columns[i] for i in range(len(df.columns)) 
    if (df.columns[i] not in date_cols) and (df.columns[i] not in cat_cols) and (df.columns[i] not in ignore_cols)
]

# Group data
df_other_agg = group_file_data(df, num_cols, date_cols, cat_cols)

In [121]:
df_other_agg.head()

df_other_agg.write_parquet('../data/train_other_grouped.parquet')

case_id,amtdebitincoming_4809443A_min,amtdebitoutgoing_4809440A_min,amtdepositbalance_4809441A_min,amtdepositincoming_4809444A_min,amtdepositoutgoing_4809442A_min,amtdebitincoming_4809443A_max,amtdebitoutgoing_4809440A_max,amtdepositbalance_4809441A_max,amtdepositincoming_4809444A_max,amtdepositoutgoing_4809442A_max,amtdebitincoming_4809443A_mean,amtdebitoutgoing_4809440A_mean,amtdepositbalance_4809441A_mean,amtdepositincoming_4809444A_mean,amtdepositoutgoing_4809442A_mean,amtdebitincoming_4809443A_median,amtdebitoutgoing_4809440A_median,amtdepositbalance_4809441A_median,amtdepositincoming_4809444A_median,amtdepositoutgoing_4809442A_median,amtdebitincoming_4809443A_sum,amtdebitoutgoing_4809440A_sum,amtdepositbalance_4809441A_sum,amtdepositincoming_4809444A_sum,amtdepositoutgoing_4809442A_sum
i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
196254,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,3.0
199255,2714.4001,2502.6,0.0,0.0,0.0,2714.4001,2502.6,0.0,0.0,0.0,2714.4001,2502.6,0.0,0.0,0.0,2714.4001,2502.6,0.0,0.0,0.0,2714.4001,2502.6,0.0,0.0,0.0
1794488,0.0,0.0,47087.0,0.0,386.0,0.0,0.0,47087.0,0.0,386.0,0.0,0.0,47087.0,0.0,386.0,0.0,0.0,47087.0,0.0,386.0,0.0,0.0,47087.0,0.0,386.0
2678909,0.0,0.0,0.0,0.0,2.6000001,0.0,0.0,0.0,0.0,2.6000001,0.0,0.0,0.0,0.0,2.6000001,0.0,0.0,0.0,0.0,2.6000001,0.0,0.0,0.0,0.0,2.6000001
2696713,0.0,0.0,0.0,0.0,431.0,0.0,0.0,0.0,0.0,431.0,0.0,0.0,0.0,0.0,431.0,0.0,0.0,0.0,0.0,431.0,0.0,0.0,0.0,0.0,431.0


### Person 1 Aggregation

According to the documentation, when num_group1 or num_group2 stands for a person index (the predictor definitions make this clear), the zero index has a special meaning: num_groupN=0 is the applicant (the person who applied for the loan). We therefore keep only the rows with num_group1 = 0, so that information about several people is not aggregated into the same case_id.

In [10]:
# Here num_group1=0 has special meaning, it is the person who applied for the loan.
train_person_1_filtered = train_person_1.filter(
    pl.col("num_group1") == 0
).drop("num_group1")

In [11]:
df = train_person_1_filtered
# Date columns: use the "D" suffix, since names such as birth_259D and empl_employedfrom_271D do not contain 'dat'
date_cols = [ col for col, dtype in zip(df.columns, df.dtypes) if col.endswith('D') and dtype == pl.String ]

# Categorical columns
cat_cols = [ df.columns[i] for i in range(len(df.columns)) if (df.columns[i] not in date_cols) and (df.dtypes[i] == pl.String) ]

# Numerical columns
ignore_cols = ['case_id', 'num_group1', 'num_group2']
num_cols = [ 
    df.columns[i] for i in range(len(df.columns)) 
    if (df.columns[i] not in date_cols) and (df.columns[i] not in cat_cols) and (df.columns[i] not in ignore_cols)
]

# Group data
df_person_1_agg = group_file_data(df, num_cols, date_cols, cat_cols)

In [None]:
df_person_1_agg.head()

df_person_1_agg.write_parquet('../data/train_person_1_grouped.parquet')

### Person 2 Aggregation 

In [None]:
# Here num_group1=0 has special meaning, it is the person who applied for the loan.
train_person_2_filtered = train_person_2.filter(
    pl.col("num_group1") == 0
).drop("num_group1")

In [6]:
df = train_person_2_filtered
# Date columns: use the "D" suffix, since empls_employedfrom_796D does not contain 'dat'
date_cols = [ col for col, dtype in zip(df.columns, df.dtypes) if col.endswith('D') and dtype == pl.String ]

# Categorical columns
cat_cols = [ df.columns[i] for i in range(len(df.columns)) if (df.columns[i] not in date_cols) and (df.dtypes[i] == pl.String) ]

# Numerical columns
ignore_cols = ['case_id', 'num_group1', 'num_group2']
num_cols = [ 
    df.columns[i] for i in range(len(df.columns)) 
    if (df.columns[i] not in date_cols) and (df.columns[i] not in cat_cols) and (df.columns[i] not in ignore_cols)
]

# Group data
df_person_2_agg = group_file_data(df, num_cols, date_cols, cat_cols)

In [9]:
df_person_2_agg.head()

df_person_2_agg.write_parquet('../data/train_person_2_grouped.parquet')


## Joining all the tables with the base table

In [None]:
# Join all tables together.
data = train_basetable.join(
    df_debitcard_agg, how="left", on="case_id"
).join(
    df_deposit_agg, how="left", on="case_id"
).join(
    df_other_agg, how="left", on="case_id"
).join(
    df_person_1_agg, how="left", on="case_id"
).join(
    df_person_2_agg, how="left", on="case_id"
)