# Categorical Features Preprocessing

The dataset contains categorical features, some of which have high cardinality and a large share of missing values. The challenge is to encode these features without the explosion in column count that traditional one-hot encoding causes (the curse of dimensionality).


Several alternative encodings avoid the high dimensionality that one-hot encoding produces:

- Target Encoding with Smoothing: Replace each category with a statistic of the target variable (e.g., the mean target value per category). This gives the categories an ordering that is meaningful with respect to the target, which can improve model performance. However, it is prone to overfitting unless it is regularized (smoothing helps mitigate this), and in this dataset we do not yet understand the relationship between each category and the target.
- Frequency or Count Encoding: Replace categories with their frequency or count within the dataset.
- Grouping/Rare Encoding: Group infrequent categories into a single 'Other' category to reduce dimensionality. It helps in handling overfitting by diminishing the impact of rare categories. However, it potentially loses valuable information by aggregating distinct categories into a single 'Other' group. Moreover, the choice of threshold for grouping can be arbitrary and may require domain knowledge or experimentation.
- Binary Encoding or Hashing: Both condense a categorical feature into a more compact representation. Binary encoding converts each category to an integer code and allocates one column for each binary digit of that code. Hashing maps each category to one of a fixed number of buckets with a hash function.
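
To make the smoothing idea in the target-encoding bullet concrete, here is a minimal sketch in plain Python. It uses one common form of smoothing, a weighted average of the per-category mean and the global mean; the function name and the weight `m` are hypothetical choices for illustration and are not part of this notebook's pipeline.

```python
from collections import defaultdict


def smoothed_target_encoding(categories, targets, m=10.0):
    """Map each category to a smoothed mean of the target.

    encoded(c) = (n_c * mean_c + m * global_mean) / (n_c + m)

    `m` is a hypothetical smoothing weight: larger values pull
    rare categories harder toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # n_c * mean_c is simply the per-category sum of the target
    return {
        cat: (sums[cat] + m * global_mean) / (counts[cat] + m)
        for cat in counts
    }


# Toy data: "B" appears once, so its raw mean (1.0) is unreliable
cats = ["A", "A", "A", "A", "B"]
ys = [1, 1, 0, 1, 1]
encoding = smoothed_target_encoding(cats, ys, m=2.0)
print(encoding)
```

With `m=2.0`, the single-observation category `"B"` is pulled from its raw mean of 1.0 toward the global mean of 0.8, which is the regularizing effect the bullet refers to.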


We try the following strategy:

## Frequency Encoding and Binary Encoding

- Frequency or Count Encoding: This method represents categories based on their frequency or count in the dataset, implicitly capturing the importance of each category in the data distribution.

- Binary Encoding: Each category is converted to an integer code, and that code is written in binary with one column per bit. A feature with n categories then needs roughly log2(n) columns (rounded up) instead of the n columns of one-hot encoding, while every category still receives a distinct code.

Why this combination? The pairing is a pragmatic way to handle high-cardinality categorical features. Frequency (or count) encoding adds a single, simple column per feature that reflects how common each category is, while binary encoding keeps categories distinguishable with few columns, which suits large datasets and limited computational resources.
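
The size of the saving can be checked with a little arithmetic. The sketch below compares the number of columns produced by one-hot and binary encoding for a few cardinalities; the helper names are hypothetical, and the binary width follows the same rule the notebook uses later (the bit length of the largest integer code).

```python
def one_hot_width(n_categories: int) -> int:
    # One-hot encoding allocates one column per category
    return n_categories


def binary_width(n_categories: int) -> int:
    # Integer codes run from 0 to n - 1, so the number of columns
    # is the bit length of the largest code
    return (n_categories - 1).bit_length()


for n in (2, 10, 1000, 4000):
    print(f"{n:>5} categories: one-hot={one_hot_width(n):>5}, binary={binary_width(n):>3}")
```

A feature with 4000 categories needs 12 binary columns instead of 4000 one-hot columns.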

We will use the following files to test it:
- train_person_1
- train_person_2

In [1]:
import polars as pl
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

## Load the data

In [3]:
dataPath = 'C:/Users/laura/OneDrive/Documentos/Personal Documents/Universidad/DSE CCNY/Courses Semester 2/Applied ML/Project_final/home-credit-credit-risk-model-stability/'

In [4]:
def set_table_dtypes(df: pl.DataFrame) -> pl.DataFrame:
    # implement here all desired dtypes for tables
    # the following is just an example
    for col in df.columns:
        # last letter of column name will help you determine the type
        if col[-1] in ("P", "A"):
            df = df.with_columns(pl.col(col).cast(pl.Float64).alias(col))

    return df

In [7]:
train_basetable = pl.read_csv(dataPath + "csv_files/train/train_base.csv")

In [5]:
#  depth=1
train_person_1 = pl.read_csv(dataPath + "csv_files/train/train_person_1.csv").pipe(set_table_dtypes)
#  depth=2
train_person_2 = pl.read_csv(dataPath + "csv_files/train/train_person_2.csv").pipe(set_table_dtypes)

In [8]:
train_person_1.head()

case_id,birth_259D,birthdate_87D,childnum_185L,contaddr_district_15M,contaddr_matchlist_1032L,contaddr_smempladdr_334L,contaddr_zipcode_807M,education_927M,empl_employedfrom_271D,empl_employedtotal_800L,empl_industry_691L,empladdr_district_926M,empladdr_zipcode_114M,familystate_447L,gender_992L,housetype_905L,housingtype_772L,incometype_1044T,isreference_387L,language1_981M,mainoccupationinc_384A,maritalst_703L,num_group1,personindex_1023L,persontype_1072L,persontype_792L,registaddr_district_1083M,registaddr_zipcode_184M,relationshiptoclient_415T,relationshiptoclient_642T,remitter_829L,role_1084L,role_993L,safeguarantyflag_411L,sex_738L,type_25L
i64,str,str,f64,str,bool,bool,str,str,str,str,str,str,str,str,str,str,str,str,bool,str,f64,str,i64,f64,f64,f64,str,str,str,str,bool,str,str,bool,str,str
0,"""1986-07-01""",,,"""P88_18_84""",False,False,"""P167_100_165""","""P97_36_170""","""2017-09-15""","""MORE_FIVE""","""OTHER""","""P142_57_166""","""P167_100_165""","""MARRIED""",,,,"""SALARIED_GOVT""",,"""P10_39_147""",10800.0,,0,0.0,1.0,1.0,"""P88_18_84""","""P167_100_165""",,,,"""CL""",,True,"""F""","""PRIMARY_MOBILE…"
0,,,,"""a55475b1""",,,"""a55475b1""","""a55475b1""",,,,"""a55475b1""","""a55475b1""",,,,,,,"""a55475b1""",,,1,1.0,1.0,4.0,"""a55475b1""","""a55475b1""","""SPOUSE""",,False,"""EM""",,,,"""PHONE"""
0,,,,"""a55475b1""",,,"""a55475b1""","""a55475b1""",,,,"""a55475b1""","""a55475b1""",,,,,,,"""a55475b1""",,,2,2.0,4.0,5.0,"""a55475b1""","""a55475b1""","""COLLEAGUE""","""SPOUSE""",False,"""PE""",,,,"""PHONE"""
0,,,,"""a55475b1""",,,"""a55475b1""","""a55475b1""",,,,"""a55475b1""","""a55475b1""",,,,,,,"""a55475b1""",,,3,,5.0,,"""a55475b1""","""a55475b1""",,"""COLLEAGUE""",,"""PE""",,,,"""PHONE"""
1,"""1957-08-01""",,,"""P103_93_94""",False,False,"""P176_37_166""","""P97_36_170""","""2008-10-29""","""MORE_FIVE""","""OTHER""","""P49_46_174""","""P160_59_140""","""DIVORCED""",,,,"""SALARIED_GOVT""",,"""P10_39_147""",10000.0,,0,0.0,1.0,1.0,"""P103_93_94""","""P176_37_166""",,,,"""CL""",,True,"""M""","""PRIMARY_MOBILE…"


In [9]:
train_person_2.head()

case_id,addres_district_368M,addres_role_871L,addres_zip_823M,conts_role_79M,empls_economicalst_849M,empls_employedfrom_796D,empls_employer_name_740M,num_group1,num_group2,relatedpersons_role_762T
i64,str,str,str,str,str,str,str,i64,i64,str
5,"""a55475b1""",,"""a55475b1""","""a55475b1""","""a55475b1""",,"""a55475b1""",0,0,
6,"""P55_110_32""","""CONTACT""","""P10_68_40""","""P38_92_157""","""P164_110_33""",,"""a55475b1""",0,0,
6,"""P55_110_32""","""PERMANENT""","""P10_68_40""","""a55475b1""","""a55475b1""",,"""a55475b1""",0,1,
6,"""P204_92_178""","""CONTACT""","""P65_136_169""","""P38_92_157""","""P164_110_33""",,"""a55475b1""",1,0,"""OTHER_RELATIVE…"
6,"""P191_109_75""","""CONTACT""","""P10_68_40""","""P7_147_157""","""a55475b1""",,"""a55475b1""",1,1,"""OTHER_RELATIVE…"


### Person 1 Processing

In [10]:
df = train_person_1
# Date columns: string columns whose name contains 'dat'
date_cols = [col for col, dtype in zip(df.columns, df.dtypes) if 'dat' in col and dtype == pl.String]

# Categorical columns: the remaining string columns
cat_cols = [col for col, dtype in zip(df.columns, df.dtypes) if col not in date_cols and dtype == pl.String]

print(cat_cols)

['birth_259D', 'contaddr_district_15M', 'contaddr_zipcode_807M', 'education_927M', 'empl_employedfrom_271D', 'empl_employedtotal_800L', 'empl_industry_691L', 'empladdr_district_926M', 'empladdr_zipcode_114M', 'familystate_447L', 'gender_992L', 'housetype_905L', 'housingtype_772L', 'incometype_1044T', 'language1_981M', 'maritalst_703L', 'registaddr_district_1083M', 'registaddr_zipcode_184M', 'relationshiptoclient_415T', 'relationshiptoclient_642T', 'role_1084L', 'role_993L', 'sex_738L', 'type_25L']
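
Note that the list above still contains `birth_259D` and `empl_employedfrom_271D`. Their values in the `head()` output are date strings, but their names do not contain `'dat'`, so the substring filter treats them as categorical. The sketch below shows one possible alternative that splits on the last letter of the column name. It assumes that a trailing `D` marks a date column, which is suggested by the comment in `set_table_dtypes` but is an assumption here; the function name is hypothetical.

```python
def split_string_columns(columns, date_suffix="D"):
    """Split string-typed column names into (dates, categoricals) by suffix.

    Assumption: a name ending in `date_suffix` holds dates.
    """
    dates = [c for c in columns if c.endswith(date_suffix)]
    categoricals = [c for c in columns if not c.endswith(date_suffix)]
    return dates, categoricals


# A few of the string columns from train_person_1
string_cols = ["birth_259D", "education_927M", "empl_employedfrom_271D", "sex_738L"]
dates, categoricals = split_string_columns(string_cols)
print(dates)
print(categoricals)
```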


In [11]:
# Filtering on the categorical columns only 
train_person_1_cat = train_person_1.select(cat_cols)

# shape
train_person_1_cat.shape

(2973991, 24)

In [14]:
# Frequency encoding for each categorical column
total_count = train_person_1_cat.height  # number of rows
for col in cat_cols:
    # Relative frequency of each category in the column
    # (group_by replaces the deprecated groupby)
    frequency = (
        train_person_1_cat.group_by(col)
        .agg(pl.len().alias('count'))
        .with_columns((pl.col('count') / total_count).alias(f'{col}_freq'))
        .select([col, f'{col}_freq'])
    )

    # Join the frequencies back onto the original DataFrame
    train_person_1_cat = train_person_1_cat.join(frequency, on=col, how='left')
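
The same computation on a toy column, in plain Python, shows what the `_freq` columns contain. Counting `None` as its own category is a choice made for this illustration; whether missing values receive a frequency in the Polars version depends on how the join treats null keys, so that is worth checking on the real data. The function name is hypothetical.

```python
from collections import Counter


def frequency_encode(values):
    """Return the relative frequency of each value, in the original order.

    None is counted as a category of its own (an illustrative choice).
    """
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


column = ["a55475b1", "P88_18_84", "a55475b1", None, "a55475b1"]
freqs = frequency_encode(column)
print(freqs)
```

Here `"a55475b1"` fills three of five rows and is encoded as 0.6, while the other two values are each encoded as 0.2.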


In [15]:
# Initialize LabelEncoder
le = LabelEncoder()

# Binary encoding for each categorical column
for col in cat_cols:
    # Convert categories to integers using LabelEncoder from sklearn
    encoded_int = le.fit_transform(train_person_1_cat[col].to_numpy())

    # Convert the numpy array back to a Polars Series and rename it
    encoded_series = pl.Series(encoded_int).alias(f"{col}_int")

    # Add the integer encoded column to the DataFrame
    train_person_1_cat = train_person_1_cat.with_columns(encoded_series)

    # Calculate the maximum binary length
    max_binary_length = encoded_series.max().bit_length()

    # Create binary encoding directly
    for bit_position in range(max_binary_length):
        # Extract the bit at this position: integer-divide by 2**bit_position, then mask with 1
        bit_value = (encoded_series // (2 ** bit_position)) & 1
        train_person_1_cat = train_person_1_cat.with_columns(
            bit_value.alias(f"{col}_binary_{bit_position}")
        )
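
The bit extraction in the inner loop can be illustrated on a handful of integer codes. This is a plain-Python sketch with a hypothetical helper name; column `k` of the result corresponds to `{col}_binary_{k}` in the cell above, so the least significant bit comes first.

```python
def to_bits(code: int, width: int) -> list[int]:
    """Return the `width` lowest bits of a non-negative code, least significant first."""
    return [(code >> position) & 1 for position in range(width)]


codes = [0, 1, 2, 5]                 # integer codes, as a label encoder would produce
width = max(codes).bit_length()      # number of binary columns needed
rows = [to_bits(c, width) for c in codes]
for code, bits in zip(codes, rows):
    print(code, bits)
```

The code 5 is `0b101`, so its row reads `[1, 0, 1]` from bit 0 to bit 2.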

In [47]:
train_person_1_cat.head()

case_id,birth_259D,birthdate_87D,childnum_185L,contaddr_district_15M,contaddr_matchlist_1032L,contaddr_smempladdr_334L,contaddr_zipcode_807M,education_927M,empl_employedfrom_271D,empl_employedtotal_800L,empl_industry_691L,empladdr_district_926M,empladdr_zipcode_114M,familystate_447L,gender_992L,housetype_905L,housingtype_772L,incometype_1044T,isreference_387L,language1_981M,mainoccupationinc_384A,maritalst_703L,num_group1,personindex_1023L,persontype_1072L,persontype_792L,registaddr_district_1083M,registaddr_zipcode_184M,relationshiptoclient_415T,relationshiptoclient_642T,remitter_829L,role_1084L,role_993L,safeguarantyflag_411L,sex_738L,type_25L,…,registaddr_district_1083M_binary_2,registaddr_district_1083M_binary_3,registaddr_district_1083M_binary_4,registaddr_district_1083M_binary_5,registaddr_district_1083M_binary_6,registaddr_district_1083M_binary_7,registaddr_district_1083M_binary_8,registaddr_district_1083M_binary_9,registaddr_zipcode_184M_binary_0,registaddr_zipcode_184M_binary_1,registaddr_zipcode_184M_binary_2,registaddr_zipcode_184M_binary_3,registaddr_zipcode_184M_binary_4,registaddr_zipcode_184M_binary_5,registaddr_zipcode_184M_binary_6,registaddr_zipcode_184M_binary_7,registaddr_zipcode_184M_binary_8,registaddr_zipcode_184M_binary_9,registaddr_zipcode_184M_binary_10,registaddr_zipcode_184M_binary_11,relationshiptoclient_415T_binary_0,relationshiptoclient_415T_binary_1,relationshiptoclient_415T_binary_2,relationshiptoclient_415T_binary_3,relationshiptoclient_642T_binary_0,relationshiptoclient_642T_binary_1,relationshiptoclient_642T_binary_2,relationshiptoclient_642T_binary_3,role_1084L_binary_0,role_1084L_binary_1,role_993L_binary_0,sex_738L_binary_0,sex_738L_binary_1,type_25L_binary_0,type_25L_binary_1,type_25L_binary_2,type_25L_binary_3
i64,str,str,f64,str,bool,bool,str,str,str,str,str,str,str,str,str,str,str,str,bool,str,f64,str,i64,f64,f64,f64,str,str,str,str,bool,str,str,bool,str,str,…,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
0,"""1986-07-01""",,,"""P88_18_84""",False,False,"""P167_100_165""","""P97_36_170""","""2017-09-15""","""MORE_FIVE""","""OTHER""","""P142_57_166""","""P167_100_165""","""MARRIED""",,,,"""SALARIED_GOVT""",,"""P10_39_147""",10800.0,,0,0.0,1.0,1.0,"""P88_18_84""","""P167_100_165""",,,,"""CL""",,True,"""F""","""PRIMARY_MOBILE…",…,1,1,1,0,0,1,1,1,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,0,1,0,0,0,0,1,0
0,,,,"""a55475b1""",,,"""a55475b1""","""a55475b1""",,,,"""a55475b1""","""a55475b1""",,,,,,,"""a55475b1""",,,1,1.0,1.0,4.0,"""a55475b1""","""a55475b1""","""SPOUSE""",,False,"""EM""",,,,"""PHONE""",…,1,1,1,0,1,1,1,1,0,1,0,1,0,0,1,1,1,0,1,1,1,0,0,1,0,1,0,1,1,0,1,0,1,0,1,0,0
0,,,,"""a55475b1""",,,"""a55475b1""","""a55475b1""",,,,"""a55475b1""","""a55475b1""",,,,,,,"""a55475b1""",,,2,2.0,4.0,5.0,"""a55475b1""","""a55475b1""","""COLLEAGUE""","""SPOUSE""",False,"""PE""",,,,"""PHONE""",…,1,1,1,0,1,1,1,1,0,1,0,1,0,0,1,1,1,0,1,1,1,0,0,0,1,0,0,1,0,1,1,0,1,0,1,0,0
0,,,,"""a55475b1""",,,"""a55475b1""","""a55475b1""",,,,"""a55475b1""","""a55475b1""",,,,,,,"""a55475b1""",,,3,,5.0,,"""a55475b1""","""a55475b1""",,"""COLLEAGUE""",,"""PE""",,,,"""PHONE""",…,1,1,1,0,1,1,1,1,0,1,0,1,0,0,1,1,1,0,1,1,0,1,0,1,1,0,0,0,0,1,1,0,1,0,1,0,0
1,"""1957-08-01""",,,"""P103_93_94""",False,False,"""P176_37_166""","""P97_36_170""","""2008-10-29""","""MORE_FIVE""","""OTHER""","""P49_46_174""","""P160_59_140""","""DIVORCED""",,,,"""SALARIED_GOVT""",,"""P10_39_147""",10000.0,,0,0.0,1.0,1.0,"""P103_93_94""","""P176_37_166""",,,,"""CL""",,True,"""M""","""PRIMARY_MOBILE…",…,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,1,0,1,0,1,0,0,1,1,0,0,0,1,0


In [16]:
# Drop the original categorical columns from the DataFrame
train_person_1_cat = train_person_1_cat.drop(cat_cols)

In [None]:
# Save the encoded table (the DataFrame built above is train_person_1_cat)
train_person_1_cat.write_parquet('../data/train_person_1_cat.parquet')

### Person 2 Processing

In [17]:
df = train_person_2
# Date columns: string columns whose name contains 'dat'
date_cols = [col for col, dtype in zip(df.columns, df.dtypes) if 'dat' in col and dtype == pl.String]

# Categorical columns: the remaining string columns
cat_cols = [col for col, dtype in zip(df.columns, df.dtypes) if col not in date_cols and dtype == pl.String]

In [18]:
# Filtering on the categorical columns only 
train_person_2_cat = train_person_2.select(cat_cols)

# shape
train_person_2_cat.shape

(1643410, 8)

In [19]:
# Frequency encoding for each categorical column
total_count = train_person_2_cat.height  # number of rows
for col in cat_cols:
    # Relative frequency of each category in the column
    # (group_by replaces the deprecated groupby)
    frequency = (
        train_person_2_cat.group_by(col)
        .agg(pl.len().alias('count'))
        .with_columns((pl.col('count') / total_count).alias(f'{col}_freq'))
        .select([col, f'{col}_freq'])
    )

    # Join the frequencies back onto the original DataFrame
    train_person_2_cat = train_person_2_cat.join(frequency, on=col, how='left')


In [20]:
# Initialize LabelEncoder
le = LabelEncoder()

# Binary encoding for each categorical column
for col in cat_cols:
    # Convert categories to integers using LabelEncoder from sklearn
    encoded_int = le.fit_transform(train_person_2_cat[col].to_numpy())

    # Convert the numpy array back to a Polars Series and rename it
    encoded_series = pl.Series(encoded_int).alias(f"{col}_int")

    # Add the integer encoded column to the DataFrame
    train_person_2_cat = train_person_2_cat.with_columns(encoded_series)

    # Calculate the maximum binary length
    max_binary_length = encoded_series.max().bit_length()

    # Create binary encoding directly
    for bit_position in range(max_binary_length):
        # Extract the bit at this position: integer-divide by 2**bit_position, then mask with 1
        bit_value = (encoded_series // (2 ** bit_position)) & 1
        train_person_2_cat = train_person_2_cat.with_columns(
            bit_value.alias(f"{col}_binary_{bit_position}")
        )

In [21]:
# Drop the original categorical columns from the DataFrame
train_person_2_cat = train_person_2_cat.drop(cat_cols)

In [9]:
train_person_2_cat.head()

# Save the encoded table (the DataFrame built above is train_person_2_cat)
train_person_2_cat.write_parquet('../data/train_person_2_cat.parquet')