## OOP for Data Preprocessing

The last time I played with OOP for ML, I created an all-in-one classifier that can train, tune and explain multiple ML algorithms ([notebook here](https://github.com/MenaWANG/ML_toy_examples/blob/main/modeling%20algorithm/OOP%20and%20multiple%20models.ipynb)).

At the end of that notebook, however, I also discussed the option of smaller subclasses that follow the **Single Responsibility** principle in SOLID. So this time, as an experiment, I want to create a smaller custom class just for data preprocessing.

One interesting technical challenge is how to give meaningful names to the one-hot-encoded columns, and how to do so when one reference category is dropped. Please see the solution in the class below, with more details in the Appendix.

In [2]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

In [37]:
class custom_preprocessor(BaseEstimator, TransformerMixin):
    
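    def __init__(self, column_transformer = None):
        # Store the argument as given (scikit-learn convention, needed for clone()); fit() builds the actual transformer
        self.column_transformer = column_transformer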
    
    def get_column_names(self, X):
        transformed_cols = []

        if self.numeric_columns:
            transformed_cols += self.numeric_columns
        if self.categorical_columns:
            encoder = self.column_transformer.named_transformers_['cat']['encoder']
            categorical_values = encoder.categories_
            for col, values in zip(self.categorical_columns, categorical_values):
                transformed_cols += [f'{col}_{val}' for val in values]
                # transformed_cols += [f'{col}_{val}' for val in values[1:]]  # Exclude the first value

        return transformed_cols
    
    def fit(self, X, y=None):
        self.numeric_columns = X.select_dtypes(include = np.number).columns.tolist()
        self.categorical_columns = X.select_dtypes(exclude = np.number).columns.tolist()
        
        # Build the transformer list conditionally so that data with only
        # numeric (or only categorical) columns does not raise a NameError
        transformers = []

        if self.numeric_columns:
            num_transformer = Pipeline(steps = [
                ('imputer', SimpleImputer(strategy = 'mean')),
                ('scaler', StandardScaler(with_mean = True, with_std = True)),
            ])
            transformers.append(('num', num_transformer, self.numeric_columns))

        if self.categorical_columns:
            cat_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy = 'most_frequent', keep_empty_features = False)),
                ('encoder', OneHotEncoder()),
                #('encoder', OneHotEncoder(drop = 'first')), # Drop the first as reference category
            ])
            transformers.append(('cat', cat_transformer, self.categorical_columns))

        self.column_transformer = ColumnTransformer(
            transformers = transformers,
            remainder = 'passthrough',
            sparse_threshold = 0,  # always return a dense array so it can be wrapped in a DataFrame
        )
        
        self.column_transformer.fit(X)

        # Get the column names after transformation
        self.column_names = self.get_column_names(X)

        return self
    
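    def transform(self, X):
        # column_names is only set in fit(), so its absence means the preprocessor is unfitted
        if not hasattr(self, 'column_names'):
            raise ValueError('The preprocessor has not been fit yet')
        
        # Keep the original row index so the output aligns with the input
        return pd.DataFrame(self.column_transformer.transform(X), columns=self.column_names, index=X.index)
    
    def fit_transform(self, X, y=None):
        # Accept y so the class also works inside a scikit-learn Pipeline
        self.fit(X, y)
        return self.transform(X)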

In [17]:
# Fit and transform demo data
data = pd.DataFrame({
    'age': [30, np.nan, 35, 28, 30],
    'income': [50000, 60000, 75000, 70000, np.nan],
    'gender': ['Male', 'Female', np.nan, 'Male', 'Male'],
    'education': ['Bachelor', np.nan, 'PhD', 'Bachelor', 'Master'], 
})

print("Demo data before preprocessing\n", data)


Demo data before preprocessing
     age   income  gender education
0  30.0  50000.0    Male  Bachelor
1   NaN  60000.0  Female       NaN
2  35.0  75000.0     NaN       PhD
3  28.0  70000.0    Male  Bachelor
4  30.0      NaN    Male    Master


In [38]:
preprocessor = custom_preprocessor()
preprocessor.fit(data)
data_preprocessed = preprocessor.transform(data)  # transform() already returns a DataFrame
print("Demo data after preprocessing\n", data_preprocessed)

Demo data after preprocessing
         age    income  gender_Male  education_Master  education_PhD
0 -0.324253 -1.601112          1.0               0.0            0.0
1  0.000000 -0.436667          0.0               0.0            0.0
2  1.837435  1.310001          1.0               0.0            1.0
3 -1.188929  0.727778          1.0               0.0            0.0
4 -0.324253  0.000000          1.0               1.0            0.0


## Discussions

In the above class, we leveraged `Pipeline()` to chain the preprocessing steps for each feature type, and `ColumnTransformer()` to apply the numeric and categorical pipelines to their respective columns.

One interesting consideration regarding the steps is whether to impute before or after scaling. As a simple demo, I impute first for both feature types, then scale the numeric features and encode the categorical ones. For numeric variables, however, one needs to weigh the pros and cons of the two orderings for each specific project, based on particulars such as the imputation algorithm used and the distribution of the feature being imputed. (There are quite a few discussions of this online; here is [one example](https://stats.stackexchange.com/questions/138203/imputation-of-missing-data-before-or-after-centering-and-scaling#:~:text=It%20really%20depends%20on%20the,lower%20magnitude%20values%20converge%20faster.).)
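
To make the ordering question concrete, here is a minimal sketch that runs the `age` values from the demo data through both orderings, using mean imputation and `StandardScaler` only. It relies on `StandardScaler` ignoring NaNs when fitting and passing them through when transforming. Other imputers may behave quite differently, so please treat this as an illustration rather than a recommendation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The 'age' column from the demo data, with one missing value
age = np.array([[30.0], [np.nan], [35.0], [28.0], [30.0]])

impute_then_scale = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
scale_then_impute = Pipeline(steps=[
    ('scaler', StandardScaler()),   # NaNs are ignored in fit and kept in transform
    ('imputer', SimpleImputer(strategy='mean')),
])

a = impute_then_scale.fit_transform(age).ravel()
b = scale_then_impute.fit_transform(age).ravel()

print("impute -> scale:", np.round(a, 3))
print("scale -> impute:", np.round(b, 3))
```

In both cases the imputed entry ends up at (approximately) zero, but the observed values are spread wider when imputing first, because the imputed mean shrinks the standard deviation that the scaler estimates.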

Another thing to consider is the condition for imputation. If a feature has 25% missing values, do we still impute it? What about 50%? With a custom class like this, we can easily build such a threshold into the preprocessor. By the same token, any project-specific logic can be added to the custom preprocessor, so we only need to maintain and update these rules in one place. This is one big advantage of the OOP (vs. functional) approach for such use cases.
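
One way such a threshold could look is sketched below. `MissingThresholdDropper` and its `max_missing` parameter are hypothetical names made up for this illustration; they are not part of scikit-learn or of the class above. The idea is simply to drop any column whose share of missing values exceeds the threshold before imputation happens.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MissingThresholdDropper(BaseEstimator, TransformerMixin):
    """Hypothetical helper: drop columns whose share of missing values exceeds max_missing."""

    def __init__(self, max_missing=0.5):
        self.max_missing = max_missing

    def fit(self, X, y=None):
        # Share of missing values per column, learned from the training data only
        missing_share = X.isna().mean()
        self.columns_to_keep_ = missing_share[missing_share <= self.max_missing].index.tolist()
        return self

    def transform(self, X):
        # Apply the same column selection to any new data
        return X[self.columns_to_keep_]


demo = pd.DataFrame({
    'age': [30, np.nan, 35, 28],                 # 25% missing -> kept
    'income': [np.nan, np.nan, np.nan, 70000],   # 75% missing -> dropped
})

kept = MissingThresholdDropper(max_missing=0.5).fit_transform(demo)
print(kept.columns.tolist())
```

Because the rule lives in its own small class, it could be placed in front of the preprocessor in a `Pipeline`, or folded into the preprocessor's `fit()` method, and the threshold would still only need to be changed in one place.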


## Appendix and Tests

### Drop one category in OneHotEncoding

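In the class definition above, the lines that drop the first category of each categorical variable are commented out. Dropping a reference category is useful for ML algorithms that are sensitive to multicollinearity (e.g., linear and logistic regressions), but matters less for algorithms that are less affected by it (e.g., tree-based algorithms and neural networks). Note that the demo output shown earlier, which contains only `gender_Male`, `education_Master` and `education_PhD`, corresponds to the variant with those lines switched on, i.e. with `gender_Female` and `education_Bachelor` dropped as the reference categories.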
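
The multicollinearity point can be checked numerically. In the small sketch below (my own illustration, using a made-up `Color` column), the full set of one-hot columns always sums to one per row, so together with an intercept column the design matrix is rank-deficient; dropping the first category restores full column rank.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue']})


def design_matrix(drop):
    """Return [intercept | one-hot columns] as a dense array."""
    encoded = OneHotEncoder(drop=drop).fit_transform(colors).toarray()
    intercept = np.ones((len(colors), 1))
    return np.hstack([intercept, encoded])


full = design_matrix(None)        # keep all three categories
reduced = design_matrix('first')  # drop the first category as reference

print("all categories:", full.shape[1], "columns, rank", np.linalg.matrix_rank(full))
print("drop first:    ", reduced.shape[1], "columns, rank", np.linalg.matrix_rank(reduced))
```

A rank lower than the number of columns means the coefficients of an unregularized linear model are not uniquely determined, which is why a reference category is commonly dropped for such models.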

Below is the code to quickly test how the result differs depending on whether a reference category is dropped.

In [80]:
def custom_encode(data: pd.DataFrame, drop_first: bool = False) -> pd.DataFrame:
    """Encode the input DataFrame with one-hot encoding and return the encoded DataFrame.

    Args:
        data (pd.DataFrame): a DataFrame to be encoded.
        drop_first (bool, optional): whether to drop the first category as the reference category. Defaults to False.

    Returns:
        pd.DataFrame: the encoded DataFrame. 
    """
    # Columns to be one-hot encoded
    categorical_cols = data.columns

    # Create an instance of OneHotEncoder
    if drop_first:
        encoder = OneHotEncoder(drop = 'first')
    else:
        encoder = OneHotEncoder()

    # Fit and transform the data
    encoded_data = encoder.fit_transform(data[categorical_cols])
    # Get the unique category values
    unique_categories = encoder.categories_
    print(unique_categories)

    # Generate column names for the encoded features
    if drop_first:
        column_names = [f'{col}_{category}' for col, categories in zip(categorical_cols, unique_categories) for category in categories[1:]]
    else:
        column_names = [f'{col}_{category}' for col, categories in zip(categorical_cols, unique_categories) for category in categories]

    # Convert the result to a dense array for visualization
    encoded_array = encoded_data.toarray()

    # Create a DataFrame with dynamically generated column names
    encoded_df = pd.DataFrame(encoded_array, columns=column_names)

    return encoded_df

In [None]:
# Example data as a DataFrame
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
                     'Size': ['Large', 'Medium', 'Large', 'Small', 'Medium']})

In [81]:
custom_encode(data, drop_first = True)

[array(['Blue', 'Green', 'Red'], dtype=object), array(['Large', 'Medium', 'Small'], dtype=object)]


   Color_Green  Color_Red  Size_Medium  Size_Small
0          0.0        1.0          0.0         0.0
1          0.0        0.0          1.0         0.0
2          1.0        0.0          0.0         0.0
3          0.0        1.0          0.0         1.0
4          0.0        0.0          1.0         0.0


In [79]:
custom_encode(data, drop_first = False)

   Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0         0.0          0.0        1.0         1.0          0.0         0.0
1         1.0          0.0        0.0         0.0          1.0         0.0
2         0.0          1.0        0.0         1.0          0.0         0.0
3         0.0          0.0        1.0         0.0          0.0         1.0
4         1.0          0.0        0.0         0.0          1.0         0.0
