# Mean Encoding for Categorical Features

We evaluated two ways to add mean encoding to our preprocessing pipeline: integrating it directly into the existing `PreProcessor` class, or creating a separate `MeanEncoder` class. We prefer the separate class. This follows the Single Responsibility Principle and mirrors our current architecture, where numeric and categorical transformations are handled as distinct operations. A separate `MeanEncoder` is easier to test, maintain, and extend, and it leaves room for advanced features such as cross-validation folds and smoothing parameters. The cost is one more class and its interactions to manage, which is outweighed by the gains in modularity, reusability, and code organization.

In [4]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

# MeanEncoder Class

## From Scratch or Existing Package

**Using Existing Package**
`MeanEncoder` could wrap an existing package such as `category_encoders`. The pros and cons of this approach:
* Pros:
    * Proven, tested code
    * Immediate access to multiple encoding strategies
    * Community support and bug fixes
    * Time-saving for basic functionality
* Cons:
    * Less control over implementation details
    * May include unnecessary dependencies
    * Harder to customize or extend in specific ways
    * May not perfectly match our specific needs

**Building from Scratch** 
* Pros:
    * Complete control over implementation
    * Can perfectly match our specific requirements
    * Better understanding of the codebase
    * Easier to extend with custom functionality
    * No additional external dependencies (beyond pandas and scikit-learn)
* Cons:
    * More time to implement
    * Need to write own tests
    * Need to handle edge cases ourselves
    * Risk of introducing bugs

**Recommendation**: Given that we want to build sophisticated, extensible functionality, building from scratch is recommended because:
* We'll need fine-grained control over CV and smoothing implementations
* We might want to add custom features specific to our use case
* The core mean encoding logic is relatively straightforward
* We can start simple and gradually add complexity

We can still use `category_encoders` as a reference for best practices and edge-case handling.
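To make the "CV and smoothing" point concrete, here is a minimal sketch of what those two extensions could look like. It is not part of the current `MeanEncoder`: the function names `smoothed_means` and `oof_mean_encode`, the additive smoothing form `(sum + m * prior) / (count + m)`, and the parameter `m` are all assumptions chosen for illustration, and other smoothing schemes are equally valid.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def smoothed_means(x, y, m=10.0):
    """Per-category target means shrunk toward the global mean (hypothetical helper)."""
    prior = y.mean()
    stats = y.groupby(x).agg(["sum", "count"])
    # Additive smoothing: categories with few rows stay close to the prior
    return (stats["sum"] + m * prior) / (stats["count"] + m), prior


def oof_mean_encode(x, y, n_splits=5, m=10.0, seed=0):
    """Encode each row using statistics computed on the other folds only."""
    encoded = pd.Series(np.nan, index=x.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(x):
        means, prior = smoothed_means(x.iloc[fit_idx], y.iloc[fit_idx], m)
        # Categories unseen in the fitting folds fall back to the prior
        values = x.iloc[enc_idx].map(means).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Toy data, made up for this sketch
x = pd.Series(["A", "A", "A", "B", "B", "C"], name="category")
y = pd.Series([1, 1, 0, 0, 1, 1])

means, prior = smoothed_means(x, y, m=2.0)
print(means)
print(oof_mean_encode(x, y, n_splits=3, m=2.0))
```

In this toy data, category `C` has a single row with target 1, so its raw mean would be 1.0; smoothing pulls it toward the global mean. The out-of-fold variant never uses a row's own target to encode that row, which is the usual motivation for adding CV folds to a mean encoder.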



## MeanEncoder Version 1

Let's start simple: this version encodes each category as the mean of the target for that category. As suggested above, `TargetEncoder` from `category_encoders` serves as a reference for testing.

In [5]:
class MeanEncoder(BaseEstimator, TransformerMixin):
    """
    Basic mean encoder for categorical features.
    Maps each category to the mean of target values for that category.
    """
    
    def fit(self, X, y=None):
        if y is None:
            raise ValueError("Target variable y is required for mean encoding")

        # Align y with X by position so arrays and differently indexed Series both work
        y = pd.Series(np.asarray(y), index=X.index)

        # Learned state is (re)created on every fit so refitting leaves no stale columns
        self.encodings = {}
        self.global_means = {}

        # Process each feature
        for column in X.columns:
            # Global mean is the fallback for unknown categories
            self.global_means[column] = y.mean()
            
            # Mean target value for each category
            self.encodings[column] = y.groupby(X[column]).mean()
            
        return self
    
    def transform(self, X):
        X_encoded = pd.DataFrame(index=X.index)
        
        for column in X.columns:
            # Map categories to their mean target values
            # Use global mean for unknown categories
            X_encoded[column] = X[column].map(self.encodings[column]).fillna(self.global_means[column])
            
        return X_encoded
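
The core of `fit` and `transform` can be traced by hand on a tiny example. The sketch below repeats the same pandas operations outside the class; the `city` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical training data: one categorical column and a binary target
train = pd.DataFrame({"city": ["NY", "NY", "LA", "LA", "LA"]})
y_train = pd.Series([1, 0, 1, 1, 0])

# What fit stores: one mean per category, plus the global mean as a fallback
category_means = y_train.groupby(train["city"]).mean()  # LA: 2/3, NY: 1/2
global_mean = y_train.mean()                            # 3/5

# What transform does: look up each category, fall back for unseen ones
new = pd.DataFrame({"city": ["NY", "LA", "SF"]})
encoded = new["city"].map(category_means).fillna(global_mean)
print(encoded.tolist())
```

`SF` never appears in the training data, so `map` yields `NaN` for it and `fillna` substitutes the global mean.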

In [22]:
import pandas as pd
import numpy as np
from category_encoders import TargetEncoder

# Create demo dataset for testing
np.random.seed(42)
data = pd.DataFrame({
    'category_1': np.random.choice(['A', 'B', 'C'], size=100),
    'category_2': np.random.choice(['X', 'Y', 'Z'], size=100),
})
target = pd.Series([
    np.random.binomial(1, 0.8) if cat == 'A' else
    np.random.binomial(1, 0.5) if cat == 'B' else
    np.random.binomial(1, 0.2)
    for cat in data['category_1']
])

# Test our implementation
our_encoder = MeanEncoder()
our_encoded = our_encoder.fit_transform(data, target)

# Test category_encoders implementation
ce_encoder = TargetEncoder(smoothing=1.0)
ce_encoded = ce_encoder.fit_transform(data, target)

# Show actual means for reference
print("Actual means per category_1:")
print(target.groupby(data['category_1']).mean())

# Comparison
print("\nFirst few rows comparison:")
print("\nOriginal data with our encoding:")
comparison = pd.concat([
    data.head(),
    our_encoded.head().add_suffix('_our_encoded'),
    ce_encoded.head().add_suffix('_ce_encoded')
], axis=1)
print(comparison)

Actual means per category_1:
category_1
A    0.757576
B    0.638889
C    0.193548
dtype: float64

First few rows comparison:

Original data with our encoding:
  category_1 category_2  category_1_our_encoded  category_2_our_encoded  \
0          C          Z                0.193548                0.500000   
1          A          Z                0.757576                0.500000   
2          C          Z                0.193548                0.500000   
3          C          X                0.193548                0.575758   
4          A          Z                0.757576                0.500000   

   category_1_ce_encoded  category_2_ce_encoded  
0               0.193554               0.500000  
1               0.757575               0.500000  
2               0.193554               0.500000  
3               0.193554               0.575757  
4               0.757575               0.500000  
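
The two encodings agree to about five decimal places but not exactly. A plausible explanation is the smoothing that `TargetEncoder` applies. The sketch below rests on several assumptions that the notebook does not state: that `TargetEncoder` blends the category mean with the global mean using a sigmoid weight on the category count, that `min_samples_leaf` defaults to 20 in the installed release, and that the category counts are 33, 36, and 31 (inferred from the printed means 25/33, 23/36, and 6/31, which sum to 100 rows). `sigmoid_blend` is a hypothetical helper, not a library function.

```python
import math


def sigmoid_blend(cat_mean, count, prior, min_samples_leaf=20, smoothing=1.0):
    """Blend a category mean with the prior (assumed form of TargetEncoder smoothing)."""
    weight = 1.0 / (1.0 + math.exp(-(count - min_samples_leaf) / smoothing))
    return prior * (1.0 - weight) + cat_mean * weight


prior = 54 / 100  # inferred global mean: (25 + 23 + 6) / 100
for name, hits, count in [("A", 25, 33), ("B", 23, 36), ("C", 6, 31)]:
    plain = hits / count
    blended = sigmoid_blend(plain, count, prior)
    print(name, round(plain, 6), round(blended, 6))
```

Under these assumptions the blend rounds to 0.757575 for `A` and 0.193554 for `C`, which matches the `category_encoders` columns above. If that reading is right, the gap is a smoothing effect rather than a bug in `MeanEncoder`, and an approximate comparison (for example with `np.allclose` and a small tolerance) is the appropriate test.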


# Incorporating into PreProcessor

In [2]:
class PreProcessor(BaseEstimator, TransformerMixin):
    """
    Custom transformer for data preprocessing.
    
    - Scales numeric features
    - Encodes categorical features
    - Handles missing values via imputation
    - Compatible with scikit-learn pipeline
    
    Attributes:
        num_impute_strategy (str): Numeric imputation strategy
        cat_impute_strategy (str): Categorical imputation strategy
        mean_encode_cols (List[str]): Columns to mean encode instead of one-hot encode
        mean_encoder (MeanEncoder): Mean encoder, created when mean_encode_cols is set
        num_transformer (Pipeline): Numeric preprocessing pipeline
        cat_transformer (Pipeline): Categorical preprocessing pipeline
        transformed_cat_cols (List[str]): One-hot encoded column names
        num_features (List[str]): Numeric feature names
        cat_features (List[str]): Categorical feature names
    """

    def __init__(self, num_impute_strategy='median', 
                 cat_impute_strategy='most_frequent',
                 mean_encode_cols=None):
        """
        Initialize the transformer.
        
        - Sets up numeric data transformer
        - Sets up categorical data transformer
        - Configures imputation strategies
        
        Parameters:
            num_impute_strategy (str): Strategy for numeric missing values
            cat_impute_strategy (str): Strategy for categorical missing values
            mean_encode_cols (List[str], optional): Columns to apply mean encoding
        """
        self.num_impute_strategy = num_impute_strategy
        self.cat_impute_strategy = cat_impute_strategy
        self.mean_encode_cols = mean_encode_cols

    def fit(self, X, y=None):
        """
        Fit transformer on input data.
        
        - Identifies feature types
        - Configures feature scaling
        - Sets up encoding
        - Fits imputation strategies
        
        Parameters:
            X (pd.DataFrame): Input features
            y (pd.Series, optional): Target variable, required when mean_encode_cols is set
        
        Returns:
            PreProcessor: Fitted transformer
        """
        self.num_features = X.select_dtypes(include=np.number).columns.tolist()
        self.cat_features = [col for col in X.select_dtypes(exclude=np.number).columns if col not in (self.mean_encode_cols or [])]

        # Handle mean encoding features
        if self.mean_encode_cols:
            self.mean_encoder = MeanEncoder()
            mean_encode_features = [f for f in self.mean_encode_cols if f in X.columns]
            if mean_encode_features:
                self.mean_encoder.fit(X[mean_encode_features], y)

        if self.num_features:
            self.num_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy=self.num_impute_strategy)),
                ('scaler', StandardScaler())
            ])
            self.num_transformer.fit(X[self.num_features])
        
        if self.cat_features:
            self.cat_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy=self.cat_impute_strategy)),
                ('encoder', OneHotEncoder(handle_unknown='ignore'))
            ])
            self.cat_transformer.fit(X[self.cat_features])
        
        return self

    def get_transformed_cat_cols(self):
        """
        Get transformed categorical column names.
        
        - Creates names after one-hot encoding
        - Combines category with encoded values
        
        Returns:
            List[str]: One-hot encoded column names
        """
        cat_cols = []
        cats = self.cat_features
        cat_values = self.cat_transformer['encoder'].categories_
        for cat, values in zip(cats, cat_values):
            cat_cols += [f'{cat}_{value}' for value in values]
        
        return cat_cols

    def transform(self, X):
        """
        Transform input data.
        
        - Applies fitted scaling
        - Applies fitted encoding
        - Handles numeric and categorical features
        
        Parameters:
            X (pd.DataFrame): Input features
        
        Returns:
            pd.DataFrame: Transformed data
        """
        # Build every block on X's index so that pd.concat aligns rows correctly
        X_transformed = pd.DataFrame(index=X.index)

        if self.mean_encode_cols and hasattr(self, 'mean_encoder'):
            # Only transform the mean-encoded columns that are present in X
            mean_encode_features = [f for f in self.mean_encode_cols if f in X.columns]
            mean_encoded_features = self.mean_encoder.transform(X[mean_encode_features])
            X_transformed = pd.concat([X_transformed, mean_encoded_features], axis=1)

        if self.num_features:
            transformed_num_data = self.num_transformer.transform(X[self.num_features])
            transformed_num_df = pd.DataFrame(transformed_num_data, columns=self.num_features, index=X.index)
            X_transformed = pd.concat([X_transformed, transformed_num_df], axis=1)
        
        if self.cat_features:
            transformed_cat_data = self.cat_transformer.transform(X[self.cat_features]).toarray()
            self.transformed_cat_cols = self.get_transformed_cat_cols()
            transformed_cat_df = pd.DataFrame(transformed_cat_data, columns=self.transformed_cat_cols, index=X.index)
            X_transformed = pd.concat([X_transformed, transformed_cat_df], axis=1)

        return X_transformed

    def fit_transform(self, X, y=None):
        """
        Fit and transform input data.
        
        - Fits transformer to data
        - Applies transformation
        - Combines both operations
        
        Parameters:
            X (pd.DataFrame): Input features
            y (pd.Series, optional): Target variable, required when mean_encode_cols is set
        
        Returns:
            pd.DataFrame: Transformed data
        """
        self.fit(X, y)
        return self.transform(X)
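
One detail in `transform` is worth checking whenever blocks are combined with `pd.concat(axis=1)`: pandas aligns rows on the index, not on position. A DataFrame built from a raw array gets a default `RangeIndex`, so concatenating it with a block that carries the original index can silently produce extra rows filled with `NaN`. The sketch below shows the effect on made-up data; the column names `num`, `c_a`, and `c_b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Input whose index is not 0..n-1, as is typical after a train/test split
X = pd.DataFrame({"num": [1.0, 2.0, 3.0]}, index=[10, 11, 12])
dense = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # stand-in for one-hot output

left = X[["num"]]

# Default RangeIndex on the right-hand block: rows do not line up
misaligned = pd.concat(
    [left, pd.DataFrame(dense, columns=["c_a", "c_b"])], axis=1
)

# Passing index=X.index keeps the blocks aligned row by row
aligned = pd.concat(
    [left, pd.DataFrame(dense, columns=["c_a", "c_b"], index=X.index)], axis=1
)

print(misaligned.shape)  # rows are the union of both indexes
print(aligned.shape)
```

Constructing every intermediate DataFrame with `index=X.index` avoids the problem regardless of how the input was indexed.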