# The Devil is in the Details: MLArena's Little Delights 🎯✨

Machine learning is full of small but meaningful moments: presenting SHAP plots with actual values (rather than scaled ones) to stakeholders, handling cryptic feature names that break the code, and optimizing the one-hot encoding strategy for different model types. These little details can make an ML practitioner's life so much smoother! That's why MLArena comes with thoughtful touches that show we've been in your shoes. 👣

Think of this as a collection of ML quality-of-life improvements - like having a cup holder in exactly the right spot, or finding out your new jacket has inside pockets. Small things that make you smile and wonder why they aren't everywhere. Let's explore these delightful details that make MLArena not just powerful, but pleasantly surprising! 🎁

## The Tale of Two Models: Smart One-Hot Encoding 🎭

### The One-Hot Encoding Dilemma

Linear models and tree-based models have different preferences when it comes to one-hot encoding:

* **Linear Models** love having one category dropped (`drop="first"`):
  - Eliminates perfect multicollinearity
  - Makes coefficients more interpretable (each coefficient represents difference from reference category)
  - Improves numerical stability
  
* **Tree Models** may prefer having all categories (`drop=None`):
  - Can directly split on any category:
    > Tree models evaluate one feature at a time. If a category is dropped, it can only be inferred when all other dummy features are zero, which takes a chain of several splits to isolate. Keeping all categories lets the model split explicitly on each one.
  - Better handling of feature interactions:
    > When all categories are present, tree models can more easily capture interactions between specific categories and other features, leading to more expressive and accurate trees.
  - Clearer feature importance interpretation:
    > Each category has its own dummy feature, making it possible to directly assess how important each category is to the model, with no hidden or implicit categories.
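To make the multicollinearity point concrete, here is a minimal sketch on a made-up `color` column (the data and variable names are illustrative assumptions, not part of MLArena). With an intercept plus every dummy column, the dummies sum to the intercept, so the design matrix loses a rank; dropping one category restores full column rank.

```python
import numpy as np
import pandas as pd

# Toy categorical feature with three levels (illustrative data only)
colors = pd.Series(["red", "green", "blue", "green", "red", "blue"], name="color")

# Dummy columns with and without a dropped reference category
full = pd.get_dummies(colors, drop_first=False).to_numpy(dtype=float)
reduced = pd.get_dummies(colors, drop_first=True).to_numpy(dtype=float)

# A linear model with an intercept adds a column of ones to the design matrix
intercept = np.ones((len(colors), 1))
X_full = np.hstack([intercept, full])        # intercept + 3 dummies
X_reduced = np.hstack([intercept, reduced])  # intercept + 2 dummies

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(f"all categories: {X_full.shape[1]} columns, rank {rank_full}")
print(f"drop first:     {X_reduced.shape[1]} columns, rank {rank_reduced}")
```

The first matrix has more columns than its rank, which is exactly the perfect multicollinearity that `drop="first"` removes.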

But here is the twist: for binary categories with just two values, dropping one category is *always* better regardless of model type, because one column perfectly represents the split. Having two columns adds no value while increasing dimensionality and rendering the results less interpretable (imagine reading the feature importance of both `gender_male` and `gender_female` in a SHAP plot).
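A quick sketch of why the second column is redundant for a binary feature, using a made-up `sex` column (illustrative data, not the Titanic set loaded later): one dummy is always exactly one minus the other.

```python
import pandas as pd

# Toy binary feature (illustrative data only)
sex = pd.Series(["male", "female", "female", "male"], name="sex")

# Encode both categories as 0/1 columns: sex_female and sex_male
both = pd.get_dummies(sex, prefix="sex", dtype=int)

# Each row has exactly one active dummy, so sex_female == 1 - sex_male
redundant = bool((both["sex_female"] == 1 - both["sex_male"]).all())
print(both)
print("second column is redundant:", redundant)
```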

### MLArena's Easy Solution 🥂

Implementing the above logic separately for linear and tree models takes a bit of coding, but `mlarena` handles it automatically. Specifically, `PreProcessor` includes a `drop_first` parameter:

* when `drop_first=True`
    * sets `drop="first"` for all one-hot encoded features
    * ideal for linear models

* when `drop_first=False`
    * sets `drop="first"` only for binary one-hot encoded features
    * may work better for some tree-based models

This way:
- Binary features stay efficient (always drop one category)
- Multi-value features can be optimized for your particular model depending on its type (linear or tree-based)
- You get to experiment with and optimize encoding with zero hassle

In [11]:
# Standard library imports
import multiprocessing
import os

# Third party imports
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from mlarena import PreProcessor, ML_PIPELINE

# Configure parallel processing
# Only needed when running locally (not required on distributed platforms like Databricks)
n_cores = multiprocessing.cpu_count()
n_jobs = max(1, n_cores // 2)  # Use half of available cores to avoid overloading
os.environ["LOKY_MAX_CPU_COUNT"] = str(n_jobs)

In [2]:
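# Load the Titanic dataset from OpenML
titanic = fetch_openml('titanic', version=1, as_frame=True)
X = titanic.data
y = titanic.target.astype(int)
# Drop columns that leak the outcome (boat, body) or are free-text / high-cardinality
X = X.drop(['boat', 'body', 'home.dest', 'ticket', 'cabin', 'name'], axis=1)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)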

### Experiment with the `drop_first` Parameter

Below is a demo of the `drop_first` parameter with both a LightGBM and a RandomForest classifier.

In [3]:
# define pipeline: LightGBM, dropping the first category of every one-hot feature
mlpipeline_drop = ML_PIPELINE(
    model=lgb.LGBMClassifier(verbose=-1),
    preprocessor=PreProcessor(drop_first=True),
)
# fit pipeline
mlpipeline_drop.fit(X_train, y_train)
# evaluate pipeline and print the metrics report (plots disabled)
results = mlpipeline_drop.evaluate(X_test, y_test, verbose=True, visualize=False)

Classification Metrics Report

Evaluation Parameters:
Threshold: 0.500
Beta:      1.000

Metrics:
Accuracy:  0.817
F1:        0.782
Precision: 0.843
Recall:    0.729
Pos Rate:  0.389

AUC (threshold independent):
AUC:   0.872


In [8]:
X_test_transformed_drop = mlpipeline_drop.preprocessor.transform(X_test)
X_test_transformed_drop.head()

| | pclass | age | sibsp | parch | fare | sex_male | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|
| 1148 | 0.840359 | 0.451334 | -0.495964 | -0.442432 | -0.510089 | 1.0 | 0.0 | 1.0 |
| 1049 | 0.840359 | -0.721918 | 0.456833 | 0.676472 | -0.343626 | 1.0 | 0.0 | 0.0 |
| 982 | 0.840359 | -0.096184 | -0.495964 | -0.442432 | -0.495198 | 1.0 | 0.0 | 1.0 |
| 808 | 0.840359 | -0.096184 | -0.495964 | -0.442432 | -0.492219 | 1.0 | 0.0 | 1.0 |
| 1195 | 0.840359 | -0.096184 | -0.495964 | -0.442432 | -0.498015 | 1.0 | 1.0 | 0.0 |


In [4]:
# define pipeline: LightGBM, dropping a category only for binary features
mlpipeline_no_drop = ML_PIPELINE(
    model=lgb.LGBMClassifier(verbose=-1),
    preprocessor=PreProcessor(drop_first=False),
)
# fit pipeline
mlpipeline_no_drop.fit(X_train, y_train)
# evaluate pipeline and print the metrics report (plots disabled)
results = mlpipeline_no_drop.evaluate(X_test, y_test, verbose=True, visualize=False)

Classification Metrics Report

Evaluation Parameters:
Threshold: 0.500
Beta:      1.000

Metrics:
Accuracy:  0.805
F1:        0.765
Precision: 0.838
Recall:    0.703
Pos Rate:  0.378

AUC (threshold independent):
AUC:   0.876


In [9]:
X_test_transformed_no_drop = mlpipeline_no_drop.preprocessor.transform(X_test)
X_test_transformed_no_drop.head()

| | pclass | age | sibsp | parch | fare | sex_male | embarked_C | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|---|
| 1148 | 0.840359 | 0.451334 | -0.495964 | -0.442432 | -0.510089 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1049 | 0.840359 | -0.721918 | 0.456833 | 0.676472 | -0.343626 | 1.0 | 1.0 | 0.0 | 0.0 |
| 982 | 0.840359 | -0.096184 | -0.495964 | -0.442432 | -0.495198 | 1.0 | 0.0 | 0.0 | 1.0 |
| 808 | 0.840359 | -0.096184 | -0.495964 | -0.442432 | -0.492219 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1195 | 0.840359 | -0.096184 | -0.495964 | -0.442432 | -0.498015 | 1.0 | 0.0 | 1.0 | 0.0 |


In [16]:
# define pipeline: RandomForest, dropping the first category of every one-hot feature
mlpipeline_drop = ML_PIPELINE(
    model=RandomForestClassifier(),
    preprocessor=PreProcessor(drop_first=True),
)
# fit pipeline
mlpipeline_drop.fit(X_train, y_train)
# evaluate pipeline and print the metrics report (plots disabled)
results = mlpipeline_drop.evaluate(X_test, y_test, verbose=True, visualize=False)

Classification Metrics Report

Evaluation Parameters:
Threshold: 0.500
Beta:      1.000

Metrics:
Accuracy:  0.798
F1:        0.758
Precision: 0.822
Recall:    0.703
Pos Rate:  0.385

AUC (threshold independent):
AUC:   0.855


In [17]:
# define pipeline: RandomForest, dropping a category only for binary features
mlpipeline_no_drop = ML_PIPELINE(
    model=RandomForestClassifier(),
    preprocessor=PreProcessor(drop_first=False),
)
# fit pipeline
mlpipeline_no_drop.fit(X_train, y_train)
# evaluate pipeline and print the metrics report (plots disabled)
results = mlpipeline_no_drop.evaluate(X_test, y_test, verbose=True, visualize=False)

Classification Metrics Report

Evaluation Parameters:
Threshold: 0.500
Beta:      1.000

Metrics:
Accuracy:  0.802
F1:        0.759
Precision: 0.837
Recall:    0.695
Pos Rate:  0.374

AUC (threshold independent):
AUC:   0.861
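
To compare the four runs side by side, the figures from the reports above can be collected into a small table. This is only a sketch: the numbers are copied by hand from the printed reports, and the `reported` and `auc_gain` names are illustrative.

```python
import pandas as pd

# Metrics copied by hand from the four reports printed above
reported = pd.DataFrame(
    [
        {"model": "LGBMClassifier", "encoding": "drop_first=True", "accuracy": 0.817, "f1": 0.782, "auc": 0.872},
        {"model": "LGBMClassifier", "encoding": "drop_first=False", "accuracy": 0.805, "f1": 0.765, "auc": 0.876},
        {"model": "RandomForestClassifier", "encoding": "drop_first=True", "accuracy": 0.798, "f1": 0.758, "auc": 0.855},
        {"model": "RandomForestClassifier", "encoding": "drop_first=False", "accuracy": 0.802, "f1": 0.759, "auc": 0.861},
    ]
)

# One row per model, one column per encoding setting
auc_by_setting = reported.pivot(index="model", columns="encoding", values="auc")

# Positive values mean keeping all multi-valued categories gave the higher AUC
auc_gain = auc_by_setting["drop_first=False"] - auc_by_setting["drop_first=True"]

print(auc_by_setting)
print(auc_gain.round(3))
```

In these runs, `drop_first=False` gives a slightly higher AUC for both tree-based models, while accuracy and F1 point in different directions for LightGBM. The gaps are small, and the RandomForest runs are unseeded, so it is worth repeating the comparison with fixed seeds or cross-validation before drawing conclusions.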
