# ML Tips & Best Practices

This notebook discusses ML best practices and how to easily implement them using `mlarena`. 

In [None]:
# Standard library imports
import multiprocessing
import os

# Third party imports
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
import mlflow
mlflow.autolog(disable=True)

from mlarena import PreProcessor, ML_PIPELINE

# Configure parallel processing
# Only needed when running locally (not required on distributed platforms like Databricks)
n_cores = multiprocessing.cpu_count()
n_jobs = max(1, n_cores // 2)  # Use half of available cores to avoid overloading
os.environ["LOKY_MAX_CPU_COUNT"] = str(n_jobs)

## Smart One-Hot Encoding 🎭

### The Tale of Two Models

Linear models and tree-based models have different preferences for dropping categories in one-hot encoding:

* **Linear Models** prefer having one category dropped:
  - Avoids perfect multicollinearity
    > When all dummy variables are included, they sum to 1 and duplicate the intercept, creating perfect multicollinearity. Without regularization, the coefficients are then not uniquely determined.
  - Makes coefficients more interpretable
    > Dropping one category establishes it as the reference point, so each coefficient shows the effect compared to that baseline category.
  - Improves numerical stability
    > Removing redundant information improves matrix conditioning, leading to more stable and reliable parameter estimates.
  
* **Tree Models** 🌲 may prefer having all categories:
  - Can directly split on any category
    > Each split tests one feature at a time. A dropped category is only identifiable as "all other dummy features are zero", which takes a tree several consecutive splits to isolate. Keeping all categories lets the model split on each one directly.
  - Clearer feature importance interpretation
    > Each category has its own dummy feature, so the importance of every category can be read off directly, with no hidden or implicit categories.

However, for binary categories (with just two values), keeping only one column is generally more efficient regardless of model type. One column perfectly represents the information, while two columns would be redundant. ⚖️

It's worth noting that while these preferences exist, the choice between dropping categories or keeping them all is typically not a critical decision that dramatically impacts model performance. These are technical considerations that may offer incremental improvements, particularly for model interpretability and stability rather than substantial performance gains.
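
The multicollinearity point can be checked numerically. The following is a minimal sketch using only NumPy on a made-up three-level feature (the data and variable names are illustrative assumptions, not part of `mlarena`). It builds a design matrix with an intercept and compares the matrix rank with and without one dummy column dropped.

```python
import numpy as np

# Toy categorical feature with three levels (hypothetical data for illustration)
levels = np.array(["C", "Q", "S", "S", "C", "Q", "S", "C"])
categories = np.unique(levels)  # sorted unique levels: C, Q, S

# Full one-hot encoding: one column per category
full_dummies = (levels[:, None] == categories[None, :]).astype(float)

# Design matrices with an intercept column prepended
intercept = np.ones((len(levels), 1))
X_full = np.hstack([intercept, full_dummies])            # intercept + 3 dummies
X_dropped = np.hstack([intercept, full_dummies[:, 1:]])  # intercept + 2 dummies

# Every row of the full dummy block sums to 1, which duplicates the intercept
row_sums = full_dummies.sum(axis=1)

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)

print(f"Row sums of full dummies: {row_sums}")
print(f"Full encoding: {X_full.shape[1]} columns, rank {rank_full}")
print(f"Dropped encoding: {X_dropped.shape[1]} columns, rank {rank_dropped}")
```

The full encoding has more columns than its rank, so one column is redundant. After dropping one category, the column count and the rank match.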

### An Elegant Solution 🥂

Scikit-learn's `OneHotEncoder` handles this through its `drop` parameter:

* `drop="first"`:
  - Drops the first category of every feature
  - Ideal for linear models
  - More compact representation

* `drop="if_binary"`:
  - Only drops one category for binary features
  - Keeps all categories for multi-value features
  - Can be effective for tree-based models

This way you can optimize the encoding strategy based on your model type while maintaining efficient encoding for binary features. 

In [2]:
# Load data
titanic = fetch_openml("titanic", version=1, as_frame=True)
X = titanic.data
y = titanic.target.astype(int)
X = X.drop(["boat", "body", "home.dest", "ticket", "cabin", "name"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

### Quick Demo of the `drop` Parameter

The demo below compares two settings of the `drop` parameter with a tree-based algorithm (LightGBM). With `mlarena`, the setting is passed to the `PreProcessor` constructor:
* `drop="first"`: Drops the first category for all categorical features
* `drop="if_binary"`: Only drops one category for binary features, keeps all categories for multi-value features

The results show:
* In this run, LightGBM scores slightly higher with `drop="if_binary"` (AUC 0.88 vs. 0.87), consistent with the reasons discussed above
* The difference is small, so it's worth testing both approaches for your specific use case
* Binary features like `sex` have one category dropped in both cases (as seen in the output)
* Multi-value features like `embarked` retain all categories with `drop="if_binary"`

In [3]:
# Define, fit, and evaluate a pipeline that drops the first category of every categorical feature
mlpipeline_drop = ML_PIPELINE(
    model=lgb.LGBMClassifier(verbose=-1), preprocessor=PreProcessor(drop="first")
)
mlpipeline_drop.fit(X_train, y_train)
results_drop = mlpipeline_drop.evaluate(
    X_test, y_test, verbose=False, visualize=False
)

In [5]:
# Define, fit, and evaluate a pipeline that drops a category only for binary features
mlpipeline_drop_binary_only = ML_PIPELINE(
    model=lgb.LGBMClassifier(verbose=-1), preprocessor=PreProcessor(drop="if_binary")
)
mlpipeline_drop_binary_only.fit(X_train, y_train)
results_drop_binary_only = mlpipeline_drop_binary_only.evaluate(
    X_test, y_test, verbose=False, visualize=False
)

In [7]:
# Compare results
print(f"AUC when drop='first': {results_drop['auc']:.2f}")
print(f"AUC when drop='if_binary': {results_drop_binary_only['auc']:.2f}")

AUC when drop='first': 0.87
AUC when drop='if_binary': 0.88
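
A gap of 0.01 AUC on a single train/test split may not be stable, and cross-validation is one way to check it. The sketch below is not the notebook's method. It makes several assumptions to stay self-contained: plain scikit-learn instead of `mlarena`, a synthetic stand-in for a few Titanic-like columns, and `RandomForestClassifier` instead of LightGBM. The `cv_auc` helper is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data (assumption): one binary, one three-level, one numeric feature
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame(
    {
        "sex": rng.choice(["female", "male"], size=n),
        "embarked": rng.choice(["C", "Q", "S"], size=n),
        "fare": rng.normal(size=n),
    }
)
# Synthetic target loosely tied to the features, plus noise
signal = (
    (data["sex"] == "female") * 1.0
    + (data["embarked"] == "C") * 0.8
    + 0.5 * data["fare"]
)
target = (signal + rng.normal(scale=0.8, size=n) > 0.9).astype(int)


def cv_auc(drop):
    """Return 5-fold ROC AUC scores for a tree ensemble with the given drop setting."""
    encoder = ColumnTransformer(
        [("onehot", OneHotEncoder(drop=drop), ["sex", "embarked"])],
        remainder="passthrough",  # keep the numeric column unchanged
    )
    model = make_pipeline(
        encoder, RandomForestClassifier(n_estimators=50, random_state=0)
    )
    return cross_val_score(model, data, target, cv=5, scoring="roc_auc")


scores = {drop: cv_auc(drop) for drop in ("first", "if_binary")}
for drop, auc in scores.items():
    print(f"drop={drop!r}: mean AUC {auc.mean():.3f} (std {auc.std():.3f})")
```

If the difference between the two means is smaller than the fold-to-fold standard deviation, the choice of `drop` is unlikely to matter much for that dataset.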


In [12]:
X_test_transformed_drop = mlpipeline_drop.preprocessor.transform(X_test)
X_test_transformed_drop_binary_only = mlpipeline_drop_binary_only.preprocessor.transform(X_test)
# Columns produced with drop="if_binary" that are absent with drop="first"
extra_columns = [
    col for col in X_test_transformed_drop_binary_only.columns
    if col not in X_test_transformed_drop.columns
]
print(
    f"When drop is set to be 'if_binary' (vs 'first'), the additional column in transformed feature set is {extra_columns}"
    f"\nThe binary sex feature will still be left with only one column."
)

When drop is set to be 'if_binary' (vs 'first'), the additional column in transformed feature set is ['embarked_C']
The binary sex feature will still be left with only one column.


In [11]:
# As expected, one column was dropped for the sex feature when drop is set to be "if_binary"
X_test_transformed_drop_binary_only.head()

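|      | pclass   | age       | sibsp     | parch     | fare      | sex_male | embarked_C | embarked_Q | embarked_S |
|------|----------|-----------|-----------|-----------|-----------|----------|------------|------------|------------|
| 1148 | 0.840359 | 0.451334  | -0.495964 | -0.442432 | -0.510089 | 1.0      | 0.0        | 0.0        | 1.0        |
| 1049 | 0.840359 | -0.721918 | 0.456833  | 0.676472  | -0.343626 | 1.0      | 1.0        | 0.0        | 0.0        |
| 982  | 0.840359 | -0.096184 | -0.495964 | -0.442432 | -0.495198 | 1.0      | 0.0        | 0.0        | 1.0        |
| 808  | 0.840359 | -0.096184 | -0.495964 | -0.442432 | -0.492219 | 1.0      | 0.0        | 0.0        | 1.0        |
| 1195 | 0.840359 | -0.096184 | -0.495964 | -0.442432 | -0.498015 | 1.0      | 0.0        | 1.0        | 0.0        |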
