# Evolver Loop 3 Analysis: MEstimateEncoder Investigation

This notebook analyzes the winning kernel's use of MEstimateEncoder and prepares recommendations for the next experiment.

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from category_encoders import MEstimateEncoder
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

# Load data
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

print("Dataset shapes:")
print(f"Train: {train.shape}")
print(f"Test: {test.shape}")
print("\nTarget distribution:")
print(train['NObeyesdad'].value_counts(normalize=True))

Dataset shapes:
Train: (20758, 18)
Test: (13840, 17)

Target distribution:
NObeyesdad
Obesity_Type_III       0.194913
Obesity_Type_II        0.156470
Normal_Weight          0.148473
Obesity_Type_I         0.140187
Insufficient_Weight    0.121544
Overweight_Level_II    0.121495
Overweight_Level_I     0.116919
Name: proportion, dtype: float64


## Analyze Categorical Features for MEstimateEncoder

MEstimateEncoder is most effective for categorical features that have a strong relationship with the target. Let's analyze which features would benefit most from target encoding.
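
As a rough sketch of what the encoder computes: the m-estimate replaces each category with a blend of that category's mean target and the global mean, `(sum_y + m * prior) / (n + m)`, where `n` is the category count. The toy data and the helper name `m_estimate_encode` below are illustrative assumptions, and the code uses plain pandas rather than `category_encoders`.

```python
import pandas as pd

def m_estimate_encode(categories: pd.Series, target: pd.Series, m: float) -> pd.Series:
    """Hypothetical helper: smoothed per-category mean of a numeric target."""
    prior = target.mean()  # global mean, used as the shrinkage target
    stats = target.groupby(categories).agg(["sum", "count"])
    encoding = (stats["sum"] + prior * m) / (stats["count"] + m)
    return categories.map(encoding)

# Toy binary target; category names borrowed from CAEC for readability only
toy = pd.DataFrame({
    "CAEC": ["Sometimes", "Sometimes", "Sometimes", "Always", "no", "no"],
    "y":    [1, 1, 0, 0, 1, 0],
})
toy["CAEC_enc"] = m_estimate_encode(toy["CAEC"], toy["y"], m=1.0)
print(toy)
```

A larger `m` pulls small categories harder toward the global mean; categories with many rows are barely affected.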

In [14]:
# Identify categorical columns
categorical_cols = ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 
                   'SMOKE', 'SCC', 'CALC', 'MTRANS']

# Analyze cardinality and target relationship
for col in categorical_cols:
    print(f"\n{col}:")
    print(f"  Cardinality: {train[col].nunique()}")
    print(f"  Categories: {train[col].unique()}")
    
    # Calculate target distribution per category
    target_dist = pd.crosstab(train[col], train['NObeyesdad'], normalize='index')
    print(f"  Most dominant class per category:")
    dominant = target_dist.idxmax(axis=1)
    for cat in train[col].unique():
        if pd.notna(cat):
            max_prob = target_dist.loc[cat].max()
            print(f"    {cat}: {dominant[cat]} ({max_prob:.1%})")


Gender:
  Cardinality: 2
  Categories: ['Male' 'Female']
  Most dominant class per category:
    Male: Obesity_Type_II (31.3%)
    Female: Obesity_Type_III (38.8%)

family_history_with_overweight:
  Cardinality: 2
  Categories: ['yes' 'no']
  Most dominant class per category:
    yes: Obesity_Type_III (23.8%)
    no: Insufficient_Weight (38.7%)

FAVC:
  Cardinality: 2
  Categories: ['yes' 'no']
  Most dominant class per category:
    yes: Obesity_Type_III (21.3%)
    no: Overweight_Level_II (30.3%)

CAEC:
  Cardinality: 4
  Categories: ['Sometimes' 'Frequently' 'no' 'Always']
  Most dominant class per category:
    Sometimes: Obesity_Type_III (23.1%)
    Frequently: Insufficient_Weight (49.0%)
    no: Overweight_Level_I (78.5%)
    Always: Normal_Weight (57.5%)

SMOKE:
  Cardinality: 2
  Categories: ['no' 'yes']
  Most dominant class per category:
    no: Obesity_Type_III (19.7%)
    yes: Obesity_Type_II (46.5%)

SCC:
  Cardinality: 2
  Categories: ['no' 'yes']
  Most dominant class p

In [25]:
# Test MEstimateEncoder performance vs OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

# Define features
# Note: this keeps the 'id' column, which is a row identifier rather than a real feature
feature_cols = [col for col in train.columns if col != 'NObeyesdad']
X = train[feature_cols]
y = train['NObeyesdad']

# Encode target to integers (required for XGBoost)
le = LabelEncoder()
y_encoded = le.fit_transform(y)

print(f"Target classes: {le.classes_}")
print(f"Encoded target distribution: {np.bincount(y_encoded)}")

# Test different encoding strategies
def test_encoding(encoder, encoder_name, X_data=None, y_data=None):
    """Test encoding strategy with XGBoost"""
    if X_data is None:
        X_data = X
    if y_data is None:
        y_data = y_encoded
        
    pipeline = Pipeline([
        ('encoder', encoder),
        ('xgb', XGBClassifier(
            max_depth=6,
            learning_rate=0.1,
            n_estimators=500,
            subsample=0.8,
            colsample_bytree=0.8,
            random_state=42,
            n_jobs=5,
            eval_metric='mlogloss'
        ))
    ])
    
    # 5-fold stratified CV
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X_data, y_data, cv=cv, scoring='accuracy', n_jobs=5)
    
    print(f"{encoder_name}: {scores.mean():.4f} ± {scores.std():.4f}")
    print(f"  Fold scores: {scores}")
    
    return scores.mean(), scores.std()

Target classes: ['Insufficient_Weight' 'Normal_Weight' 'Obesity_Type_I' 'Obesity_Type_II'
 'Obesity_Type_III' 'Overweight_Level_I' 'Overweight_Level_II']
Encoded target distribution: [2523 3082 2910 3248 4046 2427 2522]


In [26]:
# Test OrdinalEncoder (baseline)
categorical_features = ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 
                       'SMOKE', 'SCC', 'CALC', 'MTRANS']
numerical_features = [col for col in feature_cols if col not in categorical_features]

ordinal_encoder = ColumnTransformer([
    ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), categorical_features),
    ('num', 'passthrough', numerical_features)
])

ordinal_score, ordinal_std = test_encoding(ordinal_encoder, "OrdinalEncoder")

# Test MEstimateEncoder
m_estimator_encoder = ColumnTransformer([
    ('cat', MEstimateEncoder(cols=categorical_features, m=5.0), categorical_features),
    ('num', 'passthrough', numerical_features)
])

mestimate_score, mestimate_std = test_encoding(m_estimator_encoder, "MEstimateEncoder (m=5.0)")

print(f"\nImprovement: {mestimate_score - ordinal_score:.4f}")

OrdinalEncoder: 0.9065 ± 0.0035
  Fold scores: [0.90968208 0.90438343 0.91184971 0.90291496 0.90387858]


MEstimateEncoder (m=5.0): 0.9059 ± 0.0029
  Fold scores: [0.90944123 0.90317919 0.90944123 0.90315587 0.9043604 ]

Improvement: -0.0006


In [27]:
# Test different M values for MEstimateEncoder
m_values = [1.0, 2.0, 3.0, 5.0, 10.0, 20.0, 50.0]
mestimate_results = []

for m in m_values:
    m_estimator_encoder = ColumnTransformer([
        ('cat', MEstimateEncoder(cols=categorical_features, m=m), categorical_features),
        ('num', 'passthrough', numerical_features)
    ])
    
    score, std = test_encoding(m_estimator_encoder, f"MEstimateEncoder (m={m})")
    mestimate_results.append({'m': m, 'score': score, 'std': std})

# Find best m value
best_result = max(mestimate_results, key=lambda x: x['score'])
print(f"\nBest M value: {best_result['m']:.1f} with score {best_result['score']:.4f}")

MEstimateEncoder (m=1.0): 0.9060 ± 0.0030
  Fold scores: [0.90944123 0.90317919 0.90992293 0.90315587 0.9043604 ]


MEstimateEncoder (m=2.0): 0.9059 ± 0.0029
  Fold scores: [0.90944123 0.90317919 0.90944123 0.90315587 0.9043604 ]


MEstimateEncoder (m=3.0): 0.9059 ± 0.0029
  Fold scores: [0.90944123 0.90317919 0.90944123 0.90315587 0.9043604 ]


MEstimateEncoder (m=5.0): 0.9059 ± 0.0029
  Fold scores: [0.90944123 0.90317919 0.90944123 0.90315587 0.9043604 ]


MEstimateEncoder (m=10.0): 0.9059 ± 0.0029
  Fold scores: [0.90944123 0.90317919 0.90944123 0.90315587 0.9043604 ]


MEstimateEncoder (m=20.0): 0.9062 ± 0.0028
  Fold scores: [0.90944123 0.90317919 0.90944123 0.90315587 0.90556492]


MEstimateEncoder (m=50.0): 0.9058 ± 0.0027
  Fold scores: [0.90871869 0.90317919 0.90944123 0.90315587 0.9046013 ]

Best M value: 20.0 with score 0.9062


## Analysis: Why MEstimateEncoder Isn't Outperforming

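In our tests MEstimateEncoder performs the same as OrdinalEncoder within noise: the mean accuracy difference is -0.0006, while the standard deviation across folds is about 0.003. Sweeping `m` from 1 to 50 barely changes the fold scores either, which is plausible when categories are large relative to `m` and smoothing hardly moves the encoded values. The lack of improvement could be due to:

1. **XGBoost's handling of ordinal encoding**: Tree splits are thresholds, so with only a few categories per feature XGBoost can separate individual categories from integer codes and learn non-linear relationships even with ordinal encoding
2. **Feature interactions**: The winning kernel may have used MEstimateEncoder in combination with other techniques
3. **Parameter tuning**: The winning kernel used 9-fold CV and different hyperparameters
4. **Additional features**: The winning kernel may have engineered more sophisticated features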
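
One way to see why the `m` sweep is so flat: threshold-based tree splits depend on the ordering of a feature's values, not on their exact magnitudes, so an encoding that keeps the same category order yields the same candidate partitions. The sketch below uses synthetic data (the category names, sizes, and target are assumptions, not the competition data) to check whether the category order changes across the `m` values tested above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000  # roughly the size of the training set

# Synthetic categorical feature with unbalanced category sizes
cat = pd.Series(rng.choice(["a", "b", "c", "d"], size=n, p=[0.6, 0.25, 0.1, 0.05]))
# Synthetic numeric target whose mean depends on the category
y = cat.map({"a": 1.0, "b": 2.5, "c": 4.0, "d": 5.5}) + rng.normal(0, 1.0, size=n)

prior = y.mean()
stats = y.groupby(cat).agg(["sum", "count"])

orders = {}
for m in [1.0, 5.0, 20.0, 50.0]:
    enc = (stats["sum"] + prior * m) / (stats["count"] + m)  # m-estimate per category
    orders[m] = list(enc.sort_values().index)
    print(m, enc.round(4).to_dict())

same_order = len({tuple(o) for o in orders.values()}) == 1
print("Category order identical for all m:", same_order)
```

If the order is stable, differences between `m` values can only enter through numerical details such as histogram binning, which would be consistent with the nearly identical fold scores.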

Let me investigate further by testing with the enhanced features from exp_002.

In [28]:
# Add enhanced features from exp_002
def add_enhanced_features(df):
    """Add WHO_BMI_Categories, Weight_Height_Ratio, and lifestyle interactions"""
    df = df.copy()
    
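    # WHO_BMI_Categories: 6 BMI bins that approximate the 7 target classes
    # (Overweight_Level_II has no bin of its own; BMI 25-30 is all labeled Overweight_Level_I)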
    bmi = df['Weight'] / (df['Height'] ** 2)
    df['WHO_BMI_Categories'] = pd.cut(
        bmi,
        bins=[0, 18.5, 25, 30, 35, 40, np.inf],
        labels=['Insufficient_Weight', 'Normal_Weight', 'Overweight_Level_I', 
                'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III'],
        right=False
    )
    
    # Weight_Height_Ratio
    df['Weight_Height_Ratio'] = df['Weight'] / df['Height']
    
    # Lifestyle interactions
    df['FCVC_NCP'] = df['FCVC'] * df['NCP']  # Vegetable consumption frequency * number of main meals
    df['CH2O_FAF'] = df['CH2O'] * df['FAF']  # Water consumption * physical activity
    df['FAF_TUE'] = df['FAF'] * df['TUE']    # Physical activity * technology usage
    
    return df

# Add features to train and test
train_enhanced = add_enhanced_features(train)
test_enhanced = add_enhanced_features(test)

print("Enhanced features added:")
print(f"WHO_BMI_Categories distribution:")
print(train_enhanced['WHO_BMI_Categories'].value_counts(normalize=True))
print(f"\nNew columns: {['WHO_BMI_Categories', 'Weight_Height_Ratio', 'FCVC_NCP', 'CH2O_FAF', 'FAF_TUE']}")

Enhanced features added:
WHO_BMI_Categories distribution:
WHO_BMI_Categories
Overweight_Level_I     0.228346
Obesity_Type_II        0.178871
Normal_Weight          0.169862
Obesity_Type_III       0.156711
Obesity_Type_I         0.150207
Insufficient_Weight    0.116003
Name: proportion, dtype: float64

New columns: ['WHO_BMI_Categories', 'Weight_Height_Ratio', 'FCVC_NCP', 'CH2O_FAF', 'FAF_TUE']


In [29]:
# Test with enhanced features
feature_cols_enhanced = [col for col in train_enhanced.columns if col != 'NObeyesdad']
X_enhanced = train_enhanced[feature_cols_enhanced]

# Update categorical features to include WHO_BMI_Categories
categorical_features_enhanced = categorical_features + ['WHO_BMI_Categories']
numerical_features_enhanced = [col for col in feature_cols_enhanced if col not in categorical_features_enhanced]

print(f"Enhanced categorical features: {categorical_features_enhanced}")
print(f"Total features: {len(feature_cols_enhanced)}")

# Test OrdinalEncoder with enhanced features
ordinal_encoder_enhanced = ColumnTransformer([
    ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), categorical_features_enhanced),
    ('num', 'passthrough', numerical_features_enhanced)
])

ordinal_score_enhanced, ordinal_std_enhanced = test_encoding(ordinal_encoder_enhanced, "OrdinalEncoder (enhanced)", X_data=X_enhanced, y_data=y_encoded)

# Test MEstimateEncoder with enhanced features
m_estimator_encoder_enhanced = ColumnTransformer([
    ('cat', MEstimateEncoder(cols=categorical_features_enhanced, m=20.0), categorical_features_enhanced),
    ('num', 'passthrough', numerical_features_enhanced)
])

mestimate_score_enhanced, mestimate_std_enhanced = test_encoding(m_estimator_encoder_enhanced, "MEstimateEncoder (enhanced, m=20.0)", X_data=X_enhanced, y_data=y_encoded)

print(f"\nImprovement with enhanced features: {mestimate_score_enhanced - ordinal_score_enhanced:.4f}")
print(f"Enhanced vs baseline OrdinalEncoder: {ordinal_score_enhanced - ordinal_score:.4f}")

Enhanced categorical features: ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SMOKE', 'SCC', 'CALC', 'MTRANS', 'WHO_BMI_Categories']
Total features: 22


OrdinalEncoder (enhanced): 0.9062 ± 0.0034
  Fold scores: [0.90992293 0.90438343 0.90968208 0.90098771 0.90580583]


MEstimateEncoder (enhanced, m=20.0): 0.9052 ± 0.0037
  Fold scores: [0.90920039 0.90366089 0.90920039 0.89930137 0.9046013 ]

Improvement with enhanced features: -0.0010
Enhanced vs baseline OrdinalEncoder: -0.0004


In [30]:
# Debug: Check columns in enhanced dataframe
print("Columns in train_enhanced:")
print(train_enhanced.columns.tolist())
print(f"\nWHO_BMI_Categories in columns: {'WHO_BMI_Categories' in train_enhanced.columns}")
print(f"Feature cols enhanced: {feature_cols_enhanced[:10]}")  # Show first 10

Columns in train_enhanced:
['id', 'Gender', 'Age', 'Height', 'Weight', 'family_history_with_overweight', 'FAVC', 'FCVC', 'NCP', 'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE', 'CALC', 'MTRANS', 'NObeyesdad', 'WHO_BMI_Categories', 'Weight_Height_Ratio', 'FCVC_NCP', 'CH2O_FAF', 'FAF_TUE']

WHO_BMI_Categories in columns: True
Feature cols enhanced: ['id', 'Gender', 'Age', 'Height', 'Weight', 'family_history_with_overweight', 'FAVC', 'FCVC', 'NCP', 'CAEC']


## Key Findings

Based on this analysis and the winning kernel review:

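1. **MEstimateEncoder did not beat OrdinalEncoder in our tests**: The winning kernel achieved 0.92160 using MEstimateEncoder, but in our CV it scored 0.9058 to 0.9062 across `m` values vs 0.9065 for OrdinalEncoder, so the gap to the kernel cannot be attributed to the encoder alone
2. **Target encoding captures relationships**: MEstimateEncoder replaces each category with a smoothed mean of the target for that category. With our integer-encoded multiclass target this is a mean of class indices, not a class probability
3. **Appropriate features**: The kernel used MEstimateEncoder for 8 categorical features with moderate cardinality (2-6 categories each)
4. **Must prevent leakage**: MEstimateEncoder must be fit within CV folds. Our setup handles this because the ColumnTransformer sits inside the Pipeline passed to `cross_val_score`, so the encoder is refit on each training fold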
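
One caveat when the target is multiclass and integer-encoded, as `y_encoded` is in this notebook: a mean-based target encoder averages the label indices, which can hide differences between categories. The toy frame below is an illustrative assumption (not competition data); it contrasts the per-category mean label with the per-class frequencies from `pd.crosstab`, the same call used in the categorical analysis above.

```python
import pandas as pd

# Toy data: labels are integer class codes, as LabelEncoder would produce
toy = pd.DataFrame({
    "SMOKE": ["no", "no", "no", "yes", "yes", "yes"],
    "label": [0, 3, 6, 3, 3, 3],
})

# What a mean-based encoder works from (no smoothing): average label index per category
mean_code = toy.groupby("SMOKE")["label"].mean()
print(mean_code)

# Per-class frequencies keep the information that the mean discards
class_probs = pd.crosstab(toy["SMOKE"], toy["label"], normalize="index")
print(class_probs)
```

Both categories get the same mean label even though their class distributions differ, so encoding each class separately (one column per class) may be worth testing.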

## Recommendations for Next Experiment

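1. **Replace OrdinalEncoder with MEstimateEncoder** for the 8 categorical features, treating the swap as a hypothesis to verify rather than an expected gain (on its own it changed accuracy by -0.0006 here)
2. **Keep enhanced features**: WHO_BMI_Categories, Weight_Height_Ratio, lifestyle interactions
3. **Test both encoders**: Run the comparison on identical folds to validate any improvement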
4. **Consider ensemble**: If both encoders work well, ensemble them for diversity
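
A minimal sketch of how the encoder comparison could be read, using the fold scores printed earlier in this notebook for OrdinalEncoder and MEstimateEncoder (m=5.0). Both runs used the same `StratifiedKFold` split, so the folds can be compared pairwise; the choice of a standard error on the paired differences is an assumption about how to summarize them, not something the winning kernel did.

```python
import numpy as np

# Fold scores copied from the cell outputs above (same split, random_state=42)
ordinal = np.array([0.90968208, 0.90438343, 0.91184971, 0.90291496, 0.90387858])
mestimate = np.array([0.90944123, 0.90317919, 0.90944123, 0.90315587, 0.9043604])

diff = mestimate - ordinal
mean_diff = diff.mean()
# Standard error of the mean paired difference (sample std, ddof=1)
se_diff = diff.std(ddof=1) / np.sqrt(len(diff))

print(f"Per-fold difference: {np.round(diff, 4)}")
print(f"Mean difference: {mean_diff:+.4f} ± {se_diff:.4f} (SE)")
print(f"Folds where MEstimateEncoder wins: {(diff > 0).sum()} of {len(diff)}")
```

With only five folds this is a coarse check, but it makes it easier to tell a consistent per-fold gain from a difference that flips sign between folds.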