# Evolver Loop 4 Analysis: Diagnosing Declining Scores & Winning Kernel Strategy

This notebook analyzes why our scores are declining and extracts the full strategy from the winning kernel.

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Load session state to understand experiment history
session_file = Path('/home/code/session_state.json')
with open(session_file, 'r') as f:
    session_state = json.load(f)

print("=== EXPERIMENT HISTORY ===")
for exp in session_state['experiments']:
    print(f"{exp['id']}: {exp['score']} - {exp['name']}")
    print(f"  Notes: {exp['notes'][:200]}...")
    print()

print(f"\nScore Trend: {session_state['experiments'][0]['score']} → {session_state['experiments'][-1]['score']}")
print(f"Decline: {session_state['experiments'][0]['score'] - session_state['experiments'][-1]['score']:.4f}")

=== EXPERIMENT HISTORY ===
exp_000: 0.9071 - 001_baseline_xgboost
  Notes: Baseline XGBoost model with basic feature engineering. Features: BMI (critical), age groups (5 bins), simple interactions (Age*Height, Age*Weight). Categorical encoding with LabelEncoder. 5-fold strat...

exp_001: 0.906 - exp_002_enhanced_features
  Notes: Enhanced feature engineering with WHO_BMI_Categories (71.88% standalone accuracy), Weight_Height_Ratio, and lifestyle interactions (FCVC_NCP, CH2O_FAF, FAF_TUE). CRITICAL FIX: Replaced LabelEncoder wi...

exp_002: 0.9052 - exp_003_winning_kernel_approach
  Notes: Implemented winning kernel approach with MEstimateEncoder on 8 categorical features, BMI calculation, column rounding, and 9-fold CV. Used XGBoost with Optuna-tuned hyperparameters (n_estimators=131, ...


Score Trend: 0.9071 → 0.9052
Decline: 0.0019
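
Before reading too much into this trend, it may help to compare the 0.0019 drop against the sampling noise of a single score. The sketch below assumes the scores are accuracies and uses the binomial standard error sqrt(p(1-p)/n) as a rough yardstick. The row count `n_samples = 20_000` is a placeholder assumption (the training-set size is not recorded in the output above), and `accuracy_standard_error` is a hypothetical helper name. All three experiments are scored on the same rows, so their errors are correlated; treat this as an order-of-magnitude check, not a significance test.

```python
import math

def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy estimated on n_samples rows."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

# Scores copied from the experiment history above.
first_score, last_score = 0.9071, 0.9052
decline = first_score - last_score

# ASSUMPTION: placeholder row count; replace with the real training-set size.
n_samples = 20_000

se = accuracy_standard_error(first_score, n_samples)
print(f"Decline: {decline:.4f}")
print(f"Approx. standard error of one score: {se:.4f}")
print(f"Decline in standard errors: {decline / se:.2f}")
```

Under the placeholder row count the decline is about 0.9 standard errors, which would suggest the three experiments are closer to indistinguishable than the trend implies.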


## 1. Why Are Scores Declining?

To see how much the three experiments actually differ, we compare their predicted class distributions and the share of rows on which their submissions agree.

In [2]:
# Load experiment details
exp_000 = pd.read_csv('/home/code/submission_candidates/candidate_000.csv')
exp_001 = pd.read_csv('/home/code/submission_candidates/candidate_001.csv')
exp_002 = pd.read_csv('/home/code/submission_candidates/candidate_002.csv')

print("=== SUBMISSION COMPARISON ===")
print("\nClass distribution comparison:")
print("exp_000 (baseline):")
print(exp_000['NObeyesdad'].value_counts().sort_index())
print("\nexp_001 (enhanced features):")
print(exp_001['NObeyesdad'].value_counts().sort_index())
print("\nexp_002 (MEstimateEncoder):")
print(exp_002['NObeyesdad'].value_counts().sort_index())

# Check if predictions are actually different
# (row-wise comparison; assumes every file lists the same ids in the same order)
print("\n=== PREDICTION OVERLAP ===")
print(f"exp_000 vs exp_001: {(exp_000['NObeyesdad'] == exp_001['NObeyesdad']).mean():.1%} same predictions")
print(f"exp_001 vs exp_002: {(exp_001['NObeyesdad'] == exp_002['NObeyesdad']).mean():.1%} same predictions")
print(f"exp_000 vs exp_002: {(exp_000['NObeyesdad'] == exp_002['NObeyesdad']).mean():.1%} same predictions")

=== SUBMISSION COMPARISON ===

Class distribution comparison:
exp_000 (baseline):
NObeyesdad
Insufficient_Weight    1715
Normal_Weight          2122
Obesity_Type_I         2070
Obesity_Type_II        2116
Obesity_Type_III       2625
Overweight_Level_I     1442
Overweight_Level_II    1750
Name: count, dtype: int64

exp_001 (enhanced features):
NObeyesdad
Insufficient_Weight    1716
Normal_Weight          2110
Obesity_Type_I         2063
Obesity_Type_II        2122
Obesity_Type_III       2623
Overweight_Level_I     1457
Overweight_Level_II    1749
Name: count, dtype: int64

exp_002 (MEstimateEncoder):
NObeyesdad
Insufficient_Weight    1726
Normal_Weight          2117
Obesity_Type_I         2066
Obesity_Type_II        2112
Obesity_Type_III       2625
Overweight_Level_I     1424
Overweight_Level_II    1770
Name: count, dtype: int64

=== PREDICTION OVERLAP ===
exp_000 vs exp_001: 98.8% same predictions
exp_001 vs exp_002: 97.0% same predictions
exp_000 vs exp_002: 97.2% same predictions
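
Agreement rates show how often two submissions differ, but not where. A minimal sketch of a disagreement table follows. The two label lists are toy data invented for illustration (in this notebook they would be `exp_000['NObeyesdad']` and `exp_002['NObeyesdad']`, assuming both files share the same row order), and `disagreement_counts` is a hypothetical helper name.

```python
from collections import Counter

def disagreement_counts(preds_a, preds_b):
    """Count (label_a, label_b) pairs for rows where two submissions disagree."""
    if len(preds_a) != len(preds_b):
        raise ValueError("submissions must have the same number of rows")
    return Counter((a, b) for a, b in zip(preds_a, preds_b) if a != b)

# Toy predictions, NOT the real submissions.
preds_a = ["Normal_Weight", "Overweight_Level_I", "Overweight_Level_I",
           "Obesity_Type_I", "Overweight_Level_II", "Normal_Weight"]
preds_b = ["Normal_Weight", "Overweight_Level_II", "Overweight_Level_II",
           "Obesity_Type_I", "Overweight_Level_II", "Overweight_Level_I"]

counts = disagreement_counts(preds_a, preds_b)
agreement = 1 - sum(counts.values()) / len(preds_a)

print(f"Agreement: {agreement:.1%}")
# Most frequent disagreement pairs first.
for (label_a, label_b), n in counts.most_common():
    print(f"{label_a} -> {label_b}: {n}")
```

In the class counts above, the largest shifts between exp_000 and exp_002 are in the two Overweight classes, so that is where such a table would likely be most informative.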


## 2. Winning Kernel Deep Dive

Let's extract the complete strategy from the winning kernel (0.92160 score).

In [3]:
import json

# Load winning kernel
kernel_path = '/home/code/research/kernels/chinmayadatt_obesity-risk-prediction-multi-class-0-92160/obesity-risk-prediction-multi-class-0-92160.ipynb'
with open(kernel_path, 'r') as f:
    winning_kernel = json.load(f)

print("=== WINNING KERNEL STRATEGY ===")
print(f"Number of cells: {len(winning_kernel['cells'])}")

# Extract key components
models_used = []
preprocessing_steps = []
ensemble_method = None

for i, cell in enumerate(winning_kernel['cells']):
    if cell['cell_type'] == 'code':
        source = ''.join(cell['source'])
        
        # Identify models (elif chain: at most one model is recorded per cell)
        if 'LGBMClassifier' in source and 'make_pipeline' in source:
            models_used.append(f"Cell {i}: LGBM Pipeline")
        elif 'XGBClassifier' in source and 'make_pipeline' in source:
            models_used.append(f"Cell {i}: XGB Pipeline")
        elif 'CatBoostClassifier' in source and 'make_pipeline' in source:
            models_used.append(f"Cell {i}: CatBoost Pipeline")
        elif 'RandomForestClassifier' in source and 'make_pipeline' in source:
            models_used.append(f"Cell {i}: RandomForest Pipeline")
            
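        # Identify preprocessing (elif chain: only the first matching step per cell
        # is recorded, so a cell that defines several steps is listed once)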
        if 'MEstimateEncoder' in source:
            preprocessing_steps.append(f"Cell {i}: MEstimateEncoder")
        elif 'age_rounder' in source:
            preprocessing_steps.append(f"Cell {i}: Age rounding")
        elif 'height_rounder' in source:
            preprocessing_steps.append(f"Cell {i}: Height rounding")
        elif 'extract_features' in source:
            preprocessing_steps.append(f"Cell {i}: BMI extraction")
            
        # Identify ensemble
        if 'weights' in source and 'predict_list' in source:
            ensemble_method = f"Cell {i}: Weighted ensemble"

print("\nModels used:")
for model in models_used:
    print(f"  - {model}")

print("\nPreprocessing steps:")
for step in preprocessing_steps:
    print(f"  - {step}")

print(f"\nEnsemble method: {ensemble_method}")

=== WINNING KERNEL STRATEGY ===
Number of cells: 69

Models used:
  - Cell 6: LGBM Pipeline
  - Cell 45: RandomForest Pipeline
  - Cell 48: LGBM Pipeline
  - Cell 52: LGBM Pipeline
  - Cell 55: XGB Pipeline
  - Cell 57: XGB Pipeline
  - Cell 60: CatBoost Pipeline
  - Cell 61: CatBoost Pipeline

Preprocessing steps:
  - Cell 6: MEstimateEncoder
  - Cell 36: Age rounding
  - Cell 45: MEstimateEncoder
  - Cell 48: MEstimateEncoder
  - Cell 55: MEstimateEncoder
  - Cell 57: MEstimateEncoder
  - Cell 60: MEstimateEncoder
  - Cell 61: MEstimateEncoder

Ensemble method: Cell 67: Weighted ensemble


In [4]:
# Extract the actual weights used in winning kernel
print("=== WINNING KERNEL ENSEMBLE WEIGHTS ===")

# Find the weights cell
for i, cell in enumerate(winning_kernel['cells']):
    if cell['cell_type'] == 'code':
        source = ''.join(cell['source'])
        if 'weights' in source and 'rfc_' in source:
            lines = source.split('\n')
            for line in lines:
                if 'weights' in line and '=' in line:
                    print(line)
                elif 'rfc_' in line or 'lgbm_' in line or 'xgb_' in line or 'cat_' in line:
                    print(line)
            break

print("\n=== KEY INSIGHT ===")
print("The winning kernel uses:")
print("- 4 different models (RandomForest, LGBM, XGBoost, CatBoost)")
print("- Different preprocessing for each model")
print("- Simple weighted averaging (not complex stacking)")
print("- Weights: rfc=0, lgbm=3, xgb=1, cat=0 (LGBM dominant)")
print("- 9-fold CV (more stable than our 5-fold)")

=== WINNING KERNEL ENSEMBLE WEIGHTS ===
weights = {"rfc_":0,
           "lgbm_":3,
           "xgb_":1,
           "cat_":0}
    tmp[f"{k}"] = (weights['rfc_']*tmp[f"rfc_{k}"] +
              weights['lgbm_']*tmp[f"lgbm_{k}"]+
              weights['xgb_']*tmp[f"xgb_{k}"]+
              weights['cat_']*tmp[f"cat_{k}"])    

=== KEY INSIGHT ===
The winning kernel uses:
- 4 different models (RandomForest, LGBM, XGBoost, CatBoost)
- Different preprocessing for each model
- Simple weighted averaging (not complex stacking)
- Weights: rfc=0, lgbm=3, xgb=1, cat=0 (LGBM dominant)
- 9-fold CV (more stable than our 5-fold)
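
The blend extracted above is a weighted sum of per-class probabilities followed by an argmax. A minimal, dependency-free sketch of that idea is shown below. The probability values are invented toy numbers, `blend_predict` is a hypothetical helper, and dividing by the weight total is our addition (the kernel snippet sums without normalizing, which does not change the argmax). Models with weight 0 are simply left out.

```python
def blend_predict(probas_by_model, weights, class_names):
    """Weighted average of per-model class probabilities, then argmax per row."""
    total = sum(weights[name] for name in probas_by_model)
    if total <= 0:
        raise ValueError("at least one model needs a positive weight")
    n_rows = len(next(iter(probas_by_model.values())))
    labels = []
    for row in range(n_rows):
        blended = []
        for k in range(len(class_names)):
            weighted_sum = sum(
                weights[name] * probas[row][k]
                for name, probas in probas_by_model.items()
            )
            blended.append(weighted_sum / total)
        labels.append(class_names[blended.index(max(blended))])
    return labels

class_names = ["Normal_Weight", "Overweight_Level_I", "Overweight_Level_II"]
weights = {"lgbm_": 3, "xgb_": 1}  # nonzero weights from the kernel output above

# Toy probabilities: one list per row, one value per class.
probas_by_model = {
    "lgbm_": [[0.6, 0.3, 0.1], [0.2, 0.45, 0.35]],
    "xgb_":  [[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],
}

blended_labels = blend_predict(probas_by_model, weights, class_names)
print(blended_labels)
```

In this toy case the first row follows LGBM even though XGBoost prefers another class, while on the second row a confident XGBoost prediction overrides a narrow LGBM preference despite the 3:1 weighting.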


## 3. Critical Differences Analysis

In [5]:
print("=== CRITICAL DIFFERENCES: US vs WINNING KERNEL ===")

print("\n1. MODEL DIVERSITY:")
print("  US: Only XGBoost")
print("  Winning: RandomForest + LGBM + XGBoost + CatBoost")
print("  Impact: HIGH - Ensemble diversity is key")

print("\n2. PREPROCESSING VARIETY:")
print("  US: Same preprocessing for all experiments")
print("  Winning: Different preprocessing per model (some get rounded features, some don't)")
print("  Impact: MEDIUM-HIGH - Captures different feature representations")

print("\n3. ENSEMBLING:")
print("  US: None (single model)")
print("  Winning: Weighted averaging with tuned weights")
print("  Impact: HIGH - Simple ensembles often beat complex single models")

print("\n4. HYPERPARAMETER TUNING:")
print("  US: Basic parameters or limited Optuna")
print("  Winning: Extensive Optuna tuning for each model")
print("  Impact: MEDIUM - Diminishing returns but still important")

print("\n5. CROSS-VALIDATION:")
print("  US: 5-fold")
print("  Winning: 9-fold (more stable estimates)")
print("  Impact: LOW-MEDIUM - More folds = more stable CV")

print("\n=== CONCLUSION ===")
print("The winning kernel's success comes from:")
print("1. Model diversity (4 different algorithms)")
print("2. Simple but effective ensembling")
print("3. Different preprocessing per model")
print("4. NOT just from MEstimateEncoder alone")

=== CRITICAL DIFFERENCES: US vs WINNING KERNEL ===

1. MODEL DIVERSITY:
  US: Only XGBoost
  Winning: RandomForest + LGBM + XGBoost + CatBoost
  Impact: HIGH - Ensemble diversity is key

2. PREPROCESSING VARIETY:
  US: Same preprocessing for all experiments
  Winning: Different preprocessing per model (some get rounded features, some don't)
  Impact: MEDIUM-HIGH - Captures different feature representations

3. ENSEMBLING:
  US: None (single model)
  Winning: Weighted averaging with tuned weights
  Impact: HIGH - Simple ensembles often beat complex single models

4. HYPERPARAMETER TUNING:
  US: Basic parameters or limited Optuna
  Winning: Extensive Optuna tuning for each model
  Impact: MEDIUM - Diminishing returns but still important

5. CROSS-VALIDATION:
  US: 5-fold
  Winning: 9-fold (more stable estimates)
  Impact: LOW-MEDIUM - More folds = more stable CV

=== CONCLUSION ===
The winning kernel's success comes from:
1. Model diversity (4 different algorithms)
2. Simple but effective ensembling
3. Different preprocessing per model
4. NOT just from MEstimateEncoder alone

## 4. Why MEstimateEncoder Hurt Our Performance

In [None]:
print("=== MESTIMATEENCODER ANALYSIS ===")

print("\nHypothesis 1: Overfitting to training data")
print("- MEstimateEncoder creates target-derived features")
print("- With our enhanced features (WHO_BMI_Categories), we're double-encoding target signal")
print("- This causes overfitting → worse generalization")

print("\nHypothesis 2: Implementation differences")
print("- Winning kernel uses MEstimateEncoder WITHOUT our enhanced features")
print("- They use it on raw data with different model architectures")
print("- Context matters - encoder alone doesn't guarantee improvement")

print("\nHypothesis 3: CV leakage")
print("- If MEstimateEncoder is fit outside CV folds, it leaks target information")
print("- Our implementation uses ColumnTransformer; this prevents leakage only if it sits inside a Pipeline that is re-fit on each training fold")
print("- But need to verify proper folding")

print("\nHypothesis 4: Diminishing returns")
print("- WHO_BMI_Categories alone already reaches 71.88% accuracy")
print("- Adding MEstimateEncoder on top provides redundant information")
print("- More features ≠ better performance if they're correlated")
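
Hypothesis 3 can be checked by confirming that the encoder only ever sees training-fold targets. The sketch below illustrates out-of-fold m-estimate encoding in plain Python, using the standard smoothing formula (category_sum + m * prior) / (category_count + m). It is an illustration under assumptions, not the `MEstimateEncoder` code used in the experiment: the target is simplified to binary, and the toy data, the interleaved fold assignment, and the helper names are all ours.

```python
from collections import defaultdict

def fit_m_estimate(categories, targets, m=1.0):
    """Learn smoothed per-category target means from the given rows only."""
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    mapping = {c: (sums[c] + m * prior) / (counts[c] + m) for c in counts}
    return mapping, prior

def out_of_fold_encode(categories, targets, n_folds=3, m=1.0):
    """Encode each row with statistics fitted on the other folds."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        valid_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        mapping, prior = fit_m_estimate(
            [categories[i] for i in train_idx],
            [targets[i] for i in train_idx],
            m,
        )
        for i in valid_idx:
            # Categories unseen in the training folds fall back to the prior.
            encoded[i] = mapping.get(categories[i], prior)
    return encoded

# Toy data: "rare" appears once, so its own target must not reach its encoding.
categories = ["a", "a", "b", "b", "a", "rare"]
targets = [1, 0, 1, 1, 0, 1]

leaky_mapping, _ = fit_m_estimate(categories, targets)  # fitted on ALL rows
oof = out_of_fold_encode(categories, targets)

print(f"leaky encoding of 'rare': {leaky_mapping['rare']:.3f}")
print(f"out-of-fold encoding of 'rare': {oof[5]:.3f}")
```

The gap between the two printed values is the target information that leaks when the encoder is fit once on the full training set instead of inside each fold.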

## 5. Recommendations for Next Experiment

In [None]:
print("=== RECOMMENDATIONS ===")

print("\nIMMEDIATE ACTIONS:")
print("1. SUBMIT exp_002 (exp_003_winning_kernel_approach) for LB calibration (urgent)")
print("   - Need to understand CV-LB gap")
print("   - Will reveal if our CV is optimistic/pessimistic")
print("   - Critical for diagnosing the decline")

print("\n2. FIX THE DECLINING SCORE TREND:")
print("   - Revert to exp_001 approach (enhanced features + OrdinalEncoder)")
print("   - That gave 0.9060, better than current 0.9052")
print("   - Build from stable baseline instead of declining one")

print("\n3. IMPLEMENT PROPER ENSEMBLING:")
print("   - Add LGBM model (winning kernel's top performer)")
print("   - Add CatBoost model (handles categoricals natively)")
print("   - Use simple weighted averaging like winning kernel")
print("   - Start with weights: lgbm=3, xgb=1 (based on winning kernel)")

print("\n4. MODEL-SPECIFIC PREPROCESSING:")
print("   - XGBoost: Use enhanced features (BMI, interactions)")
print("   - LGBM: Use raw features + MEstimateEncoder")
print("   - CatBoost: Use raw features (handles categoricals natively)")
print("   - This creates diversity like winning kernel")

print("\n5. INCREASE CV FOLDS:")
print("   - Move from 5-fold to 9-fold (like winning kernel)")
print("   - More stable CV estimates")
print("   - Better hyperparameter selection")

print("\n6. HYPERPARAMETER TUNING:")
print("   - Run Optuna for LGBM and CatBoost")
print("   - Use winning kernel's parameters as starting points")
print("   - Focus on key params: learning_rate, max_depth, subsample")
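
For recommendation 5, the fold assignment should keep class proportions roughly equal across all nine folds. In practice a library splitter such as scikit-learn's `StratifiedKFold` would normally be used; the dependency-free sketch below only illustrates the idea by dealing the rows of each class round-robin across folds. The labels are toy data and `stratified_fold_ids` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_fold_ids(labels, n_folds=9, seed=42):
    """Assign a fold id to every row so each class is spread evenly across folds."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        rows_by_class[label].append(idx)

    fold_ids = [0] * len(labels)
    offset = 0
    for label in sorted(rows_by_class):
        rows = rows_by_class[label]
        rng.shuffle(rows)  # randomize which rows of the class land in which fold
        for pos, idx in enumerate(rows):
            fold_ids[idx] = (pos + offset) % n_folds
        offset += len(rows)  # continue dealing where the previous class stopped
    return fold_ids

# Toy labels, NOT the competition data.
labels = (["Normal_Weight"] * 18
          + ["Obesity_Type_I"] * 27
          + ["Overweight_Level_I"] * 9)

fold_ids = stratified_fold_ids(labels, n_folds=9)
per_fold = Counter(zip(fold_ids, labels))
for fold in range(9):
    print(fold, [per_fold[(fold, c)] for c in sorted(set(labels))])
```

Because scores from 5-fold and 9-fold runs are computed on different splits, it is probably safest to re-score earlier experiments with the same fold assignment before comparing them.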