# Loop 9 Analysis: Understanding the CV-LB Gap and Top Kernel Approaches

## Key Questions:
1. Why did the diverse ensemble (exp_008, `009_diverse`) have a worse CV score than the best per-target model (exp_004, `005_no_tta`)?
2. What does the top kernel (lishellliang) do differently?
3. What is the optimal strategy going forward?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load experiment results
experiments = [
    ('exp_000', '001_baseline', 0.081393, None),
    ('exp_001', '002_template', 0.081044, None),
    ('exp_002', '003_rf', 0.08053, None),
    ('exp_003', '004_per_target', 0.08126, None),
    ('exp_004', '005_no_tta', 0.062265, 0.0956),  # Submitted
    ('exp_005', '006_ridge', 0.08964, None),
    ('exp_006', '007_intermediate', 0.068848, 0.0991),  # Submitted
    ('exp_007', '008_gp', 0.072118, None),
    ('exp_008', '009_diverse', 0.067268, None),
]

df = pd.DataFrame(experiments, columns=['exp_id', 'name', 'cv_score', 'lb_score'])
print("Experiment Summary:")
print(df.to_string(index=False))
print(f"\nBest CV: {df['cv_score'].min():.4f} ({df.loc[df['cv_score'].idxmin(), 'name']})")
print(f"Best LB: {df['lb_score'].min():.4f} ({df.loc[df['lb_score'].idxmin(), 'name']})")

Experiment Summary:
 exp_id             name  cv_score  lb_score
exp_000     001_baseline  0.081393       NaN
exp_001     002_template  0.081044       NaN
exp_002           003_rf  0.080530       NaN
exp_003   004_per_target  0.081260       NaN
exp_004       005_no_tta  0.062265    0.0956
exp_005        006_ridge  0.089640       NaN
exp_006 007_intermediate  0.068848    0.0991
exp_007           008_gp  0.072118       NaN
exp_008      009_diverse  0.067268       NaN

Best CV: 0.0623 (005_no_tta)
Best LB: 0.0956 (005_no_tta)


In [2]:
# Analyze CV-LB gap
submitted = df[df['lb_score'].notna()]
print("\nSubmitted Experiments:")
for _, row in submitted.iterrows():
    gap = (row['lb_score'] - row['cv_score']) / row['cv_score'] * 100
    print(f"{row['name']}: CV={row['cv_score']:.4f} -> LB={row['lb_score']:.4f} (Gap: +{gap:.1f}%)")

# Key observation: exp_006 (more regularization) scored worse on the LB than exp_004.
# Its relative CV-LB gap did shrink (53.5% -> 43.9%), so two submissions cannot settle the question.
print("\n" + "="*60)
print("CRITICAL INSIGHT:")
print("exp_006 (more regularization) had WORSE LB than exp_004")
print("This argues AGAINST overfitting as the main cause of the gap.")
print("We likely need BETTER features/models, not simpler ones.")
print("="*60)


Submitted Experiments:
005_no_tta: CV=0.0623 -> LB=0.0956 (Gap: +53.5%)
007_intermediate: CV=0.0688 -> LB=0.0991 (Gap: +43.9%)

CRITICAL INSIGHT:
exp_006 (more regularization) had WORSE LB than exp_004
This argues AGAINST overfitting as the main cause of the gap.
We likely need BETTER features/models, not simpler ones.
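
The relative gap alone can hide how the absolute gap moves, so it may help to look at both side by side. A minimal sketch, reusing only the two submitted rows from the summary table; the column names `abs_gap` and `rel_gap_pct` are introduced here for illustration:

```python
import pandas as pd

# The two submitted experiments, copied from the summary table above.
submitted = pd.DataFrame(
    {
        "name": ["005_no_tta", "007_intermediate"],
        "cv_score": [0.062265, 0.068848],
        "lb_score": [0.0956, 0.0991],
    }
)

# Absolute gap (LB - CV) and the same gap relative to CV, computed column-wise.
submitted["abs_gap"] = submitted["lb_score"] - submitted["cv_score"]
submitted["rel_gap_pct"] = 100 * submitted["abs_gap"] / submitted["cv_score"]

print(submitted.round(4).to_string(index=False))
```

On these numbers the absolute gap is also smaller for `007_intermediate`, even though its LB score is worse. With only two submissions this is a weak signal either way.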


In [3]:
# Analyze top kernel approach (lishellliang)
print("\nTop Kernel (lishellliang) Key Differences:")
print("="*60)
print("1. Uses GroupKFold (5-fold) instead of Leave-One-Out")
print("   - This gives more realistic CV estimates")
print("   - Each fold has ~20% of solvents held out")
print("   - Less training data per fold (~80% vs ~96% for LOO)")
print("")
print("2. Uses MLP + XGBoost + RF + LightGBM ensemble")
print("   - Weighted averaging with learned weights (Optuna)")
print("   - MLP captures non-linear patterns")
print("   - GBDT models capture different patterns")
print("")
print("3. Uses Spange descriptors only (not combined features)")
print("   - Simpler feature set may generalize better")
print("")
print("4. Uses Optuna for hyperparameter tuning")
print("   - Learns optimal weights for ensemble")
print("   - Tunes MLP architecture, GBDT depths, etc.")


Top Kernel (lishellliang) Key Differences:
1. Uses GroupKFold (5-fold) instead of Leave-One-Out
   - This gives more realistic CV estimates
   - Each fold has ~20% of solvents held out
   - Less training data per fold (~80% vs ~96% for LOO)

2. Uses MLP + XGBoost + RF + LightGBM ensemble
   - Weighted averaging with learned weights (Optuna)
   - MLP captures non-linear patterns
   - GBDT models capture different patterns

3. Uses Spange descriptors only (not combined features)
   - Simpler feature set may generalize better

4. Uses Optuna for hyperparameter tuning
   - Learns optimal weights for ensemble
   - Tunes MLP architecture, GBDT depths, etc.
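
The training-set share under the two CV schemes can be checked directly with scikit-learn's splitters. The sketch below is illustrative only: it assumes 24 solvent groups of equal size (10 synthetic rows each), which is not the real data layout, and the helper `mean_train_fraction` is a name made up for this example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Assumption: 24 solvent groups with 10 rows each (synthetic, equal sizes).
n_groups, rows_per_group = 24, 10
groups = np.repeat(np.arange(n_groups), rows_per_group)
X = np.zeros((groups.size, 1))  # feature values do not affect the split


def mean_train_fraction(splitter):
    """Average share of rows on the training side across all splits."""
    fractions = [
        len(train_idx) / groups.size
        for train_idx, _ in splitter.split(X, groups=groups)
    ]
    return float(np.mean(fractions))


gkf_fraction = mean_train_fraction(GroupKFold(n_splits=5))
logo_fraction = mean_train_fraction(LeaveOneGroupOut())

print(f"GroupKFold(5) mean train fraction:    {gkf_fraction:.3f}")
print(f"LeaveOneGroupOut mean train fraction: {logo_fraction:.3f}")
```

With equal-sized groups the averages work out to 4/5 and 23/24, so GroupKFold trains on less data per fold while holding out several unseen solvents at once.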


In [4]:
# Compare our approach vs top kernel
print("\nOur Approach vs Top Kernel:")
print("="*60)
print("| Aspect              | Our Approach           | Top Kernel             |")
print("|---------------------|------------------------|------------------------|")
print("| CV Scheme           | Leave-One-Out (24/13)  | GroupKFold (5-fold)    |")
print("| Base Models         | HGB+ETR (per-target)   | MLP+XGB+RF+LGB         |")
print("| Ensemble Weights    | Fixed [0.4,0.2,0.2,0.2]| Learned (Optuna)       |")
print("| Features            | Spange+ACS_PCA+DRFP    | Spange only            |")
print("| Best CV             | 0.0623                 | Unknown                |")
print("| Best LB             | 0.0956                 | ~0.08-0.09 (estimated) |")


Our Approach vs Top Kernel:
| Aspect              | Our Approach           | Top Kernel             |
|---------------------|------------------------|------------------------|
| CV Scheme           | Leave-One-Out (24/13)  | GroupKFold (5-fold)    |
| Base Models         | HGB+ETR (per-target)   | MLP+XGB+RF+LGB         |
| Ensemble Weights    | Fixed [0.4,0.2,0.2,0.2]| Learned (Optuna)       |
| Features            | Spange+ACS_PCA+DRFP    | Spange only            |
| Best CV             | 0.0623                 | Unknown                |
| Best LB             | 0.0956                 | ~0.08-0.09 (estimated) |


In [5]:
# What should we try next?
print("\nRecommended Next Steps:")
print("="*60)
print("")
print("PRIORITY 1: Include MLP in ensemble")
print("- Top kernel uses MLP as a key component")
print("- MLP can capture non-linear patterns that trees miss")
print("- Our current ensemble doesn't have MLP")
print("")
print("PRIORITY 2: Learn ensemble weights with Optuna")
print("- Fixed weights [0.4,0.2,0.2,0.2] may be suboptimal")
print("- Optuna can find optimal weights for each model")
print("")
print("PRIORITY 3: Try GroupKFold validation")
print("- May give more realistic CV estimates")
print("- Better correlation with LB")
print("")
print("PRIORITY 4: Simplify features")
print("- Top kernel uses Spange only")
print("- Combined features may cause overfitting")


Recommended Next Steps:

PRIORITY 1: Include MLP in ensemble
- Top kernel uses MLP as a key component
- MLP can capture non-linear patterns that trees miss
- Our current ensemble doesn't have MLP

PRIORITY 2: Learn ensemble weights with Optuna
- Fixed weights [0.4,0.2,0.2,0.2] may be suboptimal
- Optuna can find optimal weights for each model

PRIORITY 3: Try GroupKFold validation
- May give more realistic CV estimates
- Better correlation with LB

PRIORITY 4: Simplify features
- Top kernel uses Spange only
- Combined features may cause overfitting
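
To make Priority 2 concrete, here is a rough sketch of learning blend weights from out-of-fold (OOF) predictions. It deliberately uses plain random search over the weight simplex as a dependency-free stand-in for Optuna, and everything in it is assumed for illustration: the synthetic targets, the four noisy "models", the MAE metric, and the helper name `search_weights`.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error (placeholder metric for this sketch)."""
    return float(np.mean(np.abs(y_true - y_pred)))


def search_weights(oof_preds, y_true, baseline, n_trials=2000, seed=0):
    """Random search over non-negative weights that sum to 1.

    Starts from `baseline`, so the result is never worse than it on the OOF data.
    """
    rng = np.random.default_rng(seed)
    best_w = np.asarray(baseline, dtype=float)
    best_score = mae(y_true, oof_preds @ best_w)
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(oof_preds.shape[1]))  # sample on the simplex
        score = mae(y_true, oof_preds @ w)
        if score < best_score:
            best_w, best_score = w, score
    return best_w, best_score


# Synthetic OOF predictions: four models with increasing noise levels.
rng = np.random.default_rng(42)
y = rng.uniform(0, 1, size=200)
noise_scales = [0.05, 0.10, 0.15, 0.20]
oof = np.column_stack([y + rng.normal(0, s, size=y.size) for s in noise_scales])

fixed = [0.4, 0.2, 0.2, 0.2]  # the fixed weights currently in use
fixed_score = mae(y, oof @ np.array(fixed))
learned_w, learned_score = search_weights(oof, y, baseline=fixed)

print(f"fixed weights   MAE: {fixed_score:.4f}")
print(f"learned weights MAE: {learned_score:.4f}  w={np.round(learned_w, 3)}")
```

Weights tuned this way are fit to the CV folds themselves, so the improvement in CV is optimistic; a nested split or a held-out set of solvents would give a fairer estimate.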


In [6]:
# Gap analysis
print("\nGap to Target Analysis:")
print("="*60)
target = 0.01727
best_cv = 0.0623
best_lb = 0.0956

print(f"Target: {target}")
print(f"Best CV: {best_cv} ({best_cv/target:.1f}x target)")
print(f"Best LB: {best_lb} ({best_lb/target:.1f}x target)")
print()
print("To reach target from best LB:")
print(f"  Need {(best_lb - target)/best_lb * 100:.1f}% improvement")
print(f"  That's {best_lb/target:.1f}x better than current")
print()
print("This is a MASSIVE gap. The target suggests:")
print("1. There's domain knowledge we're missing")
print("2. There's a fundamentally different approach")
print("3. The test set may have different characteristics")


Gap to Target Analysis:
Target: 0.01727
Best CV: 0.0623 (3.6x target)
Best LB: 0.0956 (5.5x target)

To reach target from best LB:
  Need 81.9% improvement
  That's 5.5x better than current

This is a MASSIVE gap. The target suggests:
1. There's domain knowledge we're missing
2. There's a fundamentally different approach
3. The test set may have different characteristics


In [7]:
# Final recommendation
print("\n" + "="*60)
print("FINAL RECOMMENDATION FOR NEXT EXPERIMENT")
print("="*60)
print("")
print("Experiment 010: MLP + GBDT Ensemble with Learned Weights")
print("")
print("Key changes from 009_diverse (exp_008):")
print("1. ADD MLP to ensemble (like top kernel)")
print("2. Use Optuna to learn optimal weights")
print("3. Try Spange-only features (simpler may generalize better)")
print("4. Keep Leave-One-Out CV for now (more rigorous)")
print("")
print("Expected outcome:")
print("- CV may be similar or slightly worse")
print("- LB may improve due to MLP capturing different patterns")
print("- Learned weights should optimize ensemble performance")


FINAL RECOMMENDATION FOR NEXT EXPERIMENT

Experiment 010: MLP + GBDT Ensemble with Learned Weights

Key changes from 009_diverse (exp_008):
1. ADD MLP to ensemble (like top kernel)
2. Use Optuna to learn optimal weights
3. Try Spange-only features (simpler may generalize better)
4. Keep Leave-One-Out CV for now (more rigorous)

Expected outcome:
- CV may be similar or slightly worse
- LB may improve due to MLP capturing different patterns
- Learned weights should optimize ensemble performance
