# Evolver Loop 12 Analysis

## Current Situation
- **Best CV (LOO)**: 0.0623 (exp_004/005 - PerTarget HGB+ETR)
- **Best LB**: 0.0956 (exp_004)
- **Latest CV (GroupKFold)**: 0.0844 (exp_012)
- **Target**: 0.01727

## Key Questions
1. Should we submit exp_012 to verify GroupKFold CV-LB correlation?
2. What approaches haven't been tried yet?
3. What's the gap between our best and the target?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Analyze experiment history
experiments = [
    {'name': 'exp_004 (PerTarget)', 'cv': 0.0623, 'lb': 0.0956, 'cv_type': 'LOO'},
    {'name': 'exp_006 (Ridge)', 'cv': 0.0688, 'lb': 0.0991, 'cv_type': 'LOO'},
    {'name': 'exp_012 (GroupKFold)', 'cv': 0.0844, 'lb': None, 'cv_type': 'GroupKFold'},
]

print("=== CV-LB Gap Analysis ===")
for exp in experiments:
    if exp['lb'] is not None:  # explicit None check; a truthiness test would also skip an LB of 0.0
        gap = (exp['lb'] - exp['cv']) / exp['cv'] * 100
        print(f"{exp['name']}: CV={exp['cv']:.4f}, LB={exp['lb']:.4f}, Gap={gap:.1f}%")
    else:
        print(f"{exp['name']}: CV={exp['cv']:.4f}, LB=Not submitted")

print("\n=== Key Insight ===")
print("LOO CV gives ~53% gap (0.0623 -> 0.0956)")
print("GroupKFold CV (0.0844) should be closer to LB")
print("Expected LB for exp_012: ~0.09-0.10 (if GroupKFold works)")
print("\nTarget: 0.01727 (5.5x lower than current best LB)")
print("This is a HUGE gap - need fundamentally different approach")

=== CV-LB Gap Analysis ===
exp_004 (PerTarget): CV=0.0623, LB=0.0956, Gap=53.5%
exp_006 (Ridge): CV=0.0688, LB=0.0991, Gap=44.0%
exp_012 (GroupKFold): CV=0.0844, LB=Not submitted

=== Key Insight ===
LOO CV gives ~53% gap (0.0623 -> 0.0956)
GroupKFold CV (0.0844) should be closer to LB
Expected LB for exp_012: ~0.09-0.10 (if GroupKFold works)

Target: 0.01727 (5.5x lower than current best LB)
This is a HUGE gap - need fundamentally different approach
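
The expectation that GroupKFold CV sits closer to the LB rests on grouped splits holding out whole groups, so every validation row comes from a group the model never saw in training. The sketch below illustrates that property only. It assumes the groups are solvents, uses made-up solvent labels, and uses a simplified pure-Python stand-in for `sklearn.model_selection.GroupKFold` (groups go largest-first into the currently smallest fold), so fold contents may differ from what scikit-learn would produce.

```python
from collections import defaultdict

def group_kfold_indices(groups, n_splits):
    """Yield (train_idx, val_idx) pairs in which no group appears on both sides.

    Simplified stand-in for sklearn's GroupKFold, for illustration only.
    """
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    if n_splits > len(members):
        raise ValueError("n_splits cannot exceed the number of distinct groups")

    # Assign each group (largest first) to the fold that is currently smallest.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        min(folds, key=len).extend(members[g])

    for fold in folds:
        val = sorted(fold)
        val_set = set(val)
        train = [i for i in range(len(groups)) if i not in val_set]
        yield train, val

# Hypothetical solvent label per training row (not from the competition data).
solvents = ["water", "water", "ethanol", "ethanol", "ethanol",
            "dmso", "dmso", "thf", "thf", "toluene"]

splits = list(group_kfold_indices(solvents, n_splits=5))
for train, val in splits:
    held_out = {solvents[i] for i in val}
    seen = {solvents[i] for i in train}
    print(sorted(held_out), "| shared with train:", sorted(held_out & seen))
```

Leave-one-out on the same rows would validate on a single `"ethanol"` row while training on the other two, which is the kind of leakage a grouped split avoids.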


In [2]:
# Analyze what approaches have been tried
approaches = {
    'Models': [
        ('MLP', 'Tried - exp_010, exp_011, exp_012'),
        ('XGBoost', 'Tried - in ensembles'),
        ('LightGBM', 'Tried - in ensembles'),
        ('RandomForest', 'Tried - exp_003, ensembles'),
        ('HGB', 'Tried - exp_004, exp_005 (best single solvent)'),
        ('ExtraTrees', 'Tried - exp_004, exp_005 (best products)'),
        ('Ridge', 'Tried - exp_006'),
        ('GaussianProcess', 'Tried - exp_008'),
    ],
    'Features': [
        ('Spange descriptors', 'Tried - most experiments'),
        ('ACS_PCA', 'Tried - exp_004, exp_007'),
        ('DRFP', 'Tried - exp_007'),
        ('Combined features', 'Tried - exp_004, exp_007'),
        ('Arrhenius kinetics', 'Tried - exp_001'),
    ],
    'Validation': [
        ('Leave-One-Out', 'Tried - exp_001-010'),
        ('GroupKFold (5-fold)', 'Tried - exp_011, exp_012'),
    ],
    'Ensembles': [
        ('MLP + GBDT', 'Tried - exp_010, exp_011, exp_012'),
        ('PerTarget + RF + XGB + LGB', 'Tried - exp_009'),
        ('Fixed weights', 'Tried - all ensembles'),
        ('Optuna weights', 'NOT TRIED - top kernel uses this!'),
    ],
}

print("=== Approaches Analysis ===")
for category, items in approaches.items():
    print(f"\n{category}:")
    for item, status in items:
        # Match on the prefix so 'NOT TRIED' is never counted as tried, whatever its casing.
        marker = '✅' if status.startswith('Tried') else '❌'
        print(f"  {marker} {item}: {status}")

print("\n=== CRITICAL GAP ===")
print("Optuna hyperparameter optimization has NOT been tried!")
print("Top kernel uses Optuna to optimize:")
print("  - lr (1e-4 to 1e-2)")
print("  - dropout (0.1-0.5)")
print("  - hidden_dims")
print("  - xgb_depth (3-8)")
print("  - rf_depth (5-15)")
print("  - lgb_leaves (15-63)")
print("  - ensemble weights (normalized)")
print("\nThis could be the key to closing the gap!")

=== Approaches Analysis ===

Models:
  ✅ MLP: Tried - exp_010, exp_011, exp_012
  ✅ XGBoost: Tried - in ensembles
  ✅ LightGBM: Tried - in ensembles
  ✅ RandomForest: Tried - exp_003, ensembles
  ✅ HGB: Tried - exp_004, exp_005 (best single solvent)
  ✅ ExtraTrees: Tried - exp_004, exp_005 (best products)
  ✅ Ridge: Tried - exp_006
  ✅ GaussianProcess: Tried - exp_008

Features:
  ✅ Spange descriptors: Tried - most experiments
  ✅ ACS_PCA: Tried - exp_004, exp_007
  ✅ DRFP: Tried - exp_007
  ✅ Combined features: Tried - exp_004, exp_007
  ✅ Arrhenius kinetics: Tried - exp_001

Validation:
  ✅ Leave-One-Out: Tried - exp_001-010
  ✅ GroupKFold (5-fold): Tried - exp_011, exp_012

Ensembles:
  ✅ MLP + GBDT: Tried - exp_010, exp_011, exp_012
  ✅ PerTarget + RF + XGB + LGB: Tried - exp_009
  ✅ Fixed weights: Tried - all ensembles
  ❌ Optuna weights: NOT TRIED - top kernel uses this!

=== CRITICAL GAP ===
Optuna hyperparameter optimization has NOT been tried!
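Top kernel uses Optuna to optimize:
  - lr (1e-4 to 1e-2)
  - dropout (0.1-0.5)
  - hidden_dims
  - xgb_depth (3-8)
  - rf_depth (5-15)
  - lgb_leaves (15-63)
  - ensemble weights (normalized)

This could be the key to closing the gap!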
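
To make the search space above concrete, here is a minimal sketch that samples configurations from those ranges. It is a plain random search in the standard library, not Optuna and not the top kernel's code. In Optuna the same draws would typically go through `trial.suggest_float`, `trial.suggest_int`, and `trial.suggest_categorical`. The `hidden_dims` candidates, the log-uniform draw for `lr`, the weight-sampling range, and `toy_objective` are all assumptions made for illustration.

```python
import random

# Hypothetical candidates: the notes list hidden_dims without giving its choices.
HIDDEN_DIM_CHOICES = [[128, 64, 32], [256, 128, 64], [64, 32]]

def sample_params(rng):
    """Draw one configuration from the ranges listed in the notes."""
    raw = [rng.uniform(0.05, 1.0) for _ in range(4)]  # MLP, XGB, RF, LGB
    total = sum(raw)
    return {
        "lr": 10 ** rng.uniform(-4, -2),       # log-uniform draw is an assumption
        "dropout": rng.uniform(0.1, 0.5),
        "hidden_dims": rng.choice(HIDDEN_DIM_CHOICES),
        "xgb_depth": rng.randint(3, 8),        # randint bounds are inclusive
        "rf_depth": rng.randint(5, 15),
        "lgb_leaves": rng.randint(15, 63),
        "weights": [w / total for w in raw],   # normalized to sum to 1
    }

def random_search(objective, n_trials=50, seed=0):
    """Return the lowest-scoring configuration among n_trials random draws."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_objective(params):
    # Placeholder only. A real objective would train the ensemble and
    # return its GroupKFold CV error for these parameters.
    return abs(params["dropout"] - 0.3) + abs(params["xgb_depth"] - 6) / 10

best_params, best_score = random_search(toy_objective, n_trials=50, seed=0)
print(best_params)
print(best_score)
```

Swapping `toy_objective` for a function that runs the grouped CV is the only change needed to use this against real models. Optuna's samplers would normally explore the space more efficiently than uniform random draws.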

In [3]:
# Analyze the target gap
target = 0.01727
best_lb = 0.0956
best_cv_loo = 0.0623
best_cv_gkf = 0.0844

print("=== Target Gap Analysis ===")
print(f"Target: {target}")
print(f"Best LB: {best_lb} ({best_lb/target:.1f}x from target)")
print(f"Best CV (LOO): {best_cv_loo} ({best_cv_loo/target:.1f}x from target)")
print(f"Best CV (GroupKFold): {best_cv_gkf} ({best_cv_gkf/target:.1f}x from target)")

print("\n=== What would it take to reach target? ===")
print(f"Need to reduce LB from {best_lb:.4f} to {target:.4f}")
print(f"That's an {(best_lb - target)/best_lb * 100:.1f}% reduction")
print("\nPossible paths:")
print("1. Optuna hyperparameter optimization (top kernel approach)")
print("2. Better feature engineering for unseen solvents")
print("3. Per-target models with Optuna tuning")
print("4. Different model architectures (transformers, GNNs)")
print("5. Domain-specific features (chemical properties)")

print("\n=== Submission Strategy ===")
print("We have 3 submissions remaining.")
print("Option 1: Submit exp_012 to verify GroupKFold CV-LB correlation")
print("Option 2: Implement Optuna first, then submit")
print("\nRecommendation: Submit exp_012 first to calibrate, then iterate")

=== Target Gap Analysis ===
Target: 0.01727
Best LB: 0.0956 (5.5x from target)
Best CV (LOO): 0.0623 (3.6x from target)
Best CV (GroupKFold): 0.0844 (4.9x from target)

=== What would it take to reach target? ===
Need to reduce LB from 0.0956 to 0.0173
That's an 81.9% reduction

Possible paths:
1. Optuna hyperparameter optimization (top kernel approach)
2. Better feature engineering for unseen solvents
3. Per-target models with Optuna tuning
4. Different model architectures (transformers, GNNs)
5. Domain-specific features (chemical properties)

=== Submission Strategy ===
We have 3 submissions remaining.
Option 1: Submit exp_012 to verify GroupKFold CV-LB correlation
Option 2: Implement Optuna first, then submit

Recommendation: Submit exp_012 first to calibrate, then iterate
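
What "calibrate" means here can be sketched with the two LOO experiments that already have LB scores: compute LB/CV per experiment, then ask what CV score would be needed to hit the target if that ratio kept holding. Two data points are a very thin basis, and the ratio for GroupKFold is unknown until exp_012 is submitted, so treat this as arithmetic under a stated assumption rather than a forecast.

```python
TARGET = 0.01727

# (CV, LB) pairs for the LOO experiments listed in the first cell.
loo_pairs = {
    "exp_004": (0.0623, 0.0956),
    "exp_006": (0.0688, 0.0991),
}

def lb_over_cv(cv, lb):
    """How many times larger the leaderboard score is than the CV score."""
    return lb / cv

def required_cv(target_lb, ratio):
    """CV score needed to reach target_lb, assuming LB stays at ratio * CV."""
    return target_lb / ratio

ratios = {name: lb_over_cv(cv, lb) for name, (cv, lb) in loo_pairs.items()}
mean_ratio = sum(ratios.values()) / len(ratios)

for name, ratio in ratios.items():
    print(f"{name}: LB/CV = {ratio:.3f}")
print(f"Mean LOO ratio: {mean_ratio:.3f}")
print(f"LOO CV needed for target: {required_cv(TARGET, mean_ratio):.4f}")
```

Once exp_012 has an LB score, the same two functions give the GroupKFold ratio and the GroupKFold CV that later experiments would need to reach.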


In [4]:
# Analyze top kernel architecture more closely
print("=== Top Kernel (lishellliang) Architecture ===")
print("\n1. Validation: GroupKFold (5-fold) - SAME as exp_012")
print("\n2. Model: EnsembleModel")
print("   - MLP: [128, 64, 32] hidden dims, BatchNorm, ReLU, Dropout")
print("   - XGBoost: n_estimators=100, max_depth=6")
print("   - RandomForest: n_estimators=100, max_depth=10")
print("   - LightGBM: n_estimators=100, num_leaves=31")
print("   - Weights: [0.4, 0.2, 0.2, 0.2] (default)")
print("\n3. Features: Spange descriptors only")
print("\n4. Key difference: Optuna optimization (commented out in kernel)")
print("   - Optimizes: lr, dropout, hidden_dims, xgb_depth, rf_depth, lgb_leaves, weights")
print("   - n_trials=50")
print("\n5. Our exp_012 differences:")
print("   - MLP: 100 epochs (vs their 50 in Optuna)")
print("   - XGBoost: n_estimators=300 (vs their 100)")
print("   - RandomForest: n_estimators=300, max_depth=15 (vs their 100, 10)")
print("   - LightGBM: n_estimators=300 (vs their 100)")
print("\nOur models are LARGER but not necessarily BETTER")
print("Optuna could find better hyperparameters with smaller models")

=== Top Kernel (lishellliang) Architecture ===

1. Validation: GroupKFold (5-fold) - SAME as exp_012

2. Model: EnsembleModel
   - MLP: [128, 64, 32] hidden dims, BatchNorm, ReLU, Dropout
   - XGBoost: n_estimators=100, max_depth=6
   - RandomForest: n_estimators=100, max_depth=10
   - LightGBM: n_estimators=100, num_leaves=31
   - Weights: [0.4, 0.2, 0.2, 0.2] (default)

3. Features: Spange descriptors only

4. Key difference: Optuna optimization (commented out in kernel)
   - Optimizes: lr, dropout, hidden_dims, xgb_depth, rf_depth, lgb_leaves, weights
   - n_trials=50

5. Our exp_012 differences:
   - MLP: 100 epochs (vs their 50 in Optuna)
   - XGBoost: n_estimators=300 (vs their 100)
   - RandomForest: n_estimators=300, max_depth=15 (vs their 100, 10)
   - LightGBM: n_estimators=300 (vs their 100)

Our models are LARGER but not necessarily BETTER
Optuna could find better hyperparameters with smaller models
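
The weighted ensemble described above combines the four models' predictions as a weighted average. A minimal sketch of that blend, using the default weights from the notes, follows. The prediction values are made up for illustration, and `blend` is a hypothetical helper, not a function from the kernel.

```python
def blend(predictions, weights):
    """Weighted average of per-model predictions.

    predictions: one list of predicted values per model, all the same length.
    weights: one non-negative weight per model, normalized here to sum to 1.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]
    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("all models must predict the same number of rows")
    return [sum(w * p[i] for w, p in zip(norm, predictions)) for i in range(n_rows)]

# Made-up predictions for three rows, in the order MLP, XGBoost, RandomForest, LightGBM.
model_preds = [
    [0.50, 0.20, 0.80],
    [0.40, 0.30, 0.70],
    [0.60, 0.10, 0.90],
    [0.50, 0.20, 0.80],
]
default_weights = [0.4, 0.2, 0.2, 0.2]

blended = blend(model_preds, default_weights)
print(blended)
```

Normalizing inside `blend` means a tuner can propose unconstrained positive weights and the blend still behaves as a convex combination.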


In [5]:
# Final recommendation
print("="*70)
print("STRATEGIC RECOMMENDATION")
print("="*70)

print("\n1. SUBMIT exp_012 NOW")
print("   - Verify GroupKFold CV-LB correlation")
print("   - Expected LB: ~0.09-0.10")
print("   - This calibrates our CV estimates")

print("\n2. NEXT EXPERIMENT: Optuna Optimization")
print("   - Implement Optuna like top kernel")
print("   - Optimize: lr, dropout, hidden_dims, depths, weights")
print("   - Use GroupKFold for validation")
print("   - This is the KEY MISSING PIECE")

print("\n3. ALTERNATIVE: Per-Target with Optuna")
print("   - exp_004/005 showed per-target works well")
print("   - Combine with Optuna for each target")
print("   - HGB for SM, ETR for Products")

print("\n4. FEATURE ENGINEERING")
print("   - Try different feature combinations with Optuna")
print("   - DRFP + Spange, ACS_PCA + Spange")
print("   - Let Optuna find optimal combination")

print("\n" + "="*70)
print("IMMEDIATE ACTION: Submit exp_012 to verify CV-LB correlation")
print("="*70)

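======================================================================
STRATEGIC RECOMMENDATION
======================================================================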

1. SUBMIT exp_012 NOW
   - Verify GroupKFold CV-LB correlation
   - Expected LB: ~0.09-0.10
   - This calibrates our CV estimates

2. NEXT EXPERIMENT: Optuna Optimization
   - Implement Optuna like top kernel
   - Optimize: lr, dropout, hidden_dims, depths, weights
   - Use GroupKFold for validation
   - This is the KEY MISSING PIECE

3. ALTERNATIVE: Per-Target with Optuna
   - exp_004/005 showed per-target works well
   - Combine with Optuna for each target
   - HGB for SM, ETR for Products

4. FEATURE ENGINEERING
   - Try different feature combinations with Optuna
   - DRFP + Spange, ACS_PCA + Spange
   - Let Optuna find optimal combination

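======================================================================
IMMEDIATE ACTION: Submit exp_012 to verify CV-LB correlation
======================================================================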
