# Evolver Loop 13 Analysis

## Key Questions:
1. What is the CV-LB gap pattern across submissions?
2. What approaches have worked vs not worked?
3. What is the path to beating 0.01727?

In [1]:
import pandas as pd
import numpy as np

# Submission history analysis
submissions = [
    {'exp': 'exp_004', 'cv': 0.0623, 'lb': 0.0956, 'model': 'PerTarget HGB+ETR (LOO)'},
    {'exp': 'exp_006', 'cv': 0.0688, 'lb': 0.0991, 'model': 'PerTarget depth=5/7 (LOO)'},
    {'exp': 'exp_011', 'cv': 0.0844, 'lb': 'pending', 'model': 'MLP+GBDT GroupKFold'},
]

print('=== SUBMISSION HISTORY ===')
for s in submissions:
    if s['lb'] != 'pending':
        gap = (s['lb'] - s['cv']) / s['cv'] * 100
        print(f"{s['exp']}: CV={s['cv']:.4f} -> LB={s['lb']:.4f} (gap: +{gap:.1f}%)")
    else:
        print(f"{s['exp']}: CV={s['cv']:.4f} -> LB={s['lb']}")

print('\n=== KEY INSIGHT ===')
print('exp_004 (best CV 0.0623) had 53% CV-LB gap')
print('exp_006 (worse CV 0.0688) had 44% CV-LB gap')
print('More regularization made LB WORSE (0.0956 -> 0.0991)')
print('This suggests the problem is not traditional overfitting alone.')

=== SUBMISSION HISTORY ===
exp_004: CV=0.0623 -> LB=0.0956 (gap: +53.5%)
exp_006: CV=0.0688 -> LB=0.0991 (gap: +44.0%)
exp_011: CV=0.0844 -> LB=pending

=== KEY INSIGHT ===
exp_004 (best CV 0.0623) had 53% CV-LB gap
exp_006 (worse CV 0.0688) had 44% CV-LB gap
More regularization made LB WORSE (0.0956 -> 0.0991)
This suggests the problem is not traditional overfitting alone.
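
The two scored submissions allow a rough check of how LB tracks CV. The sketch below draws a straight line through the two (CV, LB) pairs listed in the submission history and extrapolates it to CV = 0. With only two points there is no error estimate, so this is an illustration of the reasoning rather than a prediction; `linear_fit_two_points` is a hypothetical helper introduced here.

```python
# Rough sketch: line through the two scored (CV, LB) points.
# Two points give no error estimate, so treat this as illustrative only.
scored = [(0.0623, 0.0956), (0.0688, 0.0991)]  # (cv, lb) for exp_004, exp_006
TARGET_LB = 0.01727

def linear_fit_two_points(p1, p2):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

slope, intercept = linear_fit_two_points(*scored)
print(f"LB ~= {slope:.3f} * CV + {intercept:.4f}")
# The intercept is the extrapolated LB at CV = 0 under this linear assumption.
print(f"Extrapolated LB at CV=0: {intercept:.4f} (target: {TARGET_LB})")
```

Under this linear assumption the intercept (roughly 0.062) sits well above the target, which would be consistent with the view that lowering CV alone is unlikely to reach 0.01727.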


In [2]:
# Analyze all experiments
experiments = [
    {'id': 'exp_000', 'cv': 0.0814, 'model': 'Baseline ensemble'},
    {'id': 'exp_001', 'cv': 0.0810, 'model': 'Template compliant'},
    {'id': 'exp_002', 'cv': 0.0805, 'model': 'Simple RF'},
    {'id': 'exp_003', 'cv': 0.0813, 'model': 'PerTarget (with TTA)'},
    {'id': 'exp_004', 'cv': 0.0623, 'model': 'PerTarget NO TTA', 'lb': 0.0956},
    {'id': 'exp_005', 'cv': 0.0896, 'model': 'Ridge baseline'},
    {'id': 'exp_006', 'cv': 0.0688, 'model': 'Intermediate reg', 'lb': 0.0991},
    {'id': 'exp_007', 'cv': 0.0721, 'model': 'Gaussian Process'},
    {'id': 'exp_008', 'cv': 0.0673, 'model': 'Diverse ensemble'},
    {'id': 'exp_009', 'cv': 0.0669, 'model': 'MLP+GBDT ensemble'},
    {'id': 'exp_010', 'cv': 0.0841, 'model': 'GroupKFold MLP+GBDT'},
    {'id': 'exp_011', 'cv': 0.0844, 'model': 'GroupKFold template'},
    {'id': 'exp_012', 'cv': 0.0827, 'model': 'LOO MLP+GBDT'},
]

print('=== EXPERIMENT RANKING BY CV ===')
sorted_exp = sorted(experiments, key=lambda x: x['cv'])
for e in sorted_exp:
    # Append the LB score only for experiments that were actually submitted
    lb_str = f" -> LB={e['lb']}" if 'lb' in e else ''
    print(f"{e['id']}: CV={e['cv']:.4f} - {e['model']}{lb_str}")

=== EXPERIMENT RANKING BY CV ===
exp_004: CV=0.0623 - PerTarget NO TTA -> LB=0.0956
exp_009: CV=0.0669 - MLP+GBDT ensemble
exp_008: CV=0.0673 - Diverse ensemble
exp_006: CV=0.0688 - Intermediate reg -> LB=0.0991
exp_007: CV=0.0721 - Gaussian Process
exp_002: CV=0.0805 - Simple RF
exp_001: CV=0.0810 - Template compliant
exp_003: CV=0.0813 - PerTarget (with TTA)
exp_000: CV=0.0814 - Baseline ensemble
exp_012: CV=0.0827 - LOO MLP+GBDT
exp_010: CV=0.0841 - GroupKFold MLP+GBDT
exp_011: CV=0.0844 - GroupKFold template
exp_005: CV=0.0896 - Ridge baseline


In [3]:
# Critical analysis: What's working?
print('=== WHAT WORKS ===')
print('1. Per-target models (HGB for SM, ETR for Products) - best CV 0.0623')
print('2. NO TTA - TTA hurts mixed solvent performance')
print('3. LOO validation is REQUIRED for submission')
print('')
print('=== WHAT DOESN\'T WORK ===')
print('1. More regularization - made LB worse')
print('2. GroupKFold in template cells - breaks submission')
print('3. TTA - hurts performance')
print('4. Simple models (Ridge) - worse CV')
print('')
print('=== GAP ANALYSIS ===')
target_lb = 0.01727
best_lb = 0.0956
print(f'Target: {target_lb}')
print(f'Best LB: {best_lb}')
print(f'Gap: best LB is {best_lb / target_lb:.1f}x the target')
print('')
print('This is a HUGE gap. We need fundamentally different approaches.')

=== WHAT WORKS ===
1. Per-target models (HGB for SM, ETR for Products) - best CV 0.0623
2. NO TTA - TTA hurts mixed solvent performance
3. LOO validation is REQUIRED for submission

=== WHAT DOESN'T WORK ===
1. More regularization - made LB worse
2. GroupKFold in template cells - breaks submission
3. TTA - hurts performance
4. Simple models (Ridge) - worse CV

=== GAP ANALYSIS ===
Target: 0.01727
Best LB: 0.0956
Gap: best LB is 5.5x the target

This is a HUGE gap. We need fundamentally different approaches.


In [4]:
# Top kernel analysis
print('=== TOP KERNEL (lishellliang) ANALYSIS ===')
print('Key features:')
print('1. Uses GroupKFold (5-fold) internally - OVERWRITES utility functions')
print('2. MLP + XGBoost + RF + LightGBM ensemble')
print('3. Optuna hyperparameter optimization (commented out)')
print('4. Spange descriptors only')
print('5. Weights: learned via Optuna')
print('')
print('CRITICAL INSIGHT:')
print('The top kernel OVERWRITES generate_leave_one_out_splits to use GroupKFold!')
print('This means their submission.csv has 5 folds, not 24.')
print('Yet they got a valid LB score.')
print('')
print('POSSIBLE EXPLANATIONS:')
print('1. The evaluation metric may accept different fold counts')
print('2. Or the kernel was submitted before metric was finalized')
print('3. Or there\'s a different version that uses LOO')

=== TOP KERNEL (lishellliang) ANALYSIS ===
Key features:
1. Uses GroupKFold (5-fold) internally - OVERWRITES utility functions
2. MLP + XGBoost + RF + LightGBM ensemble
3. Optuna hyperparameter optimization (commented out)
4. Spange descriptors only
5. Weights: learned via Optuna

CRITICAL INSIGHT:
The top kernel OVERWRITES generate_leave_one_out_splits to use GroupKFold!
This means their submission.csv has 5 folds, not 24.
Yet they got a valid LB score.

POSSIBLE EXPLANATIONS:
1. The evaluation metric may accept different fold counts
2. Or the kernel was submitted before metric was finalized
3. Or there's a different version that uses LOO
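
To make the fold-count difference concrete, here is a minimal pure-Python sketch of the two split schemes. It is not the competition's `generate_leave_one_out_splits` and not scikit-learn's `GroupKFold` (which balances fold sizes); the round-robin assignment, the function names, and the 24 synthetic solvent groups are assumptions chosen to match the "5 folds, not 24" observation.

```python
# Pure-Python sketch of the two split schemes; simplified on purpose.
def leave_one_group_out(groups):
    """Yield (train_idx, test_idx), holding out one distinct group per fold."""
    for held_out in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield train, test

def group_k_fold(groups, n_splits=5):
    """Yield (train_idx, test_idx) with whole groups assigned round-robin to folds."""
    fold_of = {g: k % n_splits for k, g in enumerate(sorted(set(groups)))}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test

# Hypothetical data: 24 solvent groups with 3 rows each.
groups = [f"solvent_{s:02d}" for s in range(24) for _ in range(3)]
n_loo = len(list(leave_one_group_out(groups)))
n_gkf = len(list(group_k_fold(groups, n_splits=5)))
print(f"LOO folds: {n_loo}, GroupKFold folds: {n_gkf}")
```

Both schemes keep every row of a solvent on the same side of the split; they differ only in how many solvents are held out per fold, and therefore in how many folds end up in the submission file.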


In [5]:
# Strategy analysis
print('=== STRATEGIC OPTIONS ===')
print('')
print('OPTION 1: Optuna optimization with LOO')
print('- Implement Optuna to find optimal hyperparameters')
print('- Use LOO for final submission')
print('- Risk: May not close the 5.5x gap')
print('')
print('OPTION 2: Better feature engineering')
print('- Focus on features that generalize to unseen solvents')
print('- Chemical properties: polarity, dielectric constant, H-bonding')
print('- Risk: May not have enough domain knowledge')
print('')
print('OPTION 3: Per-target + MLP hybrid')
print('- Use best of both approaches')
print('- Per-target models achieved best CV (0.0623)')
print('- MLP may capture non-linear patterns')
print('')
print('OPTION 4: Submit exp_012 to verify MLP+GBDT LB')
print('- exp_012 has LOO with correct fold structure')
print('- CV 0.0827 is worse than exp_004 (0.0623)')
print('- But may have smaller CV-LB gap')

=== STRATEGIC OPTIONS ===

OPTION 1: Optuna optimization with LOO
- Implement Optuna to find optimal hyperparameters
- Use LOO for final submission
- Risk: May not close the 5.5x gap

OPTION 2: Better feature engineering
- Focus on features that generalize to unseen solvents
- Chemical properties: polarity, dielectric constant, H-bonding
- Risk: May not have enough domain knowledge

OPTION 3: Per-target + MLP hybrid
- Use best of both approaches
- Per-target models achieved best CV (0.0623)
- MLP may capture non-linear patterns

OPTION 4: Submit exp_012 to verify MLP+GBDT LB
- exp_012 has LOO with correct fold structure
- CV 0.0827 is worse than exp_004 (0.0623)
- But may have smaller CV-LB gap
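
One way the per-target + MLP hybrid in Option 3 could be prototyped is a weighted blend tuned on out-of-fold predictions. The sketch below is a minimal illustration under stated assumptions: the error metric is taken to be mean absolute error (the competition metric is not specified here), `best_blend_weight` is a hypothetical helper, and every number in the arrays is a made-up placeholder rather than a result from any experiment.

```python
# Sketch: choose a blend weight between two models on out-of-fold predictions.
# All numbers below are made-up placeholders, not experiment results.
def mae(y_true, y_pred):
    """Mean absolute error (assumed metric, for illustration only)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def best_blend_weight(y_true, pred_a, pred_b, steps=20):
    """Grid-search w in [0, 1] for the blend w * pred_a + (1 - w) * pred_b."""
    best_w, best_err = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        blended = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
        err = mae(y_true, blended)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

y_true = [0.10, 0.40, 0.75, 0.20]
pred_per_target = [0.15, 0.35, 0.70, 0.30]  # placeholder per-target predictions
pred_mlp = [0.05, 0.50, 0.80, 0.15]         # placeholder MLP predictions
w, err = best_blend_weight(y_true, pred_per_target, pred_mlp)
print(f"weight on per-target model: {w:.2f}, blended MAE: {err:.4f}")
```

Because the grid includes w = 0 and w = 1, the blend can never score worse than the better single model on the data used for tuning. A weight tuned on the same folds used to report CV will look optimistic, so it would need checking on held-out solvents.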


In [6]:
# Submission strategy
print('=== SUBMISSION STRATEGY ===')
print('Remaining submissions: 3')
print('')
print('Current pending: exp_011 (GroupKFold) - may fail, since GroupKFold in template cells breaks submission')
print('')
print('Recommended next steps:')
print('1. Wait for exp_011 LB result')
print('2. If exp_011 fails, submit exp_012 (LOO MLP+GBDT)')
print('3. Implement Optuna optimization')
print('4. Try per-target + MLP hybrid')
print('')
print('KEY INSIGHT:')
print('The target (0.01727) is 5.5x lower than our best LB.')
print('This suggests we need a fundamentally different approach.')
print('Possible directions:')
print('- GNN with learned solvent encodings (arxiv:2512.19530)')
print('- Transfer learning from larger chemical datasets')
print('- Better chemical features for unseen solvents')

=== SUBMISSION STRATEGY ===
Remaining submissions: 3

Current pending: exp_011 (GroupKFold) - may fail, since GroupKFold in template cells breaks submission

Recommended next steps:
1. Wait for exp_011 LB result
2. If exp_011 fails, submit exp_012 (LOO MLP+GBDT)
3. Implement Optuna optimization
4. Try per-target + MLP hybrid

KEY INSIGHT:
The target (0.01727) is 5.5x lower than our best LB.
This suggests we need a fundamentally different approach.
Possible directions:
- GNN with learned solvent encodings (arxiv:2512.19530)
- Transfer learning from larger chemical datasets
- Better chemical features for unseen solvents
