# Evolver Loop 24 Analysis

## Critical Situation Assessment

**Status:**
- Best LB: 0.0956 (exp_004, exp_016)
- Target: 0.01727 (best LB is about 5.5x higher)
- Submissions remaining: 1
- 24 experiments completed

**Key Question:** What is the path to reaching the target?

In [6]:
import pandas as pd
import numpy as np
import json

# Load session state
with open('/home/code/session_state.json', 'r') as f:
    state = json.load(f)

# Analyze all experiments
experiments = state['experiments']
print(f"Total experiments: {len(experiments)}")
print("\n=== SUBMISSION HISTORY ===")
for sub in state['submissions']:
    print(f"  {sub['experiment_id']}: CV={sub['cv_score']:.4f}, LB={sub['lb_score']}")

print("\n=== CV-LB GAP ANALYSIS ===")
for sub in state['submissions']:
    lb = sub['lb_score']
    # Skip submissions whose LB score is missing or still pending (non-numeric)
    if isinstance(lb, (int, float)):
        # Relative gap: how much worse the LB score is than the CV score, in percent
        gap = (lb - sub['cv_score']) / sub['cv_score'] * 100
        print(f"  {sub['experiment_id']}: CV={sub['cv_score']:.4f} -> LB={lb:.4f} (gap: {gap:.1f}%)")

Total experiments: 24

=== SUBMISSION HISTORY ===
  exp_004: CV=0.0623, LB=0.09558
  exp_006: CV=0.0688, LB=0.0991
  exp_011: CV=0.0844, LB=
  exp_016: CV=0.0623, LB=0.09558
  exp_021: CV=0.0901, LB=0.12314

=== CV-LB GAP ANALYSIS ===
  exp_004: CV=0.0623 -> LB=0.0956 (gap: 53.5%)
  exp_006: CV=0.0688 -> LB=0.0991 (gap: 43.9%)
  exp_016: CV=0.0623 -> LB=0.0956 (gap: 53.4%)
  exp_021: CV=0.0901 -> LB=0.1231 (gap: 36.7%)


In [7]:
# Analyze what approaches have been tried
print("=== APPROACHES TRIED ===")
for exp in experiments:
    print(f"{exp['id']}: {exp['model_type']} -> CV={exp['score']:.4f}")
    
print("\n=== BEST EXPERIMENTS ===")
sorted_exp = sorted(experiments, key=lambda x: x['score'])
for exp in sorted_exp[:5]:
    print(f"  {exp['id']}: {exp['model_type']} -> CV={exp['score']:.4f}")
    print(f"    Notes: {exp['notes'][:200]}...")


=== APPROACHES TRIED ===
exp_000: ensemble (MLP+XGB+LGB+RF) -> CV=0.0814
exp_001: ensemble (MLP+XGB+LGB+RF) -> CV=0.0810
exp_002: RandomForest -> CV=0.0805
exp_003: PerTarget (HGB+ETR) -> CV=0.0813
exp_004: PerTarget (HGB+ETR) NO TTA -> CV=0.0623
exp_005: Ridge (alpha=10.0) -> CV=0.0896
exp_006: PerTarget (HGB+ETR) depth=5/7 -> CV=0.0688
exp_007: GaussianProcess (Matern) -> CV=0.0721
exp_008: Ensemble (PerTarget+RF+XGB+LGB) -> CV=0.0673
exp_009: MLP + XGBoost + RF + LightGBM Ensemble -> CV=0.0669
exp_010: MLP + XGBoost + RF + LightGBM Ensemble with GroupKFold -> CV=0.0841
exp_011: MLP + XGBoost + RF + LightGBM Ensemble with GroupKFold -> CV=0.0844
exp_012: MLP + XGBoost + RF + LightGBM Ensemble with LOO -> CV=0.0827
exp_013: Optuna-optimized HGB + ExtraTrees (Per-Target) -> CV=0.0834
exp_014: MLP + HGB + ETR Hybrid with Optuna Weights -> CV=0.0891
exp_015: Hybrid HGB + ETR + MLP (Task-Specific) -> CV=0.0830
exp_016: HGB + ETR Per-Target with PREDICTION Combination -> CV=0.0623
exp_017:

In [8]:
# Key insight: the CV-LB gap ranges from ~37% to ~54% across scored submissions
# (~53% for our best model). Assuming a ~50% gap, LB 0.01727 requires CV ~0.0115.
# Our best CV is 0.0623, so CV would need to drop by ~82%.

print("=== TARGET ANALYSIS ===")
target_lb = 0.01727
best_cv = 0.0623
best_lb = 0.0956

print(f"Target LB: {target_lb}")
print(f"Best CV: {best_cv}")
print(f"Best LB: {best_lb}")
print(f"\nCV-LB gap: {(best_lb - best_cv) / best_cv * 100:.1f}%")
print(f"\nTo reach target LB {target_lb}:")
print(f"  If gap is 50%: need CV = {target_lb / 1.5:.4f}")
print(f"  If gap is 53%: need CV = {target_lb / 1.53:.4f}")
print(f"  Current best CV: {best_cv}")
print(f"  Improvement needed: {(best_cv - target_lb/1.5) / best_cv * 100:.1f}%")


=== TARGET ANALYSIS ===
Target LB: 0.01727
Best CV: 0.0623
Best LB: 0.0956

CV-LB gap: 53.5%

To reach target LB 0.01727:
  If gap is 50%: need CV = 0.0115
  If gap is 53%: need CV = 0.0113
  Current best CV: 0.0623
  Improvement needed: 81.5%
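
The required-CV estimate above assumes the gap is multiplicative (LB ≈ 1.5 × CV). As a cross-check, the minimal sketch below fits a straight line LB ≈ slope × CV + intercept to the four scored submissions from the submission history. The helper name `fit_line` is illustrative, and with only three distinct points the fit is weak evidence at best.

```python
# (CV, LB) pairs copied from the submission history; exp_011 has no LB and is skipped.
scored = {
    "exp_004": (0.0623, 0.09558),
    "exp_006": (0.0688, 0.0991),
    "exp_016": (0.0623, 0.09558),
    "exp_021": (0.0901, 0.12314),
}


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept (hypothetical helper)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(list(scored.values()))
target_lb = 0.01727

print(f"LB ~= {slope:.3f} * CV + {intercept:.4f}")
# CV the fitted line would require for the target LB (negative means unreachable under this fit).
print(f"CV implied by target LB: {(target_lb - intercept) / slope:.4f}")
```

On these four points the fitted intercept comes out above the target LB, which would mean the gap behaves more like an additive offset than a fixed ratio. If that holds, lowering CV alone may not be enough, which is consistent with the view that a different kind of model is needed.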


In [9]:
# The key question: Is the target achievable?
# Let's look at what the top kernels/solutions are doing

print("=== RESEARCH FINDINGS ===")
for finding in state.get('data_findings', []):
    finding_text = finding.get('finding', '') if isinstance(finding, dict) else str(finding)
    if 'GNN' in finding_text or 'graph' in finding_text.lower() or 'transfer' in finding_text.lower():
        print(f"\n{finding_text[:500]}...")

=== RESEARCH FINDINGS ===

Key strategies to reduce overfitting for unseen molecules: 1) OOD-aware validation (scaffold-based or k-fold n-step forward), 2) Strong regularization + early stopping, 3) Ensemble of diverse learners, 4) Transfer learning from large chemical datasets, 5) Data augmentation with chemical variations, 6) Uncertainty quantification to prune low-confidence predictions....

BREAKTHROUGH: Paper arxiv:2512.19530 achieves MSE 0.0039 (MAE ~0.062) on Catechol benchmark using GNN with GAT + DRFP + learned mixture-aware solvent encodings. This is 25x better than tabular ensembles (MSE 0.099). Key: explicit molecular graph message-passing and continuous mixture encoding are essential for generalization....

WEB RESEARCH: For OOD molecular property prediction, best approaches are: 1) GNN pre-training + transfer learning (8x improvement), 2) Task-similarity-driven source selection (PGM/MoTSE), 3) Chemical-knowledge-aware regularization (MolRuleLoss: 2-33% RMSE reduction), 4)

In [10]:
# Critical insight from web research:
# - GNN-based approach achieves MSE 0.0039 (~MAE 0.062) on Catechol benchmark
# - This is 25x better than tabular ensembles (MSE 0.099)
# - Key components: Graph Attention Networks, pre-training, transfer learning

# BUT: We are constrained by the template structure!
# The template requires specific model interface and LOO validation

print("=== CONSTRAINTS ===")
print("1. Template compliance: Last 3 cells must be unchanged")
print("2. LOO validation: 24 folds for single, 13 folds for full")
print("3. Model interface: train_model(X, Y), predict(X)")
print("4. No external data or pre-trained models")

print("\n=== WHAT CAN WE CHANGE? ===")
print("1. Model architecture (the model = XXX line)")
print("2. Feature engineering (before template cells)")
print("3. Hyperparameters")
print("4. Ensemble strategies")


=== CONSTRAINTS ===
1. Template compliance: Last 3 cells must be unchanged
2. LOO validation: 24 folds for single, 13 folds for full
3. Model interface: train_model(X, Y), predict(X)
4. No external data or pre-trained models

=== WHAT CAN WE CHANGE? ===
1. Model architecture (the model = XXX line)
2. Feature engineering (before template cells)
3. Hyperparameters
4. Ensemble strategies
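
To make constraint 3 concrete, here is a minimal sketch of a per-target wrapper that exposes `train_model(X, Y)` and `predict(X)`. The class names `PerTargetModel` and `MeanRegressor` are hypothetical, and the mean predictor is only a stand-in for real estimators such as HGB or ETR. The actual template interface may differ in details not shown in this notebook.

```python
class MeanRegressor:
    """Stand-in estimator (assumption): predicts the training mean of its target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class PerTargetModel:
    """Hypothetical wrapper: one independent estimator per target column."""

    def __init__(self, factories):
        # factories: one zero-argument callable per target, each returning
        # an object with fit(X, y) and predict(X).
        self.factories = factories
        self.models = []

    def train_model(self, X, Y):
        # Y is a list of rows; column j holds the values of target j.
        self.models = [
            make().fit(X, [row[j] for row in Y])
            for j, make in enumerate(self.factories)
        ]

    def predict(self, X):
        per_target = [m.predict(X) for m in self.models]
        # Transpose back to one row of predictions per input sample.
        return [list(row) for row in zip(*per_target)]


# Toy data (assumption): 3 samples, 1 feature, 3 targets.
X_train = [[0.1], [0.2], [0.3]]
Y_train = [[0.0, 1.0, 2.0], [1.0, 1.0, 4.0], [2.0, 1.0, 6.0]]

model = PerTargetModel([MeanRegressor, MeanRegressor, MeanRegressor])
model.train_model(X_train, Y_train)
preds = model.predict([[0.5], [0.6]])
print(preds)
```

Passing a different factory per target is one way to express a setup like "HGB for SM, ETR for Products" without touching the template cells.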


In [11]:
# Let's analyze what the lishellliang kernel does differently
# They redefine the split functions BEFORE the template cells
# This is template-compliant because the last 3 cells are unchanged

# BUT: The submission format changes!
# - LOO: 24 folds for single, 13 folds for full
# - GroupKFold(5): 5 folds for single, 5 folds for full

# This is why exp_011/012 failed - the submission format was different

print("=== LISHELLLIANG KERNEL ANALYSIS ===")
print("Key insight: They redefine split functions to use GroupKFold(5)")
print("This gives more realistic CV estimates (closer to LB)")
print("")
print("BUT: The submission format changes!")
print("  LOO: 24 folds (single) + 13 folds (full) = 37 total folds")
print("  GroupKFold(5): 5 folds (single) + 5 folds (full) = 10 total folds")
print("")
print("This is why exp_011/012 failed - different fold structure")


=== LISHELLLIANG KERNEL ANALYSIS ===
Key insight: They redefine split functions to use GroupKFold(5)
This gives more realistic CV estimates (closer to LB)

BUT: The submission format changes!
  LOO: 24 folds (single) + 13 folds (full) = 37 total folds
  GroupKFold(5): 5 folds (single) + 5 folds (full) = 10 total folds

This is why exp_011/012 failed - different fold structure
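
The fold-count difference can be illustrated with a small pure-Python sketch. It is not the kernel's code: the function names and the toy solvent labels are assumptions, and the round-robin assignment in `group_k_fold` is a simplification of what a library implementation would do.

```python
def leave_one_group_out(groups):
    """Yield (train_idx, test_idx), holding out one distinct group per fold."""
    for held_out in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield train, test


def group_k_fold(groups, n_splits=5):
    """Yield (train_idx, test_idx) with whole groups assigned round-robin to folds.

    Simplified stand-in: a library implementation may balance fold sizes differently.
    """
    fold_of = {g: k % n_splits for k, g in enumerate(sorted(set(groups)))}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


# Toy data (assumption): 24 solvents with 3 rows each; names are placeholders.
groups = [f"solvent_{s:02d}" for s in range(24) for _ in range(3)]

loo_folds = list(leave_one_group_out(groups))
gkf_folds = list(group_k_fold(groups, n_splits=5))
print(len(loo_folds), len(gkf_folds))  # 24 folds vs 5 folds
```

Both schemes keep every solvent entirely in train or entirely in test within a fold. They differ in how many solvents are held out at once, and therefore in how many `fold` values end up in the submission file.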


In [12]:
# Let's check the actual submission format requirements
import pandas as pd

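# Load the no-TTA per-target submission (best LB). The folder is numbered 005,
# but this model is listed as exp_004 in the experiment list above.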
exp005_sub = pd.read_csv('/home/code/experiments/005_no_tta_per_target/submission.csv')
print(f"exp_005 submission shape: {exp005_sub.shape}")
print(f"Columns: {exp005_sub.columns.tolist()}")
print("\nTask 0 (single solvent):")
print(f"  Folds: {exp005_sub[exp005_sub['task']==0]['fold'].unique()}")
print(f"  Total rows: {len(exp005_sub[exp005_sub['task']==0])}")
print("\nTask 1 (full data):")
print(f"  Folds: {exp005_sub[exp005_sub['task']==1]['fold'].unique()}")
print(f"  Total rows: {len(exp005_sub[exp005_sub['task']==1])}")


exp_005 submission shape: (1883, 8)
Columns: ['id', 'index', 'task', 'fold', 'row', 'target_1', 'target_2', 'target_3']

Task 0 (single solvent):
  Folds: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
  Total rows: 656

Task 1 (full data):
  Folds: [ 0  1  2  3  4  5  6  7  8  9 10 11 12]
  Total rows: 1227
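
With one submission left, it may be worth checking the fold layout of a candidate file before uploading it. The sketch below is a hedged example: `check_fold_structure` and `EXPECTED_FOLDS` are hypothetical names, the expected layout is taken from the output above, and only the `task` and `fold` columns are inspected.

```python
import csv
import io

# Fold layout taken from the output above: task 0 uses folds 0-23, task 1 uses folds 0-12.
EXPECTED_FOLDS = {0: set(range(24)), 1: set(range(13))}


def check_fold_structure(rows, expected=EXPECTED_FOLDS):
    """Return a list of problems found; an empty list means the layout matches."""
    seen = {}
    for row in rows:
        seen.setdefault(int(row["task"]), set()).add(int(row["fold"]))
    problems = []
    for task, folds in expected.items():
        found = seen.get(task, set())
        if found != folds:
            problems.append(
                f"task {task}: missing folds {sorted(folds - found)}, "
                f"unexpected folds {sorted(found - folds)}"
            )
    for task in sorted(set(seen) - set(expected)):
        problems.append(f"unexpected task {task}")
    return problems


# Toy CSV (assumption): a GroupKFold(5)-style file with only folds 0-4 per task.
toy_csv = "task,fold\n" + "".join(f"{t},{f}\n" for t in (0, 1) for f in range(5))
problems = check_fold_structure(csv.DictReader(io.StringIO(toy_csv)))
for p in problems:
    print(p)
```

The same function can be run on a real submission by passing a `csv.DictReader` over the opened file. It does not check row counts or target columns.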


In [13]:
# The submission format is FIXED by the template
# We CANNOT change the fold structure without breaking the submission

# So what CAN we do?
# 1. Better model architecture (within the template constraints)
# 2. Better feature engineering
# 3. Better hyperparameters
# 4. Better ensemble strategies

# The key insight: We need to improve generalization to UNSEEN solvents
# This is fundamentally a transfer learning / domain adaptation problem

print("=== STRATEGIC OPTIONS ===")
print("")
print("Option 1: Better features")
print("  - We've tried: Spange, DRFP, ACS_PCA, Morgan fingerprints")
print("  - Morgan fingerprints HURT performance (exp_024)")
print("  - Best: Spange + Arrhenius kinetics")
print("")
print("Option 2: Better model architecture")
print("  - We've tried: MLP, XGBoost, LightGBM, RF, HGB, ETR, GP")
print("  - Best: HGB for SM, ETR for Products (exp_004)")
print("  - GNN would be ideal but requires graph structure")
print("")
print("Option 3: Better ensemble")
print("  - We've tried: Various weighted ensembles")
print("  - Stacking didn't help (exp_023)")
print("  - Similarity-weighted didn't help (exp_022)")
print("")
print("Option 4: Regularization")
print("  - We've tried: Various regularization levels")
print("  - exp_021 with strong regularization got LB 0.1231 (WORSE)")


=== STRATEGIC OPTIONS ===

Option 1: Better features
  - We've tried: Spange, DRFP, ACS_PCA, Morgan fingerprints
  - Morgan fingerprints HURT performance (exp_024)
  - Best: Spange + Arrhenius kinetics

Option 2: Better model architecture
  - We've tried: MLP, XGBoost, LightGBM, RF, HGB, ETR, GP
  - Best: HGB for SM, ETR for Products (exp_004)
  - GNN would be ideal but requires graph structure

Option 3: Better ensemble
  - We've tried: Various weighted ensembles
  - Stacking didn't help (exp_023)
  - Similarity-weighted didn't help (exp_022)

Option 4: Regularization
  - We've tried: Various regularization levels
  - exp_021 with strong regularization got LB 0.1231 (WORSE)
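
This notebook does not show how the "Arrhenius kinetics" features are built. One common construction, assumed here purely for illustration, uses inverse absolute temperature, log reaction time, and their product, motivated by the Arrhenius rate law k = A·exp(−Ea / (R·T)). The function name, feature names, and input units (°C, minutes) are all assumptions.

```python
import math


def arrhenius_features(temp_celsius, time_minutes):
    """Hypothetical feature builder motivated by k = A * exp(-Ea / (R * T)).

    Assumes temperature in degrees Celsius and a strictly positive time in minutes.
    """
    # 1000 / T[K]: ln(k) is linear in 1/T, and the factor 1000 keeps values of order 1.
    inv_temp = 1000.0 / (temp_celsius + 273.15)
    # For simple kinetics, conversion depends on k * t, so ln(t) adds to ln(k).
    log_time = math.log(time_minutes)
    return {
        "inv_temp": inv_temp,
        "log_time": log_time,
        "inv_temp_x_log_time": inv_temp * log_time,
    }


# Example inputs are arbitrary and not taken from the dataset.
feats = arrhenius_features(temp_celsius=175.0, time_minutes=2.0)
print(feats)
```

Features of this kind depend only on reaction conditions, not on the solvent, so they would be combined with solvent descriptors such as Spange rather than replace them.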


In [14]:
# The fundamental problem:
# - Test set has COMPLETELY DIFFERENT solvents
# - Our models memorize training solvents, don't generalize
# - CV-LB gap is ~50% because LOO still has similar solvents in train

# The target (0.01727) is 5.5x better than our best LB (0.0956)
# This suggests the winning solution uses a fundamentally different approach

# Possible approaches we HAVEN'T tried:
# 1. Pre-trained molecular embeddings (ChemBERTa, MolBERT)
# 2. Physics-informed features (quantum chemical descriptors)
# 3. Meta-learning for few-shot adaptation
# 4. Domain adaptation techniques

print("=== UNTRIED APPROACHES ===")
print("")
print("1. Pre-trained molecular embeddings")
print("   - ChemBERTa, MolBERT, etc.")
print("   - These capture chemical knowledge that may generalize better")
print("   - BUT: Requires downloading pre-trained models")
print("")
print("2. Physics-informed features")
print("   - Quantum chemical descriptors (HOMO, LUMO, etc.)")
print("   - Solubility parameters, Hansen parameters")
print("   - BUT: May not be available in the data")
print("")
print("3. Meta-learning")
print("   - MAML, Prototypical Networks")
print("   - Learn to adapt quickly to new solvents")
print("   - BUT: Complex to implement within template constraints")
print("")
print("4. Domain adaptation")
print("   - Adversarial training to reduce domain shift")
print("   - BUT: Requires knowing test distribution")


=== UNTRIED APPROACHES ===

1. Pre-trained molecular embeddings
   - ChemBERTa, MolBERT, etc.
   - These capture chemical knowledge that may generalize better
   - BUT: Requires downloading pre-trained models

2. Physics-informed features
   - Quantum chemical descriptors (HOMO, LUMO, etc.)
   - Solubility parameters, Hansen parameters
   - BUT: May not be available in the data

3. Meta-learning
   - MAML, Prototypical Networks
   - Learn to adapt quickly to new solvents
   - BUT: Complex to implement within template constraints

4. Domain adaptation
   - Adversarial training to reduce domain shift
   - BUT: Requires knowing test distribution


In [15]:
# CRITICAL REALIZATION:
# The target (0.01727) is the TOP of the leaderboard
# This means SOMEONE has achieved it
# They must be using a fundamentally different approach

# Looking at the competition structure:
# - 211 teams
# - Top score is 0.01727
# - Our best is 0.0956 (5.5x worse)

# The winning solution likely uses:
# 1. GNN with molecular graphs (as suggested by research)
# 2. Pre-trained embeddings
# 3. Transfer learning from larger datasets
# 4. Or some clever trick we haven't discovered

print("=== FINAL ASSESSMENT ===")
print("")
print("With 1 submission remaining, we have two options:")
print("")
print("Option A: Submit our best model (exp_004/exp_016)")
print("  - Already submitted, LB = 0.0956")
print("  - No improvement expected")
print("")
print("Option B: Try a fundamentally different approach")
print("  - GNN (exp_020 failed due to implementation issues)")
print("  - Pre-trained embeddings (not yet tried)")
print("  - Meta-learning (complex to implement)")
print("")
print("RECOMMENDATION:")
print("  Given the constraints and remaining time, we should:")
print("  1. NOT submit exp_024 (CV 0.0881 is worse than exp_004)")
print("  2. Try to implement a GNN or pre-trained embedding approach")
print("  3. If that fails, submit our best model (exp_004)")


=== FINAL ASSESSMENT ===

With 1 submission remaining, we have two options:

Option A: Submit our best model (exp_004/exp_016)
  - Already submitted, LB = 0.0956
  - No improvement expected

Option B: Try a fundamentally different approach
  - GNN (exp_020 failed due to implementation issues)
  - Pre-trained embeddings (not yet tried)
  - Meta-learning (complex to implement)

RECOMMENDATION:
  Given the constraints and remaining time, we should:
  1. NOT submit exp_024 (CV 0.0881 is worse than exp_004)
  2. Try to implement a GNN or pre-trained embedding approach
  3. If that fails, submit our best model (exp_004)
