# Evolver Loop 11 Analysis

## Key Issues to Address
1. **Template Compliance**: exp_011 violated the submission template format (its predictions lacked the required 'row' column).
2. **CV-LB Gap Analysis**: GroupKFold CV is 0.0841; the expected LB is ~0.094-0.101, assuming a 12-20% CV-LB gap.
3. **Target Gap**: The current best LB is 0.0956 against a target of 0.01727 (a 5.5x gap).

In [1]:
import pandas as pd
import numpy as np

# Analyze submission history
submissions = [
    {'exp': 'exp_004', 'cv': 0.0623, 'lb': 0.0956, 'model': 'PerTarget (HGB+ETR) NO TTA', 'validation': 'LOO'},
    {'exp': 'exp_006', 'cv': 0.0688, 'lb': 0.0991, 'model': 'PerTarget (HGB+ETR) depth=5/7', 'validation': 'LOO'},
]

df = pd.DataFrame(submissions)
# Relative CV-LB gap, as a percentage of the CV score
df['cv_lb_gap'] = (df['lb'] - df['cv']) / df['cv'] * 100
print("=== SUBMISSION HISTORY ===")
print(df.to_string(index=False))
print(f"\nAverage CV-LB gap: {df['cv_lb_gap'].mean():.1f}%")

=== SUBMISSION HISTORY ===
    exp     cv     lb                         model validation  cv_lb_gap
exp_004 0.0623 0.0956    PerTarget (HGB+ETR) NO TTA        LOO  53.451043
exp_006 0.0688 0.0991 PerTarget (HGB+ETR) depth=5/7        LOO  44.040698

Average CV-LB gap: 48.7%


In [2]:
# Analyze exp_011 GroupKFold results
print("=== EXP_011 GROUPKFOLD RESULTS ===")
print("Single Solvent CV: 0.0733")
print("Full Data CV: 0.0899")
print("Combined CV: 0.0841")
print("\nComparison:")
print("- Best LOO CV (exp_004): 0.0623")
print("- Best LB (exp_004): 0.0956")
print("- LOO CV-LB gap: 53%")
print("\n- GroupKFold CV (exp_011): 0.0841")
print("- Expected LB with 12% gap: ~0.094")
print("- Expected LB with 20% gap: ~0.101")
print("\nGroupKFold likely gives more realistic CV estimates (pending LB verification).")

=== EXP_011 GROUPKFOLD RESULTS ===
Single Solvent CV: 0.0733
Full Data CV: 0.0899
Combined CV: 0.0841

Comparison:
- Best LOO CV (exp_004): 0.0623
- Best LB (exp_004): 0.0956
- LOO CV-LB gap: 53%

- GroupKFold CV (exp_011): 0.0841
- Expected LB with 12% gap: ~0.094
- Expected LB with 20% gap: ~0.101

GroupKFold likely gives more realistic CV estimates (pending LB verification).
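
The projected LB values above follow from simple arithmetic: LB is approximately CV times one plus the relative gap. The sketch below reproduces those numbers. The helper names `expected_lb` and `relative_gap` are illustrative, and the 12% and 20% gaps are the assumed values from the cell above, not measured ones.

```python
def relative_gap(cv: float, lb: float) -> float:
    """Relative CV-LB gap as a fraction of the CV score."""
    return (lb - cv) / cv


def expected_lb(cv: float, gap: float) -> float:
    """Project an LB score from a CV score, assuming LB = CV * (1 + gap)."""
    return cv * (1.0 + gap)


# Observed gap for the best LOO experiment (exp_004).
loo_gap = relative_gap(cv=0.0623, lb=0.0956)

# Projected LB for the GroupKFold CV of exp_011 under two assumed gaps.
groupkfold_cv = 0.0841
projections = {gap: expected_lb(groupkfold_cv, gap) for gap in (0.12, 0.20)}

print(f"LOO CV-LB gap: {loo_gap:.0%}")
for gap, lb in projections.items():
    print(f"Expected LB with {gap:.0%} gap: ~{lb:.3f}")
```

Once an actual LB score for a GroupKFold experiment is available, `relative_gap` can replace the assumed 12-20% range with a measured value.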


In [3]:
# Key insight: The target (0.01727) is 5.5x better than our best LB (0.0956)
# This suggests we need a FUNDAMENTALLY different approach

print("=== GAP TO TARGET ===")
target = 0.01727
best_lb = 0.0956
gap_factor = best_lb / target
print(f"Target: {target}")
print(f"Best LB: {best_lb}")
print(f"Gap factor: {gap_factor:.1f}x")
print("\nTo reach target, we need to reduce MAE by 82%!")
print("\nPossible approaches:")
print("1. GNN with molecular graphs (paper arxiv:2512.19530 achieved MSE 0.0039)")
print("2. Better feature engineering (DRFP + Spange + physics-based)")
print("3. Per-target specialized models with Optuna tuning")
print("4. Ensemble with learned weights (not fixed)")

=== GAP TO TARGET ===
Target: 0.01727
Best LB: 0.0956
Gap factor: 5.5x

To reach target, we need to reduce MAE by 82%!

Possible approaches:
1. GNN with molecular graphs (paper arxiv:2512.19530 achieved MSE 0.0039)
2. Better feature engineering (DRFP + Spange + physics-based)
3. Per-target specialized models with Optuna tuning
4. Ensemble with learned weights (not fixed)


In [4]:
# Analyze what the top kernel does differently
print("=== TOP KERNEL (lishellliang) ANALYSIS ===")
print("\n1. VALIDATION: GroupKFold (5-fold) - SAME as exp_011")
print("\n2. MODEL ARCHITECTURE:")
print("   - MLP: [128, 64, 32] with BatchNorm + ReLU + Dropout")
print("   - XGBoost: max_depth tuned by Optuna (3-8)")
print("   - RandomForest: max_depth tuned by Optuna (5-15)")
print("   - LightGBM: num_leaves tuned by Optuna (15-63)")
print("\n3. ENSEMBLE WEIGHTS: Learned via Optuna (NOT fixed!)")
print("   - w_mlp, w_xgb, w_rf, w_lgb all tuned")
print("   - Normalized to sum to 1")
print("\n4. HYPERPARAMETERS: All tuned via Optuna")
print("   - lr: 1e-4 to 1e-2 (log scale)")
print("   - dropout: 0.1 to 0.5")
print("   - hidden_dim: 64 to 256")
print("\n5. FEATURES: Spange descriptors only")

=== TOP KERNEL (lishellliang) ANALYSIS ===

1. VALIDATION: GroupKFold (5-fold) - SAME as exp_011

2. MODEL ARCHITECTURE:
   - MLP: [128, 64, 32] with BatchNorm + ReLU + Dropout
   - XGBoost: max_depth tuned by Optuna (3-8)
   - RandomForest: max_depth tuned by Optuna (5-15)
   - LightGBM: num_leaves tuned by Optuna (15-63)

3. ENSEMBLE WEIGHTS: Learned via Optuna (NOT fixed!)
   - w_mlp, w_xgb, w_rf, w_lgb all tuned
   - Normalized to sum to 1

4. HYPERPARAMETERS: All tuned via Optuna
   - lr: 1e-4 to 1e-2 (log scale)
   - dropout: 0.1 to 0.5
   - hidden_dim: 64 to 256

5. FEATURES: Spange descriptors only
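
To illustrate item 3 (learned ensemble weights normalized to sum to 1), here is a minimal sketch. It swaps Optuna for a plain random search so that it needs only NumPy, and it uses synthetic out-of-fold predictions. The data, the noise levels, and the names `oof_preds` and `blend_mae` are all assumptions for illustration and are not taken from the kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for out-of-fold predictions of four models (hypothetical data).
y_true = rng.normal(size=200)
oof_preds = {
    "mlp": y_true + rng.normal(scale=0.30, size=200),
    "xgb": y_true + rng.normal(scale=0.20, size=200),
    "rf": y_true + rng.normal(scale=0.40, size=200),
    "lgb": y_true + rng.normal(scale=0.25, size=200),
}
P = np.column_stack(list(oof_preds.values()))  # shape: (n_samples, n_models)


def blend_mae(raw_weights: np.ndarray) -> float:
    """MAE of the weighted blend after normalizing the weights to sum to 1."""
    w = raw_weights / raw_weights.sum()
    return float(np.mean(np.abs(P @ w - y_true)))


# Start from equal (fixed) weights, then keep any sampled weights that score better.
equal = np.ones(P.shape[1])
equal_mae = blend_mae(equal)
best_raw, best_mae = equal, equal_mae
for _ in range(500):
    candidate = rng.uniform(0.01, 1.0, size=P.shape[1])
    mae = blend_mae(candidate)
    if mae < best_mae:
        best_raw, best_mae = candidate, mae

best_weights = best_raw / best_raw.sum()
print(dict(zip(oof_preds, best_weights.round(3))), round(best_mae, 4))
```

With Optuna, each raw weight would typically be drawn with `trial.suggest_float` inside the objective, and the same normalization would apply. Weights tuned on the same out-of-fold predictions used for scoring can overfit, so a held-out check is advisable.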


In [5]:
# Critical issue: Template compliance
print("=== TEMPLATE COMPLIANCE ISSUE ===")
print("\nexp_011 used INCORRECT format:")
print("```python")
print("predictions_df = pd.DataFrame(predictions.numpy(), columns=['target_1', 'target_2', 'target_3'])")
print("predictions_df['fold'] = fold_idx")
print("predictions_df['task'] = 0")
print("# MISSING 'row' column!")
print("```")
print("\nCORRECT format (from template):")
print("```python")
print("for row_idx, row in enumerate(predictions_np):")
print("    all_predictions.append({")
print("        'task': 0,")
print("        'fold': fold_idx,")
print("        'row': row_idx,  # <-- REQUIRED")
print("        'target_1': row[0],")
print("        'target_2': row[1],")
print("        'target_3': row[2]")
print("    })")
print("```")
print("\nThis MUST be fixed before submission!")

=== TEMPLATE COMPLIANCE ISSUE ===

exp_011 used INCORRECT format:
```python
predictions_df = pd.DataFrame(predictions.numpy(), columns=['target_1', 'target_2', 'target_3'])
predictions_df['fold'] = fold_idx
predictions_df['task'] = 0
# MISSING 'row' column!
```

CORRECT format (from template):
```python
for row_idx, row in enumerate(predictions_np):
    all_predictions.append({
        'task': 0,
        'fold': fold_idx,
        'row': row_idx,  # <-- REQUIRED
        'target_1': row[0],
        'target_2': row[1],
        'target_3': row[2]
    })
```

This MUST be fixed before submission!
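
A small pre-submission check can catch this kind of omission early. The sketch below mirrors the template loop shown above and verifies that all expected columns are present. The `REQUIRED_COLUMNS` list is inferred from that loop, and the helper names are hypothetical; the official template remains the reference for the final submission cells.

```python
import numpy as np
import pandas as pd

# Columns inferred from the template loop above (assumed to be the full required set).
REQUIRED_COLUMNS = ["task", "fold", "row", "target_1", "target_2", "target_3"]


def fold_predictions_to_rows(predictions_np, fold_idx, task=0):
    """Build one dict per prediction row, following the template loop."""
    rows = []
    for row_idx, row in enumerate(predictions_np):
        rows.append({
            "task": task,
            "fold": fold_idx,
            "row": row_idx,  # index of the row within the fold
            "target_1": row[0],
            "target_2": row[1],
            "target_3": row[2],
        })
    return rows


def missing_columns(frame: pd.DataFrame) -> list:
    """Return the required columns that are absent from the frame."""
    return [col for col in REQUIRED_COLUMNS if col not in frame.columns]


# Dummy predictions for one fold: 4 rows, 3 targets.
good = pd.DataFrame(fold_predictions_to_rows(np.zeros((4, 3)), fold_idx=2))
bad = good.drop(columns=["row"])  # reproduces the exp_011 mistake

print(missing_columns(good))
print(missing_columns(bad))
```

Asserting `not missing_columns(submission_df)` just before writing the file would have flagged the exp_011 format.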


In [6]:
# Strategy recommendations
print("=== STRATEGY RECOMMENDATIONS ===")
print("\n1. FIX TEMPLATE COMPLIANCE (CRITICAL)")
print("   - Revert last 3 cells to exact template format")
print("   - Only change model definition line")
print("   - Move CV calculation to earlier cells")
print("\n2. USE OPTUNA FOR HYPERPARAMETER TUNING")
print("   - Top kernel uses Optuna for ALL hyperparameters")
print("   - Including ensemble weights (not fixed!)")
print("   - This could significantly improve performance")
print("\n3. TRY DIFFERENT FEATURE COMBINATIONS")
print("   - DRFP + Spange (from earlier analysis)")
print("   - Physics-based features (Arrhenius)")
print("\n4. SUBMIT TO VERIFY CV-LB CORRELATION")
print("   - GroupKFold should give better CV-LB correlation")
print("   - Need to verify with actual LB score")
print("\n5. CONSIDER GNN APPROACH")
print("   - Paper arxiv:2512.19530 achieved MSE 0.0039")
print("   - About 25x lower than our best LB (0.0956), but MSE and MAE are not directly comparable")
print("   - May require significant implementation effort")

=== STRATEGY RECOMMENDATIONS ===

1. FIX TEMPLATE COMPLIANCE (CRITICAL)
   - Revert last 3 cells to exact template format
   - Only change model definition line
   - Move CV calculation to earlier cells

2. USE OPTUNA FOR HYPERPARAMETER TUNING
   - Top kernel uses Optuna for ALL hyperparameters
   - Including ensemble weights (not fixed!)
   - This could significantly improve performance

3. TRY DIFFERENT FEATURE COMBINATIONS
   - DRFP + Spange (from earlier analysis)
   - Physics-based features (Arrhenius)

4. SUBMIT TO VERIFY CV-LB CORRELATION
   - GroupKFold should give better CV-LB correlation
   - Need to verify with actual LB score

5. CONSIDER GNN APPROACH
   - Paper arxiv:2512.19530 achieved MSE 0.0039
   - About 25x lower than our best LB (0.0956), but MSE and MAE are not directly comparable
   - May require significant implementation effort
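
Recommendation 4 assumes that GroupKFold keeps every group entirely on one side of each split, which is the usual reason it tracks the LB more closely than LOO. The sketch below checks that property on synthetic data with scikit-learn's `GroupKFold`. Grouping by solvent is an assumption made here for illustration, and the arrays are random placeholders rather than the competition data.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Placeholder data: 60 rows, 4 features, 10 hypothetical solvents with 6 rows each.
n_samples = 60
X = rng.normal(size=(n_samples, 4))
y = rng.normal(size=n_samples)
groups = np.repeat(np.arange(10), 6)

splitter = GroupKFold(n_splits=5)
leaks = []       # number of groups shared between train and test, per fold
fold_sizes = []  # number of test rows, per fold
for fold_idx, (train_idx, test_idx) in enumerate(splitter.split(X, y, groups=groups)):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    leaks.append(len(shared))
    fold_sizes.append(len(test_idx))
    print(f"fold {fold_idx}: {len(test_idx)} test rows, {len(shared)} shared groups")
```

Every fold should report zero shared groups, so each test fold contains only solvents the model has not seen during training for that fold.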
