# Loop 14 Analysis: Strategic Assessment

## Key Questions:
1. What's the CV-LB gap pattern across experiments?
2. Why did exp_013 (Optuna) have worse CV than exp_004?
3. What approaches haven't been tried yet?
4. What's the best path to beat 0.01727?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Experiment history
experiments = [
    {'id': 'exp_000', 'name': 'Baseline Ensemble', 'cv': 0.0814, 'lb': None},
    {'id': 'exp_001', 'name': 'Template Compliant', 'cv': 0.0810, 'lb': None},
    {'id': 'exp_002', 'name': 'Simple RF', 'cv': 0.0805, 'lb': None},
    {'id': 'exp_003', 'name': 'PerTarget HGB+ETR', 'cv': 0.0813, 'lb': None},
    {'id': 'exp_004', 'name': 'PerTarget NO TTA', 'cv': 0.0623, 'lb': 0.0956},  # BEST CV
    {'id': 'exp_005', 'name': 'Ridge Baseline', 'cv': 0.0896, 'lb': None},
    {'id': 'exp_006', 'name': 'Intermediate Reg', 'cv': 0.0688, 'lb': 0.0991},
    {'id': 'exp_007', 'name': 'Gaussian Process', 'cv': 0.0721, 'lb': None},
    {'id': 'exp_008', 'name': 'Diverse Ensemble', 'cv': 0.0673, 'lb': None},
    {'id': 'exp_009', 'name': 'MLP+GBDT Ensemble', 'cv': 0.0669, 'lb': None},
    {'id': 'exp_010', 'name': 'GroupKFold Ensemble', 'cv': 0.0841, 'lb': None},
    {'id': 'exp_011', 'name': 'Template GroupKFold', 'cv': 0.0844, 'lb': 'ERROR'},  # Failed
    {'id': 'exp_012', 'name': 'LOO Ensemble', 'cv': 0.0827, 'lb': None},
    {'id': 'exp_013', 'name': 'Optuna PerTarget', 'cv': 0.0834, 'lb': None},  # Latest
]

df = pd.DataFrame(experiments)
print("Experiment Summary:")
print(df.to_string(index=False))

Experiment Summary:
     id                name     cv      lb
exp_000   Baseline Ensemble 0.0814    None
exp_001  Template Compliant 0.0810    None
exp_002           Simple RF 0.0805    None
exp_003   PerTarget HGB+ETR 0.0813    None
exp_004    PerTarget NO TTA 0.0623  0.0956
exp_005      Ridge Baseline 0.0896    None
exp_006    Intermediate Reg 0.0688  0.0991
exp_007    Gaussian Process 0.0721    None
exp_008    Diverse Ensemble 0.0673    None
exp_009   MLP+GBDT Ensemble 0.0669    None
exp_010 GroupKFold Ensemble 0.0841    None
exp_011 Template GroupKFold 0.0844   ERROR
exp_012        LOO Ensemble 0.0827    None
exp_013    Optuna PerTarget 0.0834    None


In [2]:
# CV-LB Gap Analysis
submissions = [
    {'exp': 'exp_004', 'cv': 0.0623, 'lb': 0.0956, 'gap': (0.0956-0.0623)/0.0623},
    {'exp': 'exp_006', 'cv': 0.0688, 'lb': 0.0991, 'gap': (0.0991-0.0688)/0.0688},
]

print("\n=== CV-LB GAP ANALYSIS ===")
for s in submissions:
    print(f"{s['exp']}: CV={s['cv']:.4f}, LB={s['lb']:.4f}, Gap={s['gap']*100:.1f}%")

print("\n=== KEY INSIGHT ===")
print("exp_004 (best CV 0.0623) had 53% gap to LB (0.0956)")
print("exp_006 (worse CV 0.0688) had 44% gap to LB (0.0991)")
print("More regularization made LB WORSE (0.0956 -> 0.0991), not better!")
print("This suggests the problem is NOT traditional overfitting (tentative: only 2 submissions)")
print("We need BETTER features/models that generalize to unseen solvents")


=== CV-LB GAP ANALYSIS ===
exp_004: CV=0.0623, LB=0.0956, Gap=53.5%
exp_006: CV=0.0688, LB=0.0991, Gap=44.0%

=== KEY INSIGHT ===
exp_004 (best CV 0.0623) had 53% gap to LB (0.0956)
exp_006 (worse CV 0.0688) had 44% gap to LB (0.0991)
More regularization made LB WORSE (0.0956 -> 0.0991), not better!
This suggests the problem is NOT traditional overfitting (tentative: only 2 submissions)
We need BETTER features/models that generalize to unseen solvents
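
The gap figures above are relative to CV only. As a minimal sketch (values copied from the two submissions listed in this notebook; the column names `abs_gap`, `rel_gap`, and `lb_over_cv` are illustrative, not from any existing code), the same comparison can be laid out with absolute and ratio views side by side:

```python
import pandas as pd

# Only the two experiments that have leaderboard scores (values from the table above).
scored = pd.DataFrame(
    {"cv": [0.0623, 0.0688], "lb": [0.0956, 0.0991]},
    index=["exp_004", "exp_006"],
)

# Absolute gap: how many score units the leaderboard is worse than CV.
scored["abs_gap"] = scored["lb"] - scored["cv"]
# Relative gap: the same difference as a fraction of the CV score.
scored["rel_gap"] = scored["abs_gap"] / scored["cv"]
# LB/CV ratio: a roughly constant ratio would suggest CV still ranks models usefully.
scored["lb_over_cv"] = scored["lb"] / scored["cv"]

print(scored.round(4))
```

With only two scored submissions, any trend read from this table is provisional. The absolute gap is about 0.033 for exp_004 and 0.030 for exp_006, so regularization narrowed the gap slightly while both CV and LB got worse.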


In [3]:
# What approaches have been tried?
approaches_tried = {
    'Per-Target Models': ['exp_004 (HGB+ETR)', 'exp_013 (Optuna HGB+ETR)'],
    'Ensembles': ['exp_000 (MLP+XGB+LGB+RF)', 'exp_008 (PerTarget+RF+XGB+LGB)', 'exp_009 (MLP+GBDT)'],
    'Simple Models': ['exp_002 (RF)', 'exp_005 (Ridge)', 'exp_007 (GP)'],
    'Regularization': ['exp_006 (depth=5/7)', 'exp_013 (Optuna-tuned)'],
    'Features': ['Spange', 'DRFP-PCA', 'ACS_PCA', 'Combined'],
    'Validation': ['LOO (required)', 'GroupKFold (internal only)'],
}

print("\n=== APPROACHES TRIED ===")
for approach, exps in approaches_tried.items():
    print(f"\n{approach}:")
    for e in exps:
        print(f"  - {e}")

print("\n=== APPROACHES NOT YET TRIED ===")
print("1. MLP + Per-Target Hybrid (MLP for some targets, GBDT for others)")
print("2. Optuna for ensemble WEIGHTS (not just hyperparameters)")
print("3. Different MLP architectures (deeper, wider, residual)")
print("4. Feature interactions (polynomial, cross-terms)")
print("5. Target-specific feature selection")
print("6. Stacking with meta-learner")


=== APPROACHES TRIED ===

Per-Target Models:
  - exp_004 (HGB+ETR)
  - exp_013 (Optuna HGB+ETR)

Ensembles:
  - exp_000 (MLP+XGB+LGB+RF)
  - exp_008 (PerTarget+RF+XGB+LGB)
  - exp_009 (MLP+GBDT)

Simple Models:
  - exp_002 (RF)
  - exp_005 (Ridge)
  - exp_007 (GP)

Regularization:
  - exp_006 (depth=5/7)
  - exp_013 (Optuna-tuned)

Features:
  - Spange
  - DRFP-PCA
  - ACS_PCA
  - Combined

Validation:
  - LOO (required)
  - GroupKFold (internal only)

=== APPROACHES NOT YET TRIED ===
1. MLP + Per-Target Hybrid (MLP for some targets, GBDT for others)
2. Optuna for ensemble WEIGHTS (not just hyperparameters)
3. Different MLP architectures (deeper, wider, residual)
4. Feature interactions (polynomial, cross-terms)
5. Target-specific feature selection
6. Stacking with meta-learner
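
Item 2 in the not-yet-tried list (optimizing ensemble weights) can be illustrated without committing to a particular optimizer. The sketch below swaps in a plain random search over the weight simplex as a dependency-free stand-in for Optuna; the function name `search_ensemble_weights` and the synthetic out-of-fold predictions are assumptions made for illustration, not part of the existing experiments.

```python
import numpy as np

def search_ensemble_weights(oof_preds, y_true, n_trials=2000, seed=0):
    """Random search over the probability simplex for blend weights.

    oof_preds: array (n_models, n_samples) of out-of-fold predictions.
    y_true:    array (n_samples,) of targets.
    Returns (best_weights, best_mse).
    """
    rng = np.random.default_rng(seed)
    n_models = oof_preds.shape[0]
    # Start from the uniform blend so the result is never worse than plain averaging.
    best_w = np.full(n_models, 1.0 / n_models)
    best_mse = np.mean((best_w @ oof_preds - y_true) ** 2)
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(n_models))  # non-negative weights that sum to 1
        mse = np.mean((w @ oof_preds - y_true) ** 2)
        if mse < best_mse:
            best_w, best_mse = w, mse
    return best_w, best_mse

# Synthetic stand-in data: three "models" with increasing noise (hypothetical).
rng = np.random.default_rng(42)
y = rng.normal(size=200)
oof = np.stack([
    y + rng.normal(scale=0.3, size=200),
    y + rng.normal(scale=0.6, size=200),
    y + rng.normal(scale=1.0, size=200),
])

weights, mse = search_ensemble_weights(oof, y)
uniform_mse = np.mean((oof.mean(axis=0) - y) ** 2)
print("weights:", np.round(weights, 3))
print("blend MSE:", round(mse, 4), "uniform MSE:", round(uniform_mse, 4))
```

The weights should be fitted on out-of-fold predictions produced under the same validation scheme used for scoring; fitting them on in-sample predictions would favor whichever model overfits most.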


In [4]:
# Why did exp_013 (Optuna) have worse CV than exp_004?
print("\n=== WHY OPTUNA DIDN'T HELP ===")
print("\nexp_004 (best CV 0.0623):")
print("  - HGB: default params (depth=None, lr=0.1, iter=100)")
print("  - ETR: default params (depth=None, n_estimators=100)")
print("  - Features: Combined (0.8*ACS_PCA + 0.2*Spange)")

print("\nexp_013 (CV 0.0834):")
print("  - HGB: depth=3-4, lr=0.09-0.20, iter=208-326")
print("  - ETR: depth=6-20, n_estimators=188-494")
print("  - Features: Spange only")

print("\n=== HYPOTHESIS ===")
print("1. Optuna chose SHALLOW HGB models (depth 3-4; ETR depth ranged 6-20), which may underfit")
print("2. exp_004 used COMBINED features, exp_013 used Spange only")
print("3. GroupKFold CV during Optuna may not correlate with LOO CV")
print("4. 50 trials may not be enough to find optimal params")

print("\n=== NEXT STEPS ===")
print("1. Try Optuna with COMBINED features (like exp_004)")
print("2. Try Optuna with DEEPER models (depth 10-20)")
print("3. Try Optuna for ensemble WEIGHTS")
print("4. Try MLP + Per-Target hybrid")


=== WHY OPTUNA DIDN'T HELP ===

exp_004 (best CV 0.0623):
  - HGB: default params (depth=None, lr=0.1, iter=100)
  - ETR: default params (depth=None, n_estimators=100)
  - Features: Combined (0.8*ACS_PCA + 0.2*Spange)

exp_013 (CV 0.0834):
  - HGB: depth=3-4, lr=0.09-0.20, iter=208-326
  - ETR: depth=6-20, n_estimators=188-494
  - Features: Spange only

=== HYPOTHESIS ===
1. Optuna chose SHALLOW HGB models (depth 3-4; ETR depth ranged 6-20), which may underfit
2. exp_004 used COMBINED features, exp_013 used Spange only
3. GroupKFold CV during Optuna may not correlate with LOO CV
4. 50 trials may not be enough to find optimal params

=== NEXT STEPS ===
1. Try Optuna with COMBINED features (like exp_004)
2. Try Optuna with DEEPER models (depth 10-20)
3. Try Optuna for ensemble WEIGHTS
4. Try MLP + Per-Target hybrid
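
Hypothesis 3 concerns a mismatch between the split used during tuning (GroupKFold) and the split used for final scoring (LOO). One way to remove that mismatch is to tune under the same split that is scored. The sketch below assumes that "LOO" here means holding out one solvent at a time, which the notebook hints at ("unseen solvents") but does not state; the helper `leave_one_group_out` and the solvent labels are hypothetical.

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx), holding each group out once."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test_mask = groups == g
        yield g, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical solvent label for each training row.
solvents = np.array(["water", "water", "ethanol", "thf", "thf", "thf"])

folds = list(leave_one_group_out(solvents))
for name, train_idx, test_idx in folds:
    print(name, "train:", train_idx, "test:", test_idx)
```

Passing these index pairs to the tuning objective would make the tuned score directly comparable to the reported LOO CV, at the cost of one fit per solvent per trial.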


In [5]:
# Best performing experiment analysis
print("\n=== BEST EXPERIMENT: exp_004 ===")
print("Model: Per-Target (HGB for SM, ETR for Products)")
print("Features: Combined (0.8*ACS_PCA + 0.2*Spange)")
print("CV: 0.0623 (LOO)")
print("LB: 0.0956")
print("Gap: 53%")

print("\n=== WHAT MADE IT WORK ===")
print("1. Per-target models: Different models for different targets")
print("2. HGB for SM: Gradient boosting captures kinetics patterns")
print("3. ETR for Products: Robust to outliers, handles noise")
print("4. Combined features: ACS_PCA + Spange captures chemistry")
print("5. NO TTA: Test-time augmentation hurt performance")

print("\n=== WHAT TO IMPROVE ===")
print("1. Add MLP component for non-linear patterns")
print("2. Optimize ensemble weights with Optuna")
print("3. Try deeper models (exp_013 found shallow was worse)")
print("4. Add feature interactions")


=== BEST EXPERIMENT: exp_004 ===
Model: Per-Target (HGB for SM, ETR for Products)
Features: Combined (0.8*ACS_PCA + 0.2*Spange)
CV: 0.0623 (LOO)
LB: 0.0956
Gap: 53%

=== WHAT MADE IT WORK ===
1. Per-target models: Different models for different targets
2. HGB for SM: Gradient boosting captures kinetics patterns
3. ETR for Products: Robust to outliers, handles noise
4. Combined features: ACS_PCA + Spange captures chemistry
5. NO TTA: Test-time augmentation hurt performance

=== WHAT TO IMPROVE ===
1. Add MLP component for non-linear patterns
2. Optimize ensemble weights with Optuna
3. Try deeper models (exp_013 found shallow was worse)
4. Add feature interactions


In [6]:
# Target analysis
target = 0.01727   # score to beat
best_lb = 0.0956   # exp_004, best leaderboard score so far
print("\n=== TARGET SCORE ANALYSIS ===")
print(f"Target: {target}")
print(f"Best LB: {best_lb}")
print(f"Gap to target: {best_lb / target:.1f}x")

print("\n=== IS THE TARGET REACHABLE? ===")
print("The target (0.01727) is 5.5x lower than our best LB (0.0956)")
print("This is a HUGE gap that requires fundamentally different approaches")

print("\n=== POTENTIAL PATHS ===")
print("1. GNN with learned solvent encodings (arxiv:2512.19530 achieved MSE 0.0039)")
print("2. Transfer learning from large reaction datasets")
print("3. Physics-informed features (Arrhenius kinetics)")
print("4. Ensemble of diverse models with optimized weights")
print("5. Feature engineering for solvent generalization")

print("\n=== IMMEDIATE NEXT STEPS ===")
print("1. Try exp_004 approach with Optuna for ensemble weights")
print("2. Add MLP to per-target ensemble")
print("3. Try deeper models (exp_013 found shallow was worse)")
print("4. Submit exp_013 to verify CV-LB correlation")


=== TARGET SCORE ANALYSIS ===
Target: 0.01727
Best LB: 0.0956
Gap to target: 5.5x

=== IS THE TARGET REACHABLE? ===
The target (0.01727) is 5.5x lower than our best LB (0.0956)
This is a HUGE gap that requires fundamentally different approaches

=== POTENTIAL PATHS ===
1. GNN with learned solvent encodings (arxiv:2512.19530 achieved MSE 0.0039)
2. Transfer learning from large reaction datasets
3. Physics-informed features (Arrhenius kinetics)
4. Ensemble of diverse models with optimized weights
5. Feature engineering for solvent generalization

=== IMMEDIATE NEXT STEPS ===
1. Try exp_004 approach with Optuna for ensemble weights
2. Add MLP to per-target ensemble
3. Try deeper models (exp_013 found shallow was worse)
4. Submit exp_013 to verify CV-LB correlation
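
Path 3 under POTENTIAL PATHS (physics-informed features) can be made concrete with the Arrhenius relation k = A·exp(-Ea / (R·T)), under which ln k is linear in 1/T. The sketch below is a rough illustration only: it assumes the dataset exposes a temperature in degrees Celsius and a positive reaction time, neither of which is confirmed in this notebook, and the function name `arrhenius_features` is hypothetical.

```python
import numpy as np

def arrhenius_features(temp_celsius, time_minutes):
    """Build kinetics-motivated features from temperature and time.

    Assumes temperatures are in Celsius and times are strictly positive.
    Returns an array of shape (n_samples, 3): [1/T, ln(t), (1/T) * ln(t)].
    """
    temp_k = np.asarray(temp_celsius, dtype=float) + 273.15  # convert to Kelvin
    time = np.asarray(time_minutes, dtype=float)
    inv_t = 1.0 / temp_k       # ln k is linear in 1/T under Arrhenius
    log_time = np.log(time)    # first-order conversion depends on the product k * t
    return np.column_stack([inv_t, log_time, inv_t * log_time])

feats = arrhenius_features([25.0, 100.0], [1.0, 10.0])
print(feats.shape)
```

Tree models are invariant to monotone transforms of a single input, so the 1/T and ln(t) columns alone may add little for HGB or ETR; the interaction column and any linear or MLP component are where such features would be more likely to matter.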
