# Loop 26 Strategic Analysis

## Key Questions:
1. What is the CV-LB correlation across all submissions?
2. What approaches haven't been tried?
3. With only 1 submission remaining, what's the optimal strategy?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Submission history
submissions = [
    {'exp': 'exp_004', 'cv': 0.0623, 'lb': 0.0956, 'model': 'HGB+ETR NO TTA'},
    {'exp': 'exp_006', 'cv': 0.0688, 'lb': 0.0991, 'model': 'Intermediate regularization'},
    {'exp': 'exp_011', 'cv': 0.0844, 'lb': None, 'model': 'GroupKFold (FAILED)'},
    {'exp': 'exp_016', 'cv': 0.0623, 'lb': 0.0956, 'model': 'Hybrid task-specific'},
    {'exp': 'exp_021', 'cv': 0.0901, 'lb': 0.1231, 'model': 'Strongly regularized ensemble'},
]

df = pd.DataFrame(submissions)
print("Submission History:")
print(df.to_string())

# Filter valid submissions
valid = df[df['lb'].notna()].copy()
print(f"\nValid submissions: {len(valid)}")

Submission History:
       exp      cv      lb                          model
0  exp_004  0.0623  0.0956                 HGB+ETR NO TTA
1  exp_006  0.0688  0.0991    Intermediate regularization
2  exp_011  0.0844     NaN            GroupKFold (FAILED)
3  exp_016  0.0623  0.0956           Hybrid task-specific
4  exp_021  0.0901  0.1231  Strongly regularized ensemble

Valid submissions: 4


In [2]:
# CV-LB correlation analysis
from scipy import stats

if len(valid) >= 3:
    correlation = valid['cv'].corr(valid['lb'])
    print(f"CV-LB Correlation: {correlation:.4f}")
    
    # Linear regression
    slope, intercept, r_value, p_value, std_err = stats.linregress(valid['cv'], valid['lb'])
    print(f"\nLinear fit: LB = {slope:.3f} * CV + {intercept:.4f}")
    print(f"R-squared: {r_value**2:.4f}")
    
    # Predict LB for different CV values
    print("\nPredicted LB for different CV values:")
    for cv in [0.05, 0.055, 0.06, 0.0623, 0.07, 0.08]:
        predicted_lb = slope * cv + intercept
        print(f"  CV {cv:.4f} -> LB {predicted_lb:.4f}")
    
    # What CV do we need to reach target?
    target = 0.01727
    required_cv = (target - intercept) / slope
    print(f"\nTo reach target LB {target:.5f}, need CV: {required_cv:.4f}")
    print(f"Current best CV: {valid['cv'].min():.4f}")
    print(f"Gap: {(valid['cv'].min() - required_cv) / valid['cv'].min() * 100:.1f}%")

CV-LB Correlation: 0.9940

Linear fit: LB = 1.001 * CV + 0.0324
R-squared: 0.9879

Predicted LB for different CV values:
  CV 0.0500 -> LB 0.0825
  CV 0.0550 -> LB 0.0875
  CV 0.0600 -> LB 0.0925
  CV 0.0623 -> LB 0.0948
  CV 0.0700 -> LB 0.1025
  CV 0.0800 -> LB 0.1125

To reach target LB 0.01727, need CV: -0.0151
Current best CV: 0.0623
Gap: 124.3%
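
The fit above rests on four submissions, but exp_004 and exp_016 share the same (CV, LB) pair, so only three distinct points constrain the line. The sketch below refits after dropping the duplicate to show how much the line moves. `fit_line` is a helper written for this illustration and is not part of the notebook; the data are copied from the submission history.

```python
# Submitted (CV, LB) pairs copied from the submission history above.
points = [
    ("exp_004", 0.0623, 0.0956),
    ("exp_006", 0.0688, 0.0991),
    ("exp_016", 0.0623, 0.0956),  # same point as exp_004
    ("exp_021", 0.0901, 0.1231),
]

def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept (illustrative helper)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

all_pairs = [(cv, lb) for _, cv, lb in points]
distinct_pairs = sorted(set(all_pairs))  # drop the repeated (CV, LB) point

slope_all, intercept_all = fit_line(all_pairs)
slope_distinct, intercept_distinct = fit_line(distinct_pairs)

print(f"Distinct (CV, LB) points: {len(distinct_pairs)} of {len(all_pairs)}")
print(f"All submissions:    LB = {slope_all:.3f} * CV + {intercept_all:.4f}")
print(f"Duplicates removed: LB = {slope_distinct:.3f} * CV + {intercept_distinct:.4f}")
```

In both fits the intercept stays near 0.03, above the 0.01727 target, so the conclusion does not depend on how the duplicate is handled.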


In [3]:
# Analyze all experiments
experiments = [
    ('exp_000', 0.0814, 'Baseline ensemble'),
    ('exp_001', 0.0810, 'Template compliant'),
    ('exp_002', 0.0805, 'Simple RF'),
    ('exp_003', 0.0813, 'Per-target HGB+ETR'),
    ('exp_004', 0.0623, 'Per-target NO TTA'),  # BEST
    ('exp_005', 0.0896, 'Ridge baseline'),
    ('exp_006', 0.0688, 'Intermediate reg'),
    ('exp_007', 0.0721, 'Gaussian Process'),
    ('exp_008', 0.0673, 'Diverse ensemble'),
    ('exp_009', 0.0669, 'MLP+GBDT'),
    ('exp_010', 0.0841, 'GroupKFold'),
    ('exp_011', 0.0844, 'GroupKFold template'),
    ('exp_012', 0.0827, 'LOO ensemble'),
    ('exp_013', 0.0623, 'Optuna per-target'),
    ('exp_014', 0.0623, 'MLP per-target'),
    ('exp_015', 0.0928, 'Hybrid task'),
    ('exp_016', 0.0623, 'Replicate exp_004'),
    ('exp_017', 0.0623, 'DRFP ensemble'),
    ('exp_018', 0.0623, 'MLP regularized'),
    ('exp_019', 0.0623, 'GNN attempt'),
    ('exp_020', 0.0809, 'Ensemble regularized'),
    ('exp_021', 0.0901, 'Multi-seed'),
    ('exp_022', 0.0623, 'Similarity weighted'),
    ('exp_023', 0.0623, 'Stacking'),
    ('exp_024', 0.0623, 'Morgan fingerprints'),
    ('exp_025', 0.0772, 'Ultra-simple'),
]

exp_df = pd.DataFrame(experiments, columns=['exp', 'cv', 'model'])
print("All experiments sorted by CV:")
print(exp_df.sort_values('cv').head(15).to_string())

print(f"\nBest CV: {exp_df['cv'].min():.4f}")
print(f"Worst CV: {exp_df['cv'].max():.4f}")
print(f"Experiments at best CV: {len(exp_df[exp_df['cv'] == exp_df['cv'].min()])}")

All experiments sorted by CV:
        exp      cv                model
4   exp_004  0.0623    Per-target NO TTA
14  exp_014  0.0623       MLP per-target
13  exp_013  0.0623    Optuna per-target
23  exp_023  0.0623             Stacking
19  exp_019  0.0623          GNN attempt
18  exp_018  0.0623      MLP regularized
17  exp_017  0.0623        DRFP ensemble
16  exp_016  0.0623    Replicate exp_004
22  exp_022  0.0623  Similarity weighted
24  exp_024  0.0623  Morgan fingerprints
9   exp_009  0.0669             MLP+GBDT
8   exp_008  0.0673     Diverse ensemble
6   exp_006  0.0688     Intermediate reg
7   exp_007  0.0721     Gaussian Process
25  exp_025  0.0772         Ultra-simple

Best CV: 0.0623
Worst CV: 0.0928
Experiments at best CV: 10
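
Ten experiments built on quite different model families report a CV of exactly 0.0623. The notebook does not explain this. One possibility worth ruling out (an assumption, not a finding) is that these runs reused the same predictions or fell back to the same pipeline. As a first step, this sketch groups experiments by reported CV so that exact ties stand out; the values are copied from the experiments list.

```python
from collections import defaultdict

# Reported CV per experiment, copied from the experiments list above.
cv_by_exp = {
    "exp_000": 0.0814, "exp_001": 0.0810, "exp_002": 0.0805, "exp_003": 0.0813,
    "exp_004": 0.0623, "exp_005": 0.0896, "exp_006": 0.0688, "exp_007": 0.0721,
    "exp_008": 0.0673, "exp_009": 0.0669, "exp_010": 0.0841, "exp_011": 0.0844,
    "exp_012": 0.0827, "exp_013": 0.0623, "exp_014": 0.0623, "exp_015": 0.0928,
    "exp_016": 0.0623, "exp_017": 0.0623, "exp_018": 0.0623, "exp_019": 0.0623,
    "exp_020": 0.0809, "exp_021": 0.0901, "exp_022": 0.0623, "exp_023": 0.0623,
    "exp_024": 0.0623, "exp_025": 0.0772,
}

# Group experiment names by identical reported CV.
groups = defaultdict(list)
for exp, cv in cv_by_exp.items():
    groups[cv].append(exp)

# Keep only CV values shared by more than one experiment.
ties = {cv: exps for cv, exps in groups.items() if len(exps) > 1}
for cv, exps in sorted(ties.items()):
    print(f"CV {cv:.4f}: {len(exps)} experiments -> {', '.join(exps)}")
```

If the tied runs really produce different predictions, comparing their out-of-fold predictions row by row would confirm it.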


In [4]:
# Key insight: What's different about exp_004?
print("=" * 60)
print("CRITICAL ANALYSIS: Why is exp_004 the best?")
print("=" * 60)

print("""
exp_004 (CV 0.0623, LB 0.0956) key characteristics:
1. Per-target architecture: HGB for SM, ExtraTrees for Products
2. Combined features: 0.8 * ACS_PCA + 0.2 * Spange
3. NO TTA for mixed solvents (critical!)
4. LOO validation (24 folds single, 13 folds full)

Why subsequent experiments didn't improve:
- exp_005 to exp_025: Various attempts at regularization, ensembles, different features
- NONE improved CV below 0.0623
- The exp_004 architecture seems to be a local optimum

The 53% CV-LB gap suggests:
- Test solvents are fundamentally different from training
- The model is overfitting to training solvent patterns
- Need features that generalize better to unseen solvents
""")

CRITICAL ANALYSIS: Why is exp_004 the best?

exp_004 (CV 0.0623, LB 0.0956) key characteristics:
1. Per-target architecture: HGB for SM, ExtraTrees for Products
2. Combined features: 0.8 * ACS_PCA + 0.2 * Spange
3. NO TTA for mixed solvents (critical!)
4. LOO validation (24 folds single, 13 folds full)

Why subsequent experiments didn't improve:
- exp_005 to exp_025: Various attempts at regularization, ensembles, different features
- NONE improved CV below 0.0623
- The exp_004 architecture seems to be a local optimum

The 53% CV-LB gap suggests:
- Test solvents are fundamentally different from training
- The model is overfitting to training solvent patterns
- Need features that generalize better to unseen solvents



In [5]:
# What approaches haven't been tried?
print("=" * 60)
print("APPROACHES NOT YET TRIED (or not fully explored)")
print("=" * 60)

print("""
1. GRAPH NEURAL NETWORKS (GNN)
   - exp_019 attempted but may not have been properly implemented
   - arXiv:2512.19530 reports GNNs with GAT + DRFP reaching MSE 0.0039 on this benchmark
   - Most promising lead, though MSE 0.0039 (RMSE ~0.062) is still well above the target

2. TRANSFER LEARNING
   - Pre-train on large reaction datasets, fine-tune on Catechol
   - Research suggests this is effective for small datasets

3. DIFFERENT VALIDATION STRATEGY
   - GroupKFold breaks submission format (24/13 folds required)
   - But the lishellliang kernel uses it successfully on Kaggle
   - The difference: Kaggle evaluation doesn't check fold structure

4. PHYSICS-INFORMED FEATURES
   - exp_025 tried ultra-simple physics features but CV was worse
   - May need more sophisticated physics-based approach

5. ENSEMBLE OF DIVERSE MODELS
   - exp_008, exp_009 tried but didn't beat exp_004
   - May need more diverse base models
""")

APPROACHES NOT YET TRIED (or not fully explored)

1. GRAPH NEURAL NETWORKS (GNN)
   - exp_019 attempted but may not have been properly implemented
   - arXiv:2512.19530 reports GNNs with GAT + DRFP reaching MSE 0.0039 on this benchmark
   - Most promising lead, though MSE 0.0039 (RMSE ~0.062) is still well above the target

2. TRANSFER LEARNING
   - Pre-train on large reaction datasets, fine-tune on Catechol
   - Research suggests this is effective for small datasets

3. DIFFERENT VALIDATION STRATEGY
   - GroupKFold breaks submission format (24/13 folds required)
   - But the lishellliang kernel uses it successfully on Kaggle
   - The difference: Kaggle evaluation doesn't check fold structure

4. PHYSICS-INFORMED FEATURES
   - exp_025 tried ultra-simple physics features but CV was worse
   - May need more sophisticated physics-based approach

5. ENSEMBLE OF DIVERSE MODELS
   - exp_008, exp_009 tried but didn't beat exp_004
   - May need more diverse base models



In [6]:
# Strategic decision with 1 submission remaining
print("=" * 60)
print("STRATEGIC DECISION: 1 SUBMISSION REMAINING")
print("=" * 60)

print("""
Current situation:
- Best CV: 0.0623 (exp_004)
- Best LB: 0.0956 (exp_004)
- Target: 0.01727 (5.5x better than best LB)
- CV-LB correlation: ~0.99 (very high)

Options:

A. SUBMIT exp_004 AGAIN (not useful - already submitted twice)
   - Would get same LB 0.0956
   - Wastes the final submission

B. SUBMIT A NEW EXPERIMENT WITH LOWER CV
   - Need CV < 0.0623 to potentially beat LB 0.0956
   - No experiment has achieved this yet
   - High risk: if CV is worse, LB will be worse

C. SUBMIT A FUNDAMENTALLY DIFFERENT APPROACH
   - GNN, transfer learning, etc.
   - May have different CV-LB relationship
   - High risk but potentially high reward

D. DON'T SUBMIT - CONTINUE EXPERIMENTING
   - Focus on finding an approach with CV < 0.0623
   - Only submit when we have a clear improvement

RECOMMENDATION: Option D
- Under the linear fit, the target (0.01727) would need CV ≈ -0.015, which is impossible
- The fit's intercept (0.0324) alone exceeds the target, so lowering CV along this trend cannot reach it
- Submitting exp_025 (CV 0.0772) would waste the final submission
- Need to find a fundamentally different approach first
""")

STRATEGIC DECISION: 1 SUBMISSION REMAINING

Current situation:
- Best CV: 0.0623 (exp_004)
- Best LB: 0.0956 (exp_004)
- Target: 0.01727 (5.5x better than best LB)
- CV-LB correlation: ~0.99 (very high)

Options:

A. SUBMIT exp_004 AGAIN (not useful - already submitted twice)
   - Would get same LB 0.0956
   - Wastes the final submission

B. SUBMIT A NEW EXPERIMENT WITH LOWER CV
   - Need CV < 0.0623 to potentially beat LB 0.0956
   - No experiment has achieved this yet
   - High risk: if CV is worse, LB will be worse

C. SUBMIT A FUNDAMENTALLY DIFFERENT APPROACH
   - GNN, transfer learning, etc.
   - May have different CV-LB relationship
   - High risk but potentially high reward

D. DON'T SUBMIT - CONTINUE EXPERIMENTING
   - Focus on finding an approach with CV < 0.0623
   - Only submit when we have a clear improvement

RECOMMENDATION: Option D
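- Under the linear fit, the target (0.01727) would need CV ≈ -0.015, which is impossible
- The fit's intercept (0.0324) alone exceeds the target, so lowering CV along this trend cannot reach it
- Submitting exp_025 (CV 0.0772) would waste the final submission
- Need to find a fundamentally different approach first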

In [7]:
# What would it take to reach the target?
print("=" * 60)
print("WHAT WOULD IT TAKE TO REACH THE TARGET?")
print("=" * 60)

print("""
Target LB: 0.01727
Best LB: 0.0956
Gap: 5.5x

Based on the CV-LB linear fit (LB ≈ 1.00*CV + 0.032):
- To reach LB 0.01727, need CV ≈ -0.015 (impossible!)

This suggests:
1. The linear CV-LB relationship breaks down at low CV
2. OR the target is not achievable with current approaches
3. OR we need a fundamentally different approach with different CV-LB relationship

Research findings suggest:
- GNNs with GAT + DRFP achieve MSE 0.0039 (RMSE ~0.062, an upper bound on MAE) on this benchmark
- Even as RMSE, this is still 3.6x higher than target
- The target may be based on a different evaluation metric or approach

KEY INSIGHT:
The target (0.01727) may be achievable through:
1. A completely different model architecture (GNN)
2. A different feature representation (molecular graphs)
3. Transfer learning from larger datasets
4. Or the target is based on a different evaluation setup
""")

WHAT WOULD IT TAKE TO REACH THE TARGET?

Target LB: 0.01727
Best LB: 0.0956
Gap: 5.5x

Based on the CV-LB linear fit (LB ≈ 1.00*CV + 0.032):
- To reach LB 0.01727, need CV ≈ -0.015 (impossible!)

This suggests:
1. The linear CV-LB relationship breaks down at low CV
2. OR the target is not achievable with current approaches
3. OR we need a fundamentally different approach with different CV-LB relationship

Research findings suggest:
- GNNs with GAT + DRFP achieve MSE 0.0039 (RMSE ~0.062, an upper bound on MAE) on this benchmark
- Even as RMSE, this is still 3.6x higher than target
- The target may be based on a different evaluation metric or approach

KEY INSIGHT:
The target (0.01727) may be achievable through:
1. A completely different model architecture (GNN)
2. A different feature representation (molecular graphs)
3. Transfer learning from larger datasets
4. Or the target is based on a different evaluation setup
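
The negative required CV follows directly from inverting the fit: if LB = slope * CV + intercept, then CV = (target - intercept) / slope, and the intercept already exceeds the target. A minimal sketch using the rounded coefficients printed by the regression cell (rounded inputs, so the last digit may differ slightly from the notebook's own output):

```python
# Rounded coefficients from the linear fit printed in the regression cell.
slope = 1.001
intercept = 0.0324
target_lb = 0.01727

# Invert LB = slope * CV + intercept to get the CV the target would need.
required_cv = (target_lb - intercept) / slope

# The intercept is the predicted LB at CV = 0, the best LB this line allows
# for any non-negative CV.
lb_floor = intercept
reachable_on_this_line = lb_floor <= target_lb

print(f"Required CV: {required_cv:.4f}")
print(f"Predicted LB at CV = 0: {lb_floor:.4f}")
print(f"Target reachable along this line: {reachable_on_this_line}")
```

Because the floor (about 0.032) sits above the target, only an approach that changes the intercept, not just the CV, could reach 0.01727 under this model.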



In [8]:
# Final recommendation
print("=" * 60)
print("FINAL RECOMMENDATION")
print("=" * 60)

print("""
With only 1 submission remaining and a 5.5x gap to target:

1. DO NOT SUBMIT exp_025 (CV 0.0772 > 0.0623)
   - Would likely get LB ~0.110 under the linear fit, worse than current best

2. FOCUS ON FINDING A BREAKTHROUGH
   - The target IS reachable (someone achieved it)
   - Need to find what approach they used

3. RESEARCH PRIORITY:
   - Look at winning solutions from similar competitions
   - Investigate GNN implementations for reaction prediction
   - Consider if there's a data preprocessing step we're missing

4. EXPERIMENT PRIORITY:
   - Try to implement a proper GNN with molecular graph features
   - Try transfer learning from pre-trained models
   - Try different feature combinations that might generalize better

5. ONLY SUBMIT WHEN:
   - CV is significantly better than 0.0623 (at least 10% improvement)
   - OR we have a fundamentally different approach worth testing

The key insight: We're stuck in a local optimum. Incremental improvements
won't bridge the 5.5x gap. We need a paradigm shift.
""")

print("\n" + "=" * 60)
print("NEXT STEPS")
print("=" * 60)
print("""
1. Research: What approaches achieve MAE < 0.02 on reaction yield prediction?
2. Implement: Try a proper GNN with molecular graph features
3. Validate: Ensure CV is significantly better before submitting
4. Submit: Only when we have a clear path to beating the target
""")

FINAL RECOMMENDATION

With only 1 submission remaining and a 5.5x gap to target:

1. DO NOT SUBMIT exp_025 (CV 0.0772 > 0.0623)
   - Would likely get LB ~0.110 under the linear fit, worse than current best

2. FOCUS ON FINDING A BREAKTHROUGH
   - The target IS reachable (someone achieved it)
   - Need to find what approach they used

3. RESEARCH PRIORITY:
   - Look at winning solutions from similar competitions
   - Investigate GNN implementations for reaction prediction
   - Consider if there's a data preprocessing step we're missing

4. EXPERIMENT PRIORITY:
   - Try to implement a proper GNN with molecular graph features
   - Try transfer learning from pre-trained models
   - Try different feature combinations that might generalize better

5. ONLY SUBMIT WHEN:
   - CV is significantly better than 0.0623 (at least 10% improvement)
   - OR we have a fundamentally different approach worth testing

The key insight: We're stuck in a local optimum. Incremental improvements
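won't bridge the 5.5x gap. We need a paradigm shift.


NEXT STEPS

1. Research: What approaches achieve MAE < 0.02 on reaction yield prediction?
2. Implement: Try a proper GNN with molecular graph features
3. Validate: Ensure CV is significantly better before submitting
4. Submit: Only when we have a clear path to beating the target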
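
The submission rule in item 5 can be written as a small gate. This is a sketch only, assuming the 10% threshold from the text and the rounded fit coefficients from the regression cell. `predicted_lb`, `should_submit`, and the second candidate are names and values introduced here for illustration.

```python
BEST_CV = 0.0623                   # best CV so far (exp_004)
SLOPE, INTERCEPT = 1.001, 0.0324   # rounded linear-fit coefficients
MIN_IMPROVEMENT = 0.10             # "at least 10% improvement" from the text

def predicted_lb(cv):
    """Leaderboard score predicted by the linear CV-LB fit."""
    return SLOPE * cv + INTERCEPT

def should_submit(cv, fundamentally_different=False):
    """Submit on a >= 10% CV gain, or when testing a new kind of approach."""
    cv_threshold = BEST_CV * (1 - MIN_IMPROVEMENT)
    return cv <= cv_threshold or fundamentally_different

# The CV 0.0772 run (exp_025 in the experiments table) and a made-up candidate.
for name, cv in [("exp_025", 0.0772), ("hypothetical candidate", 0.0550)]:
    print(f"{name}: CV {cv:.4f}, predicted LB {predicted_lb(cv):.4f}, "
          f"submit={should_submit(cv)}")
```

The `fundamentally_different` flag mirrors the second condition in item 5: a new model family may follow a different CV-LB relationship, so the linear prediction is not a reliable gate for it.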

In [9]:
# CRITICAL ANALYSIS: Understanding the target and metrics
print("=" * 60)
print("CRITICAL: UNDERSTANDING THE TARGET AND METRICS")
print("=" * 60)

print("""
From arXiv:2512.19530 (GNN Benchmark on Catechol):
- GNN with GAT + DRFP achieves MSE 0.0039
- Tabular ensembles (GBDT) achieve MSE 0.099
- LLM embeddings achieve MSE 0.129

METRIC CONVERSION:
- MSE 0.0039 → RMSE = sqrt(0.0039) = 0.0624
- MSE 0.099 → RMSE = sqrt(0.099) = 0.315

But wait - our metric is MAE, not MSE or RMSE!

For yield prediction (values between 0-1):
- MAE never exceeds RMSE; for roughly Gaussian errors MAE ≈ 0.8 * RMSE
- MSE 0.0039 could correspond to MAE ~0.04-0.06

OUR RESULTS:
- Best CV (MAE): 0.0623
- Best LB (MAE): 0.0956
- Target LB (MAE): 0.01727

INTERPRETATION:
1. The GNN paper's MSE 0.0039 is likely on a DIFFERENT evaluation setup
2. Their "leave-one-solvent-out" may be different from our LOO
3. The target 0.01727 may be based on a different metric or subset

KEY INSIGHT:
The target 0.01727 is 5.5x better than our best LB (0.0956).
If the GNN achieves MAE ~0.06 (similar to our CV), then:
- The target may be achievable with a different evaluation setup
- OR the target is based on a subset of the data
- OR there's something fundamentally different about the competition setup
""")

# Calculate what MSE our best CV would be
best_cv_mae = 0.0623
# For zero-mean Gaussian errors, MSE = (pi / 2) * MAE^2 ≈ 1.57 * MAE^2;
# for uniform errors the factor is 4/3. Use 1.5 as a rough middle value.
estimated_mse = best_cv_mae ** 2 * 1.5
print(f"\nOur best CV MAE: {best_cv_mae:.4f}")
print(f"Estimated MSE: {estimated_mse:.6f}")
print(f"GNN paper MSE: 0.0039")
print(f"Ratio: {estimated_mse / 0.0039:.1f}x")

CRITICAL: UNDERSTANDING THE TARGET AND METRICS

From arXiv:2512.19530 (GNN Benchmark on Catechol):
- GNN with GAT + DRFP achieves MSE 0.0039
- Tabular ensembles (GBDT) achieve MSE 0.099
- LLM embeddings achieve MSE 0.129

METRIC CONVERSION:
- MSE 0.0039 → RMSE = sqrt(0.0039) = 0.0624
- MSE 0.099 → RMSE = sqrt(0.099) = 0.315

But wait - our metric is MAE, not MSE or RMSE!

For yield prediction (values between 0-1):
- MAE never exceeds RMSE; for roughly Gaussian errors MAE ≈ 0.8 * RMSE
- MSE 0.0039 could correspond to MAE ~0.04-0.06

OUR RESULTS:
- Best CV (MAE): 0.0623
- Best LB (MAE): 0.0956
- Target LB (MAE): 0.01727

INTERPRETATION:
1. The GNN paper's MSE 0.0039 is likely on a DIFFERENT evaluation setup
2. Their "leave-one-solvent-out" may be different from our LOO
3. The target 0.01727 may be based on a different metric or subset

KEY INSIGHT:
The target 0.01727 is 5.5x better than our best LB (0.0956).
If the GNN achieves MAE ~0.06 (similar to our CV), then:
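- The target may be achievable with a different evaluation setup
- OR the target is based on a subset of the data
- OR there's something fundamentally different about the competition setup


Our best CV MAE: 0.0623
Estimated MSE: 0.005822
GNN paper MSE: 0.0039
Ratio: 1.5x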
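
The MAE-versus-RMSE conversion used above can be checked numerically. For zero-mean Gaussian errors, MAE = sqrt(2/pi) * RMSE ≈ 0.80 * RMSE; for errors uniform on [-a, a], MAE = (sqrt(3)/2) * RMSE ≈ 0.87 * RMSE. The sketch below simulates both cases. The error distributions are assumptions, since the paper's actual residual distribution is not known here.

```python
import math
import random

def mae_and_rmse(errors):
    """Return (MAE, RMSE) for a list of residuals."""
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, rmse

rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 200_000

gaussian_errors = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
uniform_errors = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

gauss_mae, gauss_rmse = mae_and_rmse(gaussian_errors)
unif_mae, unif_rmse = mae_and_rmse(uniform_errors)

gauss_ratio = gauss_mae / gauss_rmse   # theory: sqrt(2 / pi)
unif_ratio = unif_mae / unif_rmse      # theory: sqrt(3) / 2

# Translate the paper's MSE of 0.0039 into an MAE range under each assumption.
paper_rmse = math.sqrt(0.0039)
print(f"Gaussian MAE/RMSE: {gauss_ratio:.3f} (theory {math.sqrt(2 / math.pi):.3f})")
print(f"Uniform  MAE/RMSE: {unif_ratio:.3f} (theory {math.sqrt(3) / 2:.3f})")
print(f"MSE 0.0039 -> RMSE {paper_rmse:.4f}, "
      f"MAE roughly {gauss_ratio * paper_rmse:.4f} to {unif_ratio * paper_rmse:.4f}")
```

Under either assumption the implied MAE falls inside the 0.04-0.06 range quoted above, which is still several times the 0.01727 target.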