# Loop 57 Analysis: Post-XGBoost+RF Ensemble Assessment

**Situation:**
- 57 experiments completed, 26 consecutive failures since exp_030
- Best LB: 0.0877 (exp_030), Target: 0.0347
- Gap: 2.53x (0.0877 / 0.0347)
- 5 submissions remaining
- exp_056 (XGBoost + RandomForest Ensemble) FAILED - CV 0.014233 (71.5% worse than exp_030's 0.008298)

**Key Finding from exp_056:**
XGBoost and RandomForest don't help - they actually hurt performance. The GP + MLP + LGBM ensemble from exp_030 remains the best approach.

**Critical Evaluator Insight:**
The 'mixall' kernel OVERWRITES the CV functions to use GroupKFold(5) instead of Leave-One-Out(24), which may be why its local CV tracks LB more closely. Our Leave-One-Out(24) CV underestimates LB by roughly 9-11x across the submissions listed below.

**Questions:**
1. Can we use GroupKFold(5) locally to understand the CV-LB relationship better?
2. What fundamentally different approaches remain?
3. How can we change the CV-LB relationship?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
    {'exp': 'exp_035', 'cv': 0.0098, 'lb': 0.0970},
]

df = pd.DataFrame(submissions)
print("Submission History:")
print(df.to_string(index=False))
print(f"\nTarget LB: 0.0347")
print(f"Best LB: {df['lb'].min():.4f} ({df.loc[df['lb'].idxmin(), 'exp']})")
print(f"Gap to target: {df['lb'].min() / 0.0347:.2f}x")

In [None]:
# CV-LB relationship analysis
cv = df['cv'].values
lb = df['lb'].values

slope, intercept, r_value, p_value, std_err = stats.linregress(cv, lb)

print(f"CV-LB Linear Relationship:")
print(f"  LB = {slope:.2f} * CV + {intercept:.4f}")
print(f"  R² = {r_value**2:.4f}")
print(f"  Intercept = {intercept:.4f}")
print(f"  Target LB = 0.0347")
print(f"")
if intercept > 0.0347:  # only report this when the fit actually supports it
    print("CRITICAL INSIGHT:")
    print(f"  Intercept ({intercept:.4f}) > Target (0.0347)")
    print(f"  Extrapolating the fit to CV=0 still gives LB = {intercept:.4f} > 0.0347")
print(f"")
print(f"Required CV to hit target:")
required_cv = (0.0347 - intercept) / slope
print(f"  CV = (0.0347 - {intercept:.4f}) / {slope:.2f} = {required_cv:.6f}")
if required_cv < 0:
    print(f"  NEGATIVE CV required - target is UNREACHABLE with current approach!")
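
The conclusion that the intercept exceeds the target rests on extrapolating well outside the observed CV range (0.0083 to 0.0123). One way to check how fragile that extrapolation is, sketched below, is to refit the line with each submission left out in turn and look at the spread of intercepts. The helper name `jackknife_intercepts` is hypothetical; the data are the twelve (CV, LB) pairs from the submission history above.

```python
import numpy as np

# (CV, LB) pairs copied from the submission history table
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])
TARGET_LB = 0.0347


def jackknife_intercepts(x, y):
    """Refit LB = slope * CV + intercept with each point left out once."""
    intercepts = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        _slope_i, intercept_i = np.polyfit(x[keep], y[keep], deg=1)
        intercepts.append(intercept_i)
    return np.array(intercepts)


full_slope, full_intercept = np.polyfit(cv, lb, deg=1)
jk = jackknife_intercepts(cv, lb)

print(f"Full fit: LB = {full_slope:.2f} * CV + {full_intercept:.4f}")
print(f"Leave-one-out intercept range: {jk.min():.4f} to {jk.max():.4f}")
print(f"Every refit intercept above target: {bool((jk > TARGET_LB).all())}")
```

If every refit intercept stays above the target, the "unreachable" conclusion does not hinge on any single submission. It still assumes the linear trend holds all the way down to CV=0.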

In [None]:
# Analyze the 'mixall' kernel approach
print("="*60)
print("ANALYSIS: The 'mixall' Kernel CV Scheme")
print("="*60)

print("\nThe 'mixall' kernel OVERWRITES the utility functions:")
print("")
print("  def generate_leave_one_out_splits(X, Y):")
print("      groups = X['SOLVENT NAME']")
print("      n_splits = min(5, n_groups)")
print("      gkf = GroupKFold(n_splits=n_splits)")
print("      ...")
print("")
print("This means:")
print("  - Our CV: Leave-One-Solvent-Out (24 folds for single, 13 for mixtures)")
print("  - mixall CV: GroupKFold (5 folds for single, 5 for mixtures)")
print("")
print("Why this matters:")
print("  1. GroupKFold(5) trains on ~19 of 24 solvents per fold, Leave-One-Out(24) on 23")
print("  2. Each fold in GroupKFold(5) has ~5 solvents in test set")
print("  3. So each GroupKFold(5) model sees LESS training data, and its CV score is usually more pessimistic")
print("  4. Tuning against this harder split may still lead to better generalization to unseen solvents")
print("")
print("HOWEVER:")
print("  - The LB evaluation uses the OFFICIAL Leave-One-Out scheme")
print("  - So the mixall kernel's local CV doesn't match LB evaluation")
print("  - But their model may still generalize better due to training dynamics")
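
To make the difference between the two schemes concrete, here is a small sketch using scikit-learn's `GroupKFold` and `LeaveOneGroupOut` on synthetic data. The 24 solvent labels and the 10 rows per solvent are placeholders, not the competition data; only the column name `SOLVENT NAME` comes from the kernel snippet above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Synthetic stand-in: 24 hypothetical solvents with 10 rows each
n_solvents, rows_per_solvent = 24, 10
X = pd.DataFrame({
    "SOLVENT NAME": np.repeat([f"solvent_{i:02d}" for i in range(n_solvents)],
                              rows_per_solvent),
    "feature": np.arange(n_solvents * rows_per_solvent, dtype=float),
})
groups = X["SOLVENT NAME"]


def summarize(splitter, name):
    """Return (train solvents, test solvents) counts for each fold."""
    sizes = []
    for train_idx, test_idx in splitter.split(X, groups=groups):
        train_groups = set(groups.iloc[train_idx])
        test_groups = set(groups.iloc[test_idx])
        # A solvent must never appear on both sides of a split
        assert train_groups.isdisjoint(test_groups)
        sizes.append((len(train_groups), len(test_groups)))
    print(f"{name}: {len(sizes)} folds, (train, test) solvents per fold = {sorted(set(sizes))}")
    return sizes


logo_sizes = summarize(LeaveOneGroupOut(), "Leave-One-Solvent-Out")
gkf_sizes = summarize(GroupKFold(n_splits=5), "GroupKFold(5)")
```

With equal-sized groups, each GroupKFold(5) fold holds out 4 or 5 solvents and trains on the remaining 19 or 20, whereas leave-one-solvent-out always trains on 23.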

In [None]:
# What approaches have we exhausted?
print("="*60)
print("EXHAUSTED APPROACHES (26 consecutive failures)")
print("="*60)

exhausted = [
    "Higher GP weight (exp_031, exp_035)",
    "Pure GP (exp_032)",
    "Ridge regression (exp_033)",
    "Kernel Ridge (exp_034)",
    "Lower GP weight (exp_035)",
    "No GP (exp_036)",
    "Similarity weighting (exp_037)",
    "Minimal features (exp_038)",
    "Learned embeddings (exp_039)",
    "GNN architectures (exp_040, exp_052)",
    "ChemBERTa (exp_041)",
    "Calibration (exp_042)",
    "Nonlinear mixture (exp_043)",
    "Hybrid model (exp_044)",
    "Mean reversion (exp_045)",
    "Adaptive weighting (exp_046)",
    "Diverse ensemble (exp_047)",
    "Hybrid features (exp_048)",
    "Manual OOD handling (exp_049)",
    "LISA/REX (exp_050)",
    "Simpler model (exp_051)",
    "GNN proper (exp_052)",
    "Mixall full features (exp_053)",
    "Simpler regularized (exp_054)",
    "Chemical constraints (exp_055)",
    "XGBoost + RandomForest ensemble (exp_056)",
]

for i, approach in enumerate(exhausted, 1):
    print(f"  {i}. {approach}")

In [None]:
# What fundamentally different approaches remain?
print("="*60)
print("REMAINING APPROACHES (Not Yet Tried)")
print("="*60)

print("\n1. PREDICTION CALIBRATION (Isotonic Regression):")
print("   - Use isotonic regression to calibrate predictions")
print("   - This explicitly corrects systematic bias")
print("   - Could reduce the intercept in CV-LB relationship")
print("   - Different from exp_042 which used Platt scaling")

print("\n2. QUANTILE REGRESSION:")
print("   - Train model to predict quantiles instead of mean")
print("   - May produce more robust predictions")
print("   - Different loss function could change CV-LB relationship")

print("\n3. ASYMMETRIC LOSS:")
print("   - Penalize over-predictions differently from under-predictions")
print("   - May help with the systematic bias")

print("\n4. FOCAL LOSS:")
print("   - Focus on hard examples")
print("   - May help with outlier solvents like HFIP, Cyclohexane")

print("\n5. CATBOOST:")
print("   - Different gradient boosting implementation")
print("   - Handles categorical features natively")
print("   - May have different inductive biases")

print("\n6. STACKING META-LEARNER:")
print("   - Train a meta-learner on top of base model predictions")
print("   - Could learn to correct systematic biases")
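
For items 2 and 3, the loss functions themselves are short. The sketch below shows the pinball (quantile) loss and one possible asymmetric squared error in NumPy. The function names and the `over_weight` parameter are illustrative and not taken from any experiment in this log.

```python
import numpy as np


def pinball_loss(y_true, y_pred, q=0.5):
    """Quantile (pinball) loss; q=0.5 is half the mean absolute error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))


def asymmetric_mse(y_true, y_pred, over_weight=2.0):
    """Squared error where over-predictions cost `over_weight` times more."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    weights = np.where(err > 0, over_weight, 1.0)
    return float(np.mean(weights * err ** 2))


y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_over = y_true + 0.05    # every prediction too high
y_under = y_true - 0.05   # every prediction too low

# q=0.9 punishes under-prediction more than over-prediction
print(pinball_loss(y_true, y_over, q=0.9), pinball_loss(y_true, y_under, q=0.9))
# over_weight=2 punishes over-prediction more than under-prediction
print(asymmetric_mse(y_true, y_over), asymmetric_mse(y_true, y_under))
```

Which direction to penalize would depend on the sign of the systematic bias, which the CV-LB fit alone does not reveal.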

In [None]:
# Analyze the CV-LB gap for different experiments
print("="*60)
print("CV-LB GAP ANALYSIS")
print("="*60)

df['gap'] = df['lb'] / df['cv']
df['residual'] = df['lb'] - (slope * df['cv'] + intercept)

print("\nCV-LB Gap (LB/CV ratio):")
for _, row in df.iterrows():
    print(f"  {row['exp']}: CV={row['cv']:.4f}, LB={row['lb']:.4f}, Gap={row['gap']:.2f}x, Residual={row['residual']:.4f}")

print(f"\nBest residual (below regression line): {df.loc[df['residual'].idxmin(), 'exp']} ({df['residual'].min():.4f})")
print(f"Worst residual (above regression line): {df.loc[df['residual'].idxmax(), 'exp']} ({df['residual'].max():.4f})")

print(f"\nKey insight:")
print(f"  exp_000 has the best residual (-0.0022)")
print(f"  This means exp_000 performed BETTER on LB than expected from CV")
print(f"  What was different about exp_000?")
print(f"    - Baseline MLP with Arrhenius Kinetics + TTA")
print(f"    - Spange descriptors only (no DRFP, no ACS PCA)")
print(f"    - Simpler model")

In [None]:
# Strategic recommendations
print("="*60)
print("STRATEGIC RECOMMENDATIONS FOR LOOP 57")
print("="*60)

print("\n1. TRY PREDICTION CALIBRATION (Isotonic Regression):")
print("   - Train the best model (exp_030)")
print("   - Use CV predictions to fit isotonic regression")
print("   - Apply calibration to test predictions")
print("   - This explicitly corrects systematic bias")

print("\n2. TRY QUANTILE REGRESSION:")
print("   - Train model with quantile loss (e.g., median)")
print("   - May produce more robust predictions")
print("   - Different loss function could change CV-LB relationship")

print("\n3. TRY STACKING META-LEARNER:")
print("   - Train multiple base models (GP, MLP, LGBM)")
print("   - Use their predictions as features for a meta-learner")
print("   - Meta-learner could learn to correct systematic biases")

print("\n4. SUBMISSION STRATEGY:")
print("   - 5 submissions remaining")
print("   - Try 2-3 fundamentally different approaches")
print("   - Save 2 submissions for final attempts")
print("   - Focus on approaches that might change the CV-LB relationship")
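
Recommendation 1 could look roughly like the following. This is a sketch on synthetic numbers, assuming out-of-fold predictions from the exp_030 ensemble are available as an array. `oof_pred`, `y_true`, and `test_pred` are placeholder names, and the biased-prediction model used to generate them is invented for illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for targets and systematically biased OOF predictions
y_true = rng.uniform(0.0, 1.0, size=300)
oof_pred = 0.6 * y_true + 0.15 + rng.normal(0.0, 0.03, size=300)
test_pred = np.array([0.05, 0.30, 0.60, 0.95])  # hypothetical raw test predictions

# Fit a monotone map from raw prediction to target on OOF data only
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(oof_pred, y_true)

oof_calibrated = calibrator.predict(oof_pred)
test_calibrated = calibrator.predict(test_pred)

mse_raw = float(np.mean((oof_pred - y_true) ** 2))
mse_cal = float(np.mean((oof_calibrated - y_true) ** 2))
print(f"OOF MSE raw: {mse_raw:.5f}, calibrated: {mse_cal:.5f}")
print("Calibrated test predictions:", np.round(test_calibrated, 3))
```

The improvement printed here is in-sample and therefore optimistic; a fairer estimate would fit the calibrator inside each CV fold. Whether a calibration learned on local CV transfers to the LB data is the open question this experiment would test.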

In [None]:
# Final summary
print("="*60)
print("LOOP 57 SUMMARY")
print("="*60)

print("\nCurrent Status:")
print(f"  - Best CV: 0.008298 (exp_030)")
print(f"  - Best LB: 0.0877 (exp_030)")
print(f"  - Target LB: 0.0347")
print(f"  - Gap: 2.53x")
print(f"  - Submissions remaining: 5")
print(f"  - Consecutive failures: 26")

print("\nKey Findings:")
print("  1. CV-LB relationship: LB = 4.31*CV + 0.0525")
print("  2. Intercept (0.0525) > Target (0.0347) - target unreachable with current approach")
print("  3. exp_056 (XGBoost + RF) FAILED - CV 0.014233 (71.5% worse than exp_030's 0.008298)")
print("  4. The 'mixall' kernel uses GroupKFold(5) instead of Leave-One-Out(24)")
print("  5. exp_000 has the best residual - simpler model beat its CV-predicted LB, though its absolute LB (0.0982) is worse than exp_030's")

print("\nRecommended Next Steps:")
print("  1. Try prediction calibration (isotonic regression)")
print("  2. Try quantile regression")
print("  3. Try stacking meta-learner")
print("  4. Focus on approaches that might change the CV-LB relationship")
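
As a sketch of next step 3, a linear meta-learner can be fitted on out-of-fold predictions with plain least squares. The three base-model columns below are synthetic placeholders standing in for GP, MLP, and LGBM out-of-fold predictions; `fit_stacker` and `apply_stacker` are hypothetical helper names.

```python
import numpy as np


def fit_stacker(oof_preds, y):
    """Least-squares weights (plus intercept) over base-model OOF predictions."""
    design = np.column_stack([np.ones(len(y)), oof_preds])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # coef[0] is the intercept, coef[1:] the per-model weights


def apply_stacker(coef, preds):
    """Combine base-model predictions with fitted stacking coefficients."""
    return coef[0] + preds @ coef[1:]


rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=200)
# Three synthetic base models with different bias and noise levels
oof_preds = np.column_stack([
    y + rng.normal(0.0, 0.05, size=200),               # stand-in for GP
    0.8 * y + 0.1 + rng.normal(0.0, 0.04, size=200),   # stand-in for MLP
    y + 0.05 + rng.normal(0.0, 0.06, size=200),        # stand-in for LGBM
])

coef = fit_stacker(oof_preds, y)
stacked = apply_stacker(coef, oof_preds)

base_mse = ((oof_preds - y[:, None]) ** 2).mean(axis=0)
stacked_mse = float(((stacked - y) ** 2).mean())
print("Base-model OOF MSE:", np.round(base_mse, 5))
print(f"Stacked OOF MSE: {stacked_mse:.5f}")
```

The stacked score is computed on the same out-of-fold matrix the weights were fitted on, so it is optimistic. With only 24 solvents, an honest estimate would need a nested split.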