# Loop 50 Analysis: Submission Failure Investigation

**Problem:** exp_049 (CatBoost + XGBoost Ensemble) failed with "Evaluation metric raised an unexpected error"

**Root Cause:** For the full-data task, exp_049 built its CV folds from `RAMP NUM` (87 folds) instead of from solvent pairs (13 folds), which is how the official `generate_leave_one_ramp_out_splits` function defines them.

In [1]:
import pandas as pd
import numpy as np

# Load data
DATA_PATH = '/home/data'
df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
df_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print('Single solvent data:')
print(f'  Shape: {df_single.shape}')
print(f'  Unique solvents: {df_single["SOLVENT NAME"].nunique()}')
print()
print('Full data:')
print(f'  Shape: {df_full.shape}')
print(f'  Unique RAMP NUM: {df_full["RAMP NUM"].nunique()}')
print(f'  Unique solvent pairs: {df_full[["SOLVENT A NAME", "SOLVENT B NAME"]].drop_duplicates().shape[0]}')
print()
print('CRITICAL: The official CV uses solvent pairs (13 folds), not RAMP NUM (87 folds)!')

Single solvent data:
  Shape: (656, 13)
  Unique solvents: 24

Full data:
  Shape: (1227, 19)
  Unique RAMP NUM: 87
  Unique solvent pairs: 13

CRITICAL: The official CV uses solvent pairs (13 folds), not RAMP NUM (87 folds)!


In [2]:
# Check the official split function behavior
import sys
sys.path.insert(0, DATA_PATH)

# Manually implement the official split to understand it
INPUT_LABELS_FULL_SOLVENT = [
    "Residence Time", "Temperature", "SOLVENT A NAME", "SOLVENT B NAME", "SolventB%"
]
TARGET_LABELS = ["Product 2", "Product 3", "SM"]

X_full = df_full[INPUT_LABELS_FULL_SOLVENT]
Y_full = df_full[TARGET_LABELS]

# Count folds using official method
all_solvent_ramps = X_full[["SOLVENT A NAME", "SOLVENT B NAME"]].drop_duplicates()
all_solvent_ramps = all_solvent_ramps.sort_values(by=["SOLVENT A NAME", "SOLVENT B NAME"])

print('Official CV folds (solvent pairs):')
for i, (_, pair) in enumerate(all_solvent_ramps.iterrows()):
    mask = (X_full[["SOLVENT A NAME", "SOLVENT B NAME"]] == pair).all(axis=1)
    n_samples = mask.sum()
    print(f'  Fold {i}: {pair["SOLVENT A NAME"]} + {pair["SOLVENT B NAME"]} ({n_samples} samples)')

print(f'\nTotal folds: {len(all_solvent_ramps)}')

Official CV folds (solvent pairs):
  Fold 0: 1,1,1,3,3,3-Hexafluoropropan-2-ol + 2-Methyltetrahydrofuran [2-MeTHF] (124 samples)
  Fold 1: 2,2,2-Trifluoroethanol + Water.2,2,2-Trifluoroethanol (125 samples)
  Fold 2: 2-Methyltetrahydrofuran [2-MeTHF] + Diethyl Ether [Ether] (124 samples)
  Fold 3: Acetonitrile + Acetonitrile.Acetic Acid (125 samples)
  Fold 4: Cyclohexane + IPA [Propan-2-ol] (104 samples)
  Fold 5: DMA [N,N-Dimethylacetamide] + Decanol (110 samples)
  Fold 6: Dihydrolevoglucosenone (Cyrene) + Ethyl Acetate (36 samples)
  Fold 7: Ethanol + THF [Tetrahydrofuran] (127 samples)
  Fold 8: MTBE [tert-Butylmethylether] + Butanone [MEK] (34 samples)
  Fold 9: Methanol + Ethylene Glycol [1,2-Ethanediol] (122 samples)
  Fold 10: Methyl Propionate + Ethyl Lactate (35 samples)
  Fold 11: Water.Acetonitrile + Acetonitrile (125 samples)
  Fold 12: tert-Butanol [2-Methylpropan-2-ol] + Dimethyl Carbonate (36 samples)

Total folds: 13
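
The fold construction above can be wrapped in a small generator. The sketch below is a hand-written approximation of leave-one-solvent-pair-out splitting, not the official `generate_leave_one_ramp_out_splits` implementation. The helper name `leave_one_pair_out_splits` and the toy data are assumptions made for illustration only.

```python
import pandas as pd

PAIR_COLS = ["SOLVENT A NAME", "SOLVENT B NAME"]

def leave_one_pair_out_splits(X, Y, pair_cols=PAIR_COLS):
    """Yield ((X_train, Y_train), (X_test, Y_test)), one fold per unique solvent pair.

    Hypothetical helper for illustration; the official utility may differ in details.
    """
    pairs = X[pair_cols].drop_duplicates().sort_values(by=pair_cols)
    for _, pair in pairs.iterrows():
        # Rows whose (solvent A, solvent B) match the held-out pair form the test fold
        test_mask = (X[pair_cols] == pair).all(axis=1)
        yield (X[~test_mask], Y[~test_mask]), (X[test_mask], Y[test_mask])

# Toy data (made up): 3 solvent pairs spread over 5 ramps
toy = pd.DataFrame({
    "SOLVENT A NAME": ["Ethanol", "Ethanol", "Methanol", "Methanol", "Cyclohexane", "Cyclohexane"],
    "SOLVENT B NAME": ["THF", "THF", "Water", "Water", "IPA", "IPA"],
    "RAMP NUM": [1, 2, 3, 3, 4, 5],
    "SM": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
})
X_toy, Y_toy = toy[PAIR_COLS + ["RAMP NUM"]], toy[["SM"]]

folds = list(leave_one_pair_out_splits(X_toy, Y_toy))
n_pair_folds = len(folds)
n_ramp_folds = toy["RAMP NUM"].nunique()
print(f"folds by solvent pair: {n_pair_folds}, folds by RAMP NUM: {n_ramp_folds}")
```

Grouping by `RAMP NUM` produces more folds than grouping by solvent pair whenever a pair spans several ramps, which is the mismatch behind the 87 vs 13 fold counts.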


In [3]:
# Analyze CV-LB relationship from all submissions
import json

from scipy import stats

TARGET_LB = 0.0347  # leaderboard score we are trying to reach

with open('/home/code/session_state.json') as f:
    state = json.load(f)

submissions = state.get('submissions', [])
print('CV-LB Relationship Analysis:')
print('='*60)

# Keep only submissions that have both scores recorded.
# Compare against None explicitly so a legitimate score of 0.0 is not dropped.
cv_scores = []
lb_scores = []
for s in submissions:
    cv = s.get('cv_score')
    lb = s.get('lb_score')
    if cv is not None and lb is not None:
        cv_scores.append(cv)
        lb_scores.append(lb)
        print(f"{s.get('experiment_id')}: CV={cv:.6f}, LB={lb:.6f}")

# A linear fit needs at least a few points to be meaningful
if len(cv_scores) >= 3:
    slope, intercept, r_value, p_value, std_err = stats.linregress(cv_scores, lb_scores)
    best_cv = min(cv_scores)
    # Invert LB = slope * CV + intercept to get the CV needed for the target
    required_cv = (TARGET_LB - intercept) / slope
    print(f'\nLinear fit: LB = {slope:.2f} * CV + {intercept:.4f}')
    print(f'R-squared = {r_value**2:.4f}')
    print(f'Intercept = {intercept:.4f}')
    print(f'Target = {TARGET_LB}')
    print(f'\nTo hit target {TARGET_LB}:')
    print(f'  Required CV = {required_cv:.6f}')
    print(f'  Current best CV = {best_cv:.6f}')
    print(f'  Gap = {best_cv - required_cv:.6f}')

CV-LB Relationship Analysis:
============================================================
exp_000: CV=0.011081, LB=0.098160
exp_001: CV=0.012297, LB=0.106490
exp_003: CV=0.010501, LB=0.097190
exp_005: CV=0.010430, LB=0.096910
exp_006: CV=0.009749, LB=0.094570
exp_007: CV=0.009262, LB=0.093160
exp_009: CV=0.009192, LB=0.093640
exp_012: CV=0.009004, LB=0.091340
exp_024: CV=0.008689, LB=0.089290
exp_026: CV=0.008465, LB=0.088750
exp_030: CV=0.008298, LB=0.087720
exp_035: CV=0.009825, LB=0.096960



Linear fit: LB = 4.29 * CV + 0.0528
R-squared = 0.9523
Intercept = 0.0528
Target = 0.0347

To hit target 0.0347:
  Required CV = -0.004218
  Current best CV = 0.008298
  Gap = 0.012516
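
The negative required CV follows directly from the fit: the intercept (about 0.0528) already exceeds the target (0.0347), so solving LB = slope * CV + intercept for CV gives a value below zero. The sketch below re-derives this from the (CV, LB) pairs printed above, without `session_state.json` or scipy. The names `fit_line` and `TARGET_LB` are illustrative, not part of the original notebook.

```python
# (CV, LB) pairs copied from the printed output above
scores = [
    (0.011081, 0.098160), (0.012297, 0.106490), (0.010501, 0.097190),
    (0.010430, 0.096910), (0.009749, 0.094570), (0.009262, 0.093160),
    (0.009192, 0.093640), (0.009004, 0.091340), (0.008689, 0.089290),
    (0.008465, 0.088750), (0.008298, 0.087720), (0.009825, 0.096960),
]
TARGET_LB = 0.0347

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(scores)

# Invert the fitted line to find the CV that would map to the target LB
required_cv = (TARGET_LB - intercept) / slope

# A CV error cannot be negative, so a negative value means the target
# lies below what this trend line can reach
reachable_on_this_line = required_cv >= 0

print(f"slope={slope:.2f}, intercept={intercept:.4f}, required CV={required_cv:.6f}")
print(f"reachable on this trend line: {reachable_on_this_line}")
```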


In [4]:
# Key insight: The intercept problem
print('KEY INSIGHT: The Intercept Problem')
print('='*60)
print()
print('The linear CV-LB fit has an intercept of ~0.053.')
print('This means even with CV=0 (a perfect local CV score), the fit predicts LB ~0.053.')
print()
print('The target (0.0347) is BELOW this intercept!')
print('On this trend line, the target is unreachable with current approaches.')
print()
print('HOWEVER, the target IS reachable because:')
print('1. The benchmark achieved MSE 0.0039 (much better than our best)')
print('2. Top Kaggle kernels may have different CV-LB relationships')
print('3. The intercept represents distribution shift that CAN be addressed')
print()
print('STRATEGIES TO REDUCE THE INTERCEPT:')
print('1. Use the EXACT official CV scheme (not custom implementations)')
print('2. Study top kernels that actually score well on LB')
print('3. Focus on approaches that generalize to unseen solvents')
print('4. Consider domain adaptation techniques')

KEY INSIGHT: The Intercept Problem
============================================================

The linear CV-LB fit has an intercept of ~0.053.
This means even with CV=0 (a perfect local CV score), the fit predicts LB ~0.053.

The target (0.0347) is BELOW this intercept!
On this trend line, the target is unreachable with current approaches.

HOWEVER, the target IS reachable because:
1. The benchmark achieved MSE 0.0039 (much better than our best)
2. Top Kaggle kernels may have different CV-LB relationships
3. The intercept represents distribution shift that CAN be addressed

STRATEGIES TO REDUCE THE INTERCEPT:
1. Use the EXACT official CV scheme (not custom implementations)
2. Study top kernels that actually score well on LB
3. Focus on approaches that generalize to unseen solvents
4. Consider domain adaptation techniques


In [5]:
# Check what the top kernels actually do
print('Top Kernel Analysis:')
print('='*60)
print()
print('From our research, the top kernels use:')
print('1. ens-model: CatBoost + XGBoost ensemble with official CV')
print('2. mixall: GroupKFold (5 splits) - DIFFERENT from official CV!')
print('3. Arrhenius kinetics features')
print('4. Combined feature tables (spange + acs + drfps + fragprints)')
print()
print('CRITICAL: The mixall kernel OVERWRITES the utility functions!')
print('This means their local CV is NOT comparable to ours.')
print()
print('The ens-model kernel uses the STANDARD official CV.')
print('Our exp_049 implementation was correct for single solvents,')
print('but WRONG for full data (used RAMP NUM instead of solvent pairs).')

Top Kernel Analysis:
============================================================

From our research, the top kernels use:
1. ens-model: CatBoost + XGBoost ensemble with official CV
2. mixall: GroupKFold (5 splits) - DIFFERENT from official CV!
3. Arrhenius kinetics features
4. Combined feature tables (spange + acs + drfps + fragprints)

CRITICAL: The mixall kernel OVERWRITES the utility functions!
This means their local CV is NOT comparable to ours.

The ens-model kernel uses the STANDARD official CV.
Our exp_049 implementation was correct for single solvents,
but WRONG for full data (used RAMP NUM instead of solvent pairs).
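
To illustrate why local CV scores from the two schemes are not comparable, the sketch below contrasts scikit-learn's `GroupKFold(n_splits=5)` with `LeaveOneGroupOut` on made-up group labels. That the mixall kernel uses scikit-learn's `GroupKFold` in exactly this way is an assumption, and the 13 groups here are synthetic stand-ins for solvent pairs.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Made-up data: 13 groups (standing in for solvent pairs), 4 samples each
groups = np.repeat(np.arange(13), 4)
X = np.zeros((len(groups), 1))  # feature values are irrelevant for splitting

gkf_folds = list(GroupKFold(n_splits=5).split(X, groups=groups))
logo_folds = list(LeaveOneGroupOut().split(X, groups=groups))

# GroupKFold holds out several groups per fold; LeaveOneGroupOut holds out exactly one
gkf_groups_per_fold = [len(np.unique(groups[test])) for _, test in gkf_folds]
logo_groups_per_fold = [len(np.unique(groups[test])) for _, test in logo_folds]

print(f"GroupKFold: {len(gkf_folds)} folds, held-out groups per fold: {gkf_groups_per_fold}")
print(f"LeaveOneGroupOut: {len(logo_folds)} folds, held-out groups per fold: {logo_groups_per_fold}")
```

With 5 grouped folds, each model is validated on several held-out groups at once and trained on fewer groups than in a leave-one-group-out scheme, so the resulting CV numbers measure a different thing.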


In [None]:
# Plan for next experiment
print('PLAN FOR NEXT EXPERIMENT (exp_050):')
print('='*60)
print()
print('1. FIX THE BUG: Use official generate_leave_one_ramp_out_splits')
print('   - This uses solvent pairs (13 folds), not RAMP NUM (87 folds)')
print()
print('2. KEEP THE MODEL: CatBoost + XGBoost ensemble is promising')
print('   - CV 0.008092 for single solvents is our best yet')
print()
print('3. VERIFY SUBMISSION FORMAT:')
print('   - task=0 for single solvents (24 folds)')
print('   - task=1 for full data (13 folds, NOT 87!)')
print()
print('4. EXPECTED OUTCOME:')
print('   - If CV-LB relationship holds: LB = 4.29 * 0.008 + 0.053 = 0.087')
print('   - This would be similar to exp_030 (LB 0.0877)')
print('   - The fix should allow the submission to score properly')
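
Step 3 of the plan can be turned into an automatic check that runs before submitting. The sketch below assumes the submission is a DataFrame with integer columns named `task` and `fold`; those column names, the helper `check_fold_counts`, and the example frames are hypothetical, and the real submission schema may differ.

```python
import pandas as pd

# Expected fold counts taken from the plan above: task -> number of CV folds
EXPECTED_FOLDS = {0: 24, 1: 13}

def check_fold_counts(submission, expected=EXPECTED_FOLDS):
    """Return a list of problems; an empty list means the fold counts look right.

    ASSUMPTION: `submission` has integer columns named `task` and `fold`.
    The real submission schema may use different column names.
    """
    problems = []
    for task, n_expected in expected.items():
        n_found = submission.loc[submission["task"] == task, "fold"].nunique()
        if n_found != n_expected:
            problems.append(f"task={task}: expected {n_expected} folds, found {n_found}")
    return problems

# Made-up frame reproducing the exp_049 mistake: 87 folds for task 1
bad = pd.DataFrame({
    "task": [0] * 24 + [1] * 87,
    "fold": list(range(24)) + list(range(87)),
})
# Made-up frame with the intended fold counts
good = pd.DataFrame({
    "task": [0] * 24 + [1] * 13,
    "fold": list(range(24)) + list(range(13)),
})

bad_problems = check_fold_counts(bad)
good_problems = check_fold_counts(good)
print(bad_problems)
print(good_problems)
```

Failing fast on a wrong fold count locally is cheaper than spending a submission to find out that the evaluation metric raises an error.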