# Loop 59 Analysis: Understanding the CV-LB Gap and Path Forward

## Key Questions:
1. Why did Spange-only (exp_058) score 37.5% worse than the best CV?
2. What is the true CV-LB relationship?
3. What approaches haven't been tried that could reduce the gap?
4. What is the path to reaching the target of 0.0347?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982, 'name': 'Baseline MLP (Spange only)'},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065, 'name': 'LightGBM'},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972, 'name': 'Combined Spange+DRFP'},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969, 'name': 'Large Ensemble'},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946, 'name': 'Simpler Model'},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932, 'name': 'Even Simpler'},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936, 'name': 'Ridge Regression'},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913, 'name': 'Simple Ensemble'},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893, 'name': 'ACS PCA Fixed'},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887, 'name': 'Weighted Loss'},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877, 'name': 'GP Ensemble'},
    {'exp': 'exp_041', 'cv': 0.0090, 'lb': 0.0932, 'name': 'Aggressive Regularization'},
    {'exp': 'exp_042', 'cv': 0.0145, 'lb': 0.1147, 'name': 'GroupKFold CV'},
]

df = pd.DataFrame(submissions)
print('=== Submission History ===')
print(df.to_string(index=False))

In [None]:
# Analyze CV-LB relationship
from scipy import stats

cv = df['cv'].values
lb = df['lb'].values

slope, intercept, r_value, p_value, std_err = stats.linregress(cv, lb)

print(f'\n=== CV-LB Relationship ===')
print(f'LB = {slope:.4f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'\nInterpretation:')
print(f'- Intercept: {intercept:.4f}')
print(f'- Target: 0.0347')
print(f'- Gap: {intercept - 0.0347:.4f} ({(intercept - 0.0347)/0.0347*100:.1f}% above target)')
print(f'\nTo reach target 0.0347:')
required_cv = (0.0347 - intercept) / slope
print(f'- Required CV: {required_cv:.6f}')
if required_cv < 0:
    print(f'- IMPOSSIBLE: Required CV is negative!')
    print(f'- Even with CV=0, predicted LB would be {intercept:.4f}')
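
The conclusion above rests on the fitted intercept exceeding the target of 0.0347, so it helps to check how tightly 13 points pin that intercept down. The following is a minimal sketch, assuming the standard OLS t-interval is appropriate for this data; `intercept_ci` is a hypothetical helper, and the CV/LB values are copied from the submission table.

```python
import numpy as np
from scipy import stats

# CV/LB pairs copied from the submission history table
cv_scores = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093, 0.0092,
                      0.0090, 0.0087, 0.0085, 0.0083, 0.0090, 0.0145])
lb_scores = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932, 0.0936,
                      0.0913, 0.0893, 0.0887, 0.0877, 0.0932, 0.1147])


def intercept_ci(x, y, level=0.95):
    """Hypothetical helper: OLS fit plus a t-based confidence interval for the intercept."""
    n = len(x)
    fit = stats.linregress(x, y)
    resid = y - (fit.slope * x + fit.intercept)
    s2 = np.sum(resid ** 2) / (n - 2)                      # residual variance
    sxx = np.sum((x - x.mean()) ** 2)
    se_b = np.sqrt(s2 * (1.0 / n + x.mean() ** 2 / sxx))   # std. error of the intercept
    t_crit = stats.t.ppf(0.5 + level / 2.0, df=n - 2)
    return fit.intercept, fit.intercept - t_crit * se_b, fit.intercept + t_crit * se_b


b, lo, hi = intercept_ci(cv_scores, lb_scores)
print(f'Intercept: {b:.4f}, 95% CI: [{lo:.4f}, {hi:.4f}]')
print(f'Target 0.0347 lies below the interval: {0.0347 < lo}')
```

If the target sits below the whole interval, the "intercept > target" claim is unlikely to be an artifact of noise in the fit, although the interval says nothing about models that fall off this line.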

In [None]:
# Visualize CV-LB relationship
plt.figure(figsize=(10, 6))
plt.scatter(cv, lb, s=100, c='blue', alpha=0.7, label='Submissions')

# Regression line
cv_range = np.linspace(0, max(cv)*1.1, 100)
lb_pred = slope * cv_range + intercept
plt.plot(cv_range, lb_pred, 'r--', label=f'LB = {slope:.2f}*CV + {intercept:.4f}')

# Target line
plt.axhline(y=0.0347, color='green', linestyle='-', linewidth=2, label='Target (0.0347)')

# Best CV and its predicted LB on the regression line
best_cv = 0.008194
best_cv_pred_lb = slope * best_cv + intercept
plt.scatter([best_cv], [best_cv_pred_lb], s=200, c='orange', marker='*',
            label=f'Best CV ({best_cv}) -> Predicted LB: {best_cv_pred_lb:.4f}')

plt.xlabel('CV Score')
plt.ylabel('LB Score')
plt.title('CV vs LB Relationship')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('/home/code/exploration/cv_lb_relationship.png', dpi=150)
plt.show()

print(f'\nBest CV: 0.008194 -> Predicted LB: {slope*0.008194 + intercept:.4f}')
print(f'Best LB achieved: 0.0877')
print(f'Target: 0.0347')

In [None]:
# Analyze generalization residuals
df['predicted_lb'] = slope * df['cv'] + intercept
df['residual'] = df['lb'] - df['predicted_lb']
df['residual_pct'] = df['residual'] / df['predicted_lb'] * 100

print('=== Generalization Residuals ===')
print('Negative residual = better generalization than expected')
print(df[['exp', 'name', 'cv', 'lb', 'predicted_lb', 'residual', 'residual_pct']].sort_values('residual').to_string(index=False))

print(f'\n=== Key Insights ===')
best_gen = df.loc[df['residual'].idxmin()]
print(f'Best generalization: {best_gen["exp"]} ({best_gen["name"]})')
print(f'  - Residual: {best_gen["residual"]:.4f} ({best_gen["residual_pct"]:.1f}%)')
print(f'  - CV: {best_gen["cv"]:.4f}, LB: {best_gen["lb"]:.4f}')
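
exp_042 was scored with GroupKFold(5) rather than Leave-One-Out and sits far from the other points on the CV axis, so it is worth checking how much any single submission drives the fitted line. This rough sketch refits the line after dropping each point in turn; `drop_one_fits` is a hypothetical helper, and the values are copied from the submission table.

```python
import numpy as np
from scipy import stats

exps = ['exp_000', 'exp_001', 'exp_003', 'exp_005', 'exp_006', 'exp_007', 'exp_009',
        'exp_012', 'exp_024', 'exp_026', 'exp_030', 'exp_041', 'exp_042']
cv_scores = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093, 0.0092,
                      0.0090, 0.0087, 0.0085, 0.0083, 0.0090, 0.0145])
lb_scores = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932, 0.0936,
                      0.0913, 0.0893, 0.0887, 0.0877, 0.0932, 0.1147])


def drop_one_fits(names, x, y):
    """Hypothetical helper: refit LB ~ CV with each submission left out in turn."""
    rows = []
    for i, name in enumerate(names):
        keep = np.arange(len(x)) != i          # boolean mask excluding point i
        fit = stats.linregress(x[keep], y[keep])
        rows.append((name, fit.slope, fit.intercept))
    return rows


full = stats.linregress(cv_scores, lb_scores)
loo_rows = drop_one_fits(exps, cv_scores, lb_scores)
print(f'All points: slope={full.slope:.3f}, intercept={full.intercept:.4f}')
for name, s, b in loo_rows:
    print(f'Without {name}: slope={s:.3f}, intercept={b:.4f} '
          f'(shift {b - full.intercept:+.4f})')
```

A large shift for one row would mean the line is driven by that submission rather than by the set as a whole.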

In [None]:
# What would it take to reach the target?
print('=== Path to Target 0.0347 ===')
print(f'\nCurrent best LB: 0.0877')
print(f'Target: 0.0347')
print(f'Gap: {0.0877 - 0.0347:.4f} ({(0.0877 - 0.0347)/0.0347*100:.1f}% above target)')

print(f'\n=== Options to Reach Target ===')
print(f'\n1. Reduce CV (current approach):')
print(f'   - Current best CV: 0.008194')
print(f'   - Required CV: {required_cv:.6f} (IMPOSSIBLE - negative)')
print(f'   - This approach CANNOT reach target')

print(f'\n2. Reduce intercept (improve generalization):')
print(f'   - Current intercept: {intercept:.4f}')
print(f'   - Required intercept: < 0.0347 (to have any chance)')
print(f'   - Need to reduce intercept by: {intercept - 0.0347:.4f}')

print(f'\n3. Reduce slope (make LB less sensitive to CV):')
print(f'   - Current slope: {slope:.4f}')
print(f'   - If intercept stays the same, reaching target needs slope < 0 (IMPOSSIBLE)')

print(f'\n4. Change the CV-LB relationship fundamentally:')
print(f'   - The current relationship is highly predictable (R²={r_value**2:.4f})')
print(f'   - Need a fundamentally different approach that breaks this relationship')
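
The arithmetic behind options 1 to 3 follows directly from the fitted line. The symbols $a$ (slope), $b$ (intercept), and $T$ (target) are notation introduced here for illustration only:

```latex
\mathrm{LB} = a \cdot \mathrm{CV} + b
\quad\Longrightarrow\quad
\mathrm{CV}_{\text{required}} = \frac{T - b}{a}, \qquad T = 0.0347

\text{For } a > 0: \qquad \mathrm{CV}_{\text{required}} < 0 \iff b > T
```

Because CV is an error and cannot be negative, a positive slope with $b > T$ leaves two levers: lowering $b$, or finding a model that does not lie on this line.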

In [None]:
# Analyze what the "mixall" kernel does differently
print('=== Analysis of "mixall" Kernel Approach ===')
print(f'\nKey difference: Uses GroupKFold(5) instead of Leave-One-Out')
print(f'\nLeave-One-Out CV:')
print(f'  - Single solvent: 24 folds (one per solvent)')
print(f'  - Full data: 13 folds (one per ramp)')
print(f'  - Total: 37 folds')
print(f'\nGroupKFold(5) CV:')
print(f'  - Single solvent: 5 folds')
print(f'  - Full data: 5 folds')
print(f'  - Total: 10 folds')

print(f'\nImplications:')
print(f'  - GroupKFold has LESS training data per fold (80% vs ~96%)')
print(f'  - GroupKFold has LARGER test sets per fold')
print(f'  - GroupKFold may give HIGHER (worse) CV scores but BETTER generalization')
print(f'\nWe tested GroupKFold in exp_042:')
print(f'  - CV: 0.0145 (77% worse than best)')
print(f'  - LB: 0.1147 (31% worse than best)')
print(f'  - This did NOT help!')
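
The fold counts and training fractions quoted above can be reproduced on a synthetic stand-in for the single-solvent data. This sketch assumes 24 solvents with equal row counts (the real group sizes are not shown in this notebook) and uses scikit-learn's `LeaveOneGroupOut` as the analogue of the leave-one-solvent-out scheme; `mean_train_fraction` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Synthetic stand-in: 24 solvents with an equal number of rows each (assumption)
n_solvents, rows_per_solvent = 24, 10
groups = np.repeat(np.arange(n_solvents), rows_per_solvent)
X = np.zeros((len(groups), 1))  # features are irrelevant for counting folds


def mean_train_fraction(splitter, X, groups):
    """Return (number of folds, average share of rows used for training)."""
    fractions = [len(train_idx) / len(groups)
                 for train_idx, _ in splitter.split(X, groups=groups)]
    return len(fractions), float(np.mean(fractions))


logo_folds, logo_frac = mean_train_fraction(LeaveOneGroupOut(), X, groups)
gkf_folds, gkf_frac = mean_train_fraction(GroupKFold(n_splits=5), X, groups)

print(f'LeaveOneGroupOut: {logo_folds} folds, {logo_frac:.1%} of rows in training')
print(f'GroupKFold(5):    {gkf_folds} folds, {gkf_frac:.1%} of rows in training')
```

With unequal group sizes the per-fold fractions vary, but the qualitative gap between the two schemes stays the same.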

In [None]:
# What approaches haven't been tried?
print('=== Unexplored Approaches ===')

print(f'\n1. DIFFERENT ENSEMBLE ARCHITECTURES:')
print(f'   - Current best: GP(0.15) + MLP(0.55) + LGBM(0.30)')
print(f'   - NOT tried: Stacking with neural network meta-learner')
print(f'   - NOT tried: Blending with learned weights per sample')

print(f'\n2. DIFFERENT FEATURE ENGINEERING:')
print(f'   - Tried: Spange, DRFP, ACS PCA, RDKit, fragprints')
print(f'   - NOT tried: Learned embeddings from scratch')
print(f'   - NOT tried: Contrastive learning for solvent embeddings')

print(f'\n3. DIFFERENT TRAINING STRATEGIES:')
print(f'   - Tried: Standard training, regularization, per-target')
print(f'   - NOT tried: Meta-learning (MAML, Reptile)')
print(f'   - NOT tried: Domain adaptation techniques')

print(f'\n4. DIFFERENT MODEL ARCHITECTURES:')
print(f'   - Tried: MLP, LGBM, XGBoost, CatBoost, GP, GNN, ChemBERTa')
print(f'   - NOT tried: Transformer for tabular data')
print(f'   - NOT tried: TabNet')

print(f'\n5. POST-PROCESSING:')
print(f'   - Tried: Physical constraint normalization')
print(f'   - NOT tried: Calibration (Platt scaling, isotonic regression)')
print(f'   - NOT tried: Conformal prediction')
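
Of the untried post-processing ideas, isotonic regression carries over most directly to a regression target, since Platt scaling is defined for classifier scores. A minimal sketch on synthetic data follows, assuming targets in [0, 1] and predictions that are compressed toward the middle. None of these numbers come from the competition data.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for out-of-fold predictions (assumption):
# targets in [0, 1], predictions shrunk toward the middle plus noise
y_true = rng.uniform(0.0, 1.0, size=500)
y_pred = 0.2 + 0.6 * y_true + rng.normal(0.0, 0.05, size=500)

# Fit a monotone map from raw prediction to target
calibrator = IsotonicRegression(out_of_bounds='clip')
calibrator.fit(y_pred, y_true)
y_cal = calibrator.predict(y_pred)

mse_raw = float(np.mean((y_pred - y_true) ** 2))
mse_cal = float(np.mean((y_cal - y_true) ** 2))
print(f'MSE raw: {mse_raw:.5f}, MSE after isotonic calibration: {mse_cal:.5f}')
```

The in-sample MSE shown here is optimistic by construction. On real data the calibrator would need to be fit on out-of-fold predictions and scored on folds it has not seen.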

In [None]:
# Critical insight: The target IS reachable
print('=== CRITICAL INSIGHT ===')
print(f'\nThe target (0.0347) IS reachable because:')
print(f'1. The GNN benchmark achieved MSE 0.0039 (much better than target)')
print(f'2. Other competitors may have found approaches we haven\'t tried')
print(f'3. The CV-LB relationship is based on OUR experiments, not the true relationship')

print(f'\nThe problem is NOT that the target is unreachable.')
print(f'The problem is that OUR APPROACH has a fundamental limitation.')

print(f'\n=== What We Need ===')
print(f'1. A fundamentally different approach that changes the CV-LB relationship')
print(f'2. OR a way to significantly reduce the intercept (improve generalization)')
print(f'3. OR a completely different model architecture that captures different patterns')

print(f'\n=== Remaining Submissions: 3 ===')
print(f'We should NOT submit unless we have a fundamentally different approach.')
print(f'The current best (CV 0.008194) would give LB ~0.088, still far from target.')

In [None]:
# Summary and recommendations
print('=== SUMMARY AND RECOMMENDATIONS ===')

print(f'\n1. CURRENT SITUATION:')
print(f'   - Best CV: 0.008194 (exp_030)')
print(f'   - Best LB: 0.0877 (exp_030)')
print(f'   - Target: 0.0347')
print(f'   - Gap: 153% above target')

print(f'\n2. CV-LB RELATIONSHIP:')
print(f'   - LB = {slope:.2f}*CV + {intercept:.4f} (R²={r_value**2:.4f})')
print(f'   - Intercept ({intercept:.4f}) > Target (0.0347)')
print(f'   - CANNOT reach target by CV minimization alone')

print(f'\n3. FAILED APPROACHES:')
print(f'   - Simpler features (Spange only): 37.5% worse CV')
print(f'   - GroupKFold CV: 77% worse CV, 31% worse LB')
print(f'   - GNN: 72-266% worse CV')
print(f'   - ChemBERTa: 137-309% worse CV')

print(f'\n4. RECOMMENDED NEXT STEPS:')
print(f'   a. Try calibration techniques (Platt scaling, isotonic regression)')
print(f'   b. Try domain adaptation (adversarial training)')
print(f'   c. Try meta-learning (MAML, Reptile)')
print(f'   d. Try TabNet or other tabular transformers')
print(f'   e. Try learned embeddings with contrastive loss')

print(f'\n5. SUBMISSION STRATEGY:')
print(f'   - 3 submissions remaining')
print(f'   - DO NOT submit unless fundamentally different approach')
print(f'   - Save at least 1 for final attempt')