# Loop 37 Analysis: Diagnosing the CV-LB Gap

**Current State:**
- Best CV: 0.008194 (exp_032)
- Best LB: 0.0877 (exp_030)
- Target: 0.0347
- CV-LB relationship: LB = 4.27×CV + 0.0527 (R²=0.967)

**Key Problem:** The intercept (0.0527) is 1.52× the target (0.0347), i.e. about 52% higher. If the linear fit still holds when extrapolated to CV=0, the predicted LB would be 0.0527, which is above the target.

**This analysis will:**
1. Examine the CV-LB relationship in detail
2. Identify what's causing the intercept
3. Explore approaches that could reduce the intercept

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# All submissions data
submissions = [
    {'exp': 'exp_000', 'cv': 0.011081, 'lb': 0.09816},
    {'exp': 'exp_001', 'cv': 0.012297, 'lb': 0.10649},
    {'exp': 'exp_003', 'cv': 0.010501, 'lb': 0.09719},
    {'exp': 'exp_005', 'cv': 0.010430, 'lb': 0.09691},
    {'exp': 'exp_006', 'cv': 0.009749, 'lb': 0.09457},
    {'exp': 'exp_007', 'cv': 0.009262, 'lb': 0.09316},
    {'exp': 'exp_009', 'cv': 0.009192, 'lb': 0.09364},
    {'exp': 'exp_012', 'cv': 0.009004, 'lb': 0.09134},
    {'exp': 'exp_024', 'cv': 0.008689, 'lb': 0.08929},
    {'exp': 'exp_026', 'cv': 0.008465, 'lb': 0.08875},
    {'exp': 'exp_030', 'cv': 0.008298, 'lb': 0.08772},
]

df = pd.DataFrame(submissions)
print(f'Total submissions: {len(df)}')
print(df)

In [None]:
# Fit linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df['cv'], df['lb'])
print(f'\nCV-LB Relationship:')
print(f'  LB = {slope:.2f} × CV + {intercept:.4f}')
print(f'  R² = {r_value**2:.4f}')
print(f'  Intercept = {intercept:.4f}')
print(f'  Target = 0.0347')
print(f'  Intercept / Target = {intercept / 0.0347:.2f}x')

# What CV would we need to reach target?
cv_needed = (0.0347 - intercept) / slope
print(f'\nTo reach target LB = 0.0347:')
print(f'  CV needed = {cv_needed:.6f}')
if cv_needed < 0:
    print('  Not reachable by lowering CV alone under this fit (would need negative CV)')
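
Because CV=0 lies well outside the observed CV range (roughly 0.0083 to 0.0123), the intercept is an extrapolation, so it helps to see how uncertain it is. The sketch below computes a 95% confidence interval for the intercept from the same 11 submissions. It assumes the usual linear-regression conditions (the trend stays linear down to CV=0, residuals are independent and roughly Gaussian), none of which are verified here. It relies on the `intercept_stderr` attribute of the `linregress` result, available in SciPy 1.6 and later; `TARGET_LB` and the `ci_*` names are illustrative.

```python
import numpy as np
from scipy import stats

# CV and LB scores copied from the submissions list in this notebook
cv = np.array([0.011081, 0.012297, 0.010501, 0.010430, 0.009749, 0.009262,
               0.009192, 0.009004, 0.008689, 0.008465, 0.008298])
lb = np.array([0.09816, 0.10649, 0.09719, 0.09691, 0.09457, 0.09316,
               0.09364, 0.09134, 0.08929, 0.08875, 0.08772])
TARGET_LB = 0.0347

fit = stats.linregress(cv, lb)

# Two-sided 95% interval using Student's t with n - 2 degrees of freedom
dof = len(cv) - 2
t_crit = stats.t.ppf(0.975, dof)
ci_low = fit.intercept - t_crit * fit.intercept_stderr
ci_high = fit.intercept + t_crit * fit.intercept_stderr

print(f'Intercept = {fit.intercept:.4f} (95% CI: {ci_low:.4f} to {ci_high:.4f})')
print(f'Target inside interval: {ci_low <= TARGET_LB <= ci_high}')
```

If the target falls below the lower bound, the "intercept above target" conclusion is unlikely to be an artifact of fit noise, though it still depends on the linear trend continuing all the way to CV=0.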

In [None]:
# Plot CV-LB relationship
plt.figure(figsize=(10, 6))
plt.scatter(df['cv'], df['lb'], s=100, c='blue', alpha=0.7, label='Submissions')

# Fit line
cv_range = np.linspace(0, 0.015, 100)
lb_pred = slope * cv_range + intercept
plt.plot(cv_range, lb_pred, 'r--', label=f'LB = {slope:.2f}×CV + {intercept:.4f}')

# Target line
plt.axhline(y=0.0347, color='green', linestyle=':', linewidth=2, label='Target LB = 0.0347')

# Intercept point
plt.scatter([0], [intercept], s=150, c='red', marker='x', label=f'Intercept = {intercept:.4f}')

plt.xlabel('CV Score (MSE)')
plt.ylabel('LB Score (MSE)')
plt.title('CV-LB Relationship: The Intercept Problem')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('/home/code/exploration/loop37_cv_lb.png', dpi=100, bbox_inches='tight')
plt.show()

print(f'\nKey Insight: The intercept ({intercept:.4f}) is {intercept/0.0347:.2f}x the target (0.0347).')
print(f'Extrapolating the fit to CV=0 still gives LB = {intercept:.4f}, which is above the target.')

In [None]:
# Analyze what's causing the intercept
# The intercept represents the "base error" that doesn't depend on CV
# This is likely due to:
# 1. Distribution shift between train and test solvents
# 2. Systematic overfitting to training solvents
# 3. Features that work well on training but not test solvents

print('=== Analyzing the Intercept ===')
print()
print(f'The intercept ({intercept:.4f}) represents the "base error" that persists regardless of CV.')
print('This is likely caused by:')
print('  1. Distribution shift between train and test solvents')
print('  2. Systematic overfitting to training solvents')
print('  3. Features that work well on training but not test solvents')
print()
print('To reduce the intercept, we need approaches that:')
print('  - Generalize better to unseen solvents')
print('  - Are less sensitive to solvent-specific patterns')
print('  - Use more robust features')

In [None]:
# What approaches have been tried?
approaches = [
    ('exp_000', 'MLP baseline', 0.011081, 0.09816),
    ('exp_001', 'LightGBM', 0.012297, 0.10649),
    ('exp_003', 'Combined features', 0.010501, 0.09719),
    ('exp_006', 'Simpler MLP [64,32]', 0.009749, 0.09457),
    ('exp_007', 'Even simpler [32,16]', 0.009262, 0.09316),
    ('exp_012', 'Simple ensemble', 0.009004, 0.09134),
    ('exp_024', 'ACS PCA features', 0.008689, 0.08929),
    ('exp_026', 'Weighted loss', 0.008465, 0.08875),
    ('exp_030', 'GP+MLP+LGBM', 0.008298, 0.08772),
]

print('=== Approaches Tried ===')
for exp, name, cv, lb in approaches:
    # Use a separate name so the lb_pred array from the plotting cell is not overwritten
    lb_expected = slope * cv + intercept
    residual = lb - lb_expected
    print(f'{exp}: {name}')
    print(f'  CV={cv:.6f}, LB={lb:.5f}, Predicted={lb_expected:.5f}, Residual={residual:+.5f}')
    print()

In [None]:
# Calculate residuals to see if any approach beats the linear relationship
df['lb_pred'] = slope * df['cv'] + intercept
df['residual'] = df['lb'] - df['lb_pred']

print('=== Residual Analysis ===')
print('Positive residual = worse than expected')
print('Negative residual = better than expected')
print()
print(df[['exp', 'cv', 'lb', 'lb_pred', 'residual']].sort_values('residual'))

print(f'\nBest residual: {df["residual"].min():.5f} ({df.loc[df["residual"].idxmin(), "exp"]})')
print(f'Worst residual: {df["residual"].max():.5f} ({df.loc[df["residual"].idxmax(), "exp"]})')
print(f'Mean residual: {df["residual"].mean():.5f}')
print(f'Std residual: {df["residual"].std():.5f}')
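
With only 11 points, a single submission can move the fitted line noticeably. As a rough robustness check alongside the residuals, the sketch below refits the line with each submission left out in turn and reports the range of intercepts. This jackknife-style check is added for illustration and is not part of the original analysis; the variable names are illustrative.

```python
import numpy as np

# Experiment names and scores copied from the submissions list in this notebook
exps = ['exp_000', 'exp_001', 'exp_003', 'exp_005', 'exp_006', 'exp_007',
        'exp_009', 'exp_012', 'exp_024', 'exp_026', 'exp_030']
cv = np.array([0.011081, 0.012297, 0.010501, 0.010430, 0.009749, 0.009262,
               0.009192, 0.009004, 0.008689, 0.008465, 0.008298])
lb = np.array([0.09816, 0.10649, 0.09719, 0.09691, 0.09457, 0.09316,
               0.09364, 0.09134, 0.08929, 0.08875, 0.08772])

loo_intercepts = []
for i in range(len(cv)):
    # Boolean mask that drops submission i
    keep = np.arange(len(cv)) != i
    # polyfit with degree 1 returns [slope, intercept]
    slope_i, intercept_i = np.polyfit(cv[keep], lb[keep], 1)
    loo_intercepts.append(intercept_i)
    print(f'without {exps[i]}: LB = {slope_i:.2f} x CV + {intercept_i:.4f}')

loo_intercepts = np.array(loo_intercepts)
print(f'Intercept range: {loo_intercepts.min():.4f} to {loo_intercepts.max():.4f}')
```

A narrow range suggests the intercept estimate does not hinge on any one submission; a wide range would point to an influential point worth inspecting.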

In [None]:
# What approaches haven't been tried?
print('=== Unexplored Approaches ===')
print()
print('1. k-Nearest Neighbors (k-NN)')
print('   - Completely different inductive bias')
print('   - Predicts based on similar solvents in feature space')
print('   - May generalize differently to unseen solvents')
print()
print('2. Kernel Ridge Regression with RBF kernel')
print('   - Non-parametric, kernel-based approach')
print('   - Similar to GP but without uncertainty')
print('   - May capture non-linear relationships better')
print()
print('3. Solvent Clustering + Per-Cluster Models')
print('   - Group solvents by similarity')
print('   - Train separate models per cluster')
print('   - Reduces distribution shift within each cluster')
print()
print('4. Adversarial Validation')
print('   - Identify features that distinguish train vs test solvents')
print('   - Remove or down-weight those features')
print('   - May reduce the distribution shift')
print()
print('5. Meta-Learning / MAML')
print('   - Learn to adapt quickly to new solvents')
print('   - May generalize better to unseen solvents')
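
As one possible starting point for item 1, the sketch below implements k-NN regression with leave-one-solvent-out evaluation, which mimics predicting for a solvent never seen in training. All of the data is a placeholder: the feature matrix `X`, targets `y`, and `solvent_ids` are randomly generated, and `knn_predict`, `k=3`, and Euclidean distance on standardized features are illustrative choices rather than settings used in these experiments. With scikit-learn, `KNeighborsRegressor` and `LeaveOneGroupOut` would be the usual tools for the same job.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict each query row as the mean target of its k nearest training rows."""
    # Standardize with training statistics only, to avoid leaking held-out data
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-12
    Xt = (X_train - mu) / sigma
    Xq = (X_query - mu) / sigma
    # Pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(Xq[:, None, :] - Xt[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Placeholder data: 8 hypothetical solvents, 20 rows each, 5 features
rng = np.random.default_rng(0)
solvent_ids = np.repeat(np.arange(8), 20)
X = rng.normal(size=(160, 5))
y = X[:, 0] * 0.5 + rng.normal(scale=0.1, size=160)

# Leave-one-solvent-out: hold out every row of one solvent at a time
fold_mse = {}
for s in np.unique(solvent_ids):
    held_out = solvent_ids == s
    pred = knn_predict(X[~held_out], y[~held_out], X[held_out], k=3)
    fold_mse[int(s)] = float(np.mean((pred - y[held_out]) ** 2))

print(f'Mean leave-one-solvent-out MSE: {np.mean(list(fold_mse.values())):.4f}')
```

Grouping the folds by solvent matters here: a random row-level split would let the model see the held-out solvent during training and would likely understate the error on unseen solvents.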

In [None]:
# Key insight: The feature selection experiment (exp_036) was 16.83% worse in CV
# but we don't know if it would be better on LB
# The hypothesis was that simpler models would have a lower intercept

print('=== Feature Selection Experiment (exp_036) ===')
print()
print('CV: 0.009573 (16.83% worse than best CV 0.008194)')
print()
print('Predicted LB using current relationship:')
lb_pred_036 = slope * 0.009573 + intercept
print(f'  LB = {slope:.2f} × 0.009573 + {intercept:.4f} = {lb_pred_036:.5f}')
print()
print('If feature selection reduces the intercept (slope assumed unchanged):')
for new_intercept in [0.04, 0.03, 0.02]:
    new_lb = slope * 0.009573 + new_intercept
    print(f'  Intercept={new_intercept:.2f}: LB = {new_lb:.5f}')
print()
print('Key question: Does feature selection reduce the intercept?')
print('This can only be answered by submitting exp_036.')
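
Rearranging the fitted line shows what intercept exp_036 would need at its CV to hit a given LB, assuming the slope stays at its current value. The sketch below uses the rounded slope reported at the top of this notebook (4.27), so its results are approximate; the variable names are illustrative.

```python
# Rounded slope and scores quoted in this notebook
SLOPE = 4.27
CV_036 = 0.009573
BEST_LB = 0.08772      # exp_030
TARGET_LB = 0.0347

# LB = SLOPE * CV + intercept  =>  intercept = LB - SLOPE * CV
slope_term = SLOPE * CV_036
break_even_intercept = BEST_LB - slope_term
target_intercept = TARGET_LB - slope_term

print(f'Slope term at CV={CV_036}: {slope_term:.5f}')
print(f'Intercept needed to match best LB: {break_even_intercept:.5f}')
print(f'Intercept needed to reach target:  {target_intercept:.5f}')
```

The slope term alone (about 0.0409) already exceeds the target, so at this CV a lower intercept cannot reach 0.0347 unless the slope also changes. Beating the current best LB is a much weaker requirement: under the same assumption, any intercept below the break-even value would do.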

In [None]:
# Summary and recommendations
print('=== SUMMARY ===')
print()
print('Current State:')
print(f'  - Best CV: 0.008194 (exp_032)')
print(f'  - Best LB: 0.0877 (exp_030)')
print(f'  - Target: 0.0347')
print(f'  - Gap: {0.0877/0.0347:.1f}x')
print()
print('The Problem:')
print(f'  - CV-LB relationship: LB = {slope:.2f}×CV + {intercept:.4f}')
print(f'  - Intercept ({intercept:.4f}) > Target (0.0347)')
print(f'  - Even with CV=0, LB would be {intercept:.4f}')
print()
print('Recommendations:')
print('  1. Submit exp_036 (feature selection) to test if it reduces intercept')
print('  2. Try k-NN as a completely different approach')
print('  3. Try adversarial validation to identify distribution shift')
print('  4. Consider solvent clustering for per-cluster models')
print()
print('Remaining submissions: 5')
print('Each submission is precious - use strategically!')