# Loop 28 Analysis: Understanding the CV-LB Gap

**Current State:**
- Best CV: 0.008465 (exp_026)
- Best LB: 0.08875 (exp_026)
- Target: 0.01727
- CV-LB ratio: ~10.5x
- Linear fit: LB = 4.22*CV + 0.0533 (R²=0.96)

**Critical Insight:**
The intercept (0.0533) is about 3.1x the target (0.01727). Even at CV=0, the fitted line predicts LB ≈ 0.0533, so lowering CV alone cannot reach the target unless the CV-LB relationship itself changes.
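
To make this concrete, the fitted line can be inverted to ask what CV would be needed to hit the target. This is a minimal sketch using the rounded coefficients from the summary above; the variable names are illustrative only.

```python
# Rounded coefficients of the fitted line LB = slope * CV + intercept
slope = 4.22
intercept = 0.0533
target_lb = 0.01727

# Invert the line: CV at which the fit would predict the target LB
required_cv = (target_lb - intercept) / slope
print(f'CV required to hit target: {required_cv:.5f}')

# A negative value means no achievable CV reaches the target under this fit
print(f'Reachable by CV improvement alone: {required_cv >= 0}')
```

The required CV comes out negative (about -0.0085), which is impossible for an error metric. Under this fit, the target is only reachable if the intercept itself drops.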

**Latest Experiment (exp_027):**
- Tested a smaller feature set (23 features vs 145) - FAILED
- CV 0.009150 (8.09% worse than exp_026)
- DRFP features ARE valuable

**Key Question:**
What can we do to REDUCE the CV-LB gap, not just improve CV?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# All 10 submissions with CV and LB scores
submissions = [
    {'id': 'exp_000', 'cv': 0.011081, 'lb': 0.09816},
    {'id': 'exp_001', 'cv': 0.012297, 'lb': 0.10649},
    {'id': 'exp_003', 'cv': 0.010501, 'lb': 0.09719},
    {'id': 'exp_005', 'cv': 0.01043, 'lb': 0.09691},
    {'id': 'exp_006', 'cv': 0.009749, 'lb': 0.09457},
    {'id': 'exp_007', 'cv': 0.009262, 'lb': 0.09316},
    {'id': 'exp_009', 'cv': 0.009192, 'lb': 0.09364},
    {'id': 'exp_012', 'cv': 0.009004, 'lb': 0.09134},
    {'id': 'exp_024', 'cv': 0.008689, 'lb': 0.08929},
    {'id': 'exp_026', 'cv': 0.008465, 'lb': 0.08875},
]

df = pd.DataFrame(submissions)
print('All submissions:')
print(df.to_string(index=False))

# Linear fit
cv = df['cv'].values
lb = df['lb'].values
slope, intercept, r_value, p_value, std_err = stats.linregress(cv, lb)
print(f'\nLinear fit: LB = {slope:.4f} * CV + {intercept:.5f}')
print(f'R² = {r_value**2:.4f}')

In [None]:
# Analyze the residuals - which experiments deviate from the linear fit?
print('=== Residual Analysis ===')
predicted_lb = slope * cv + intercept
residuals = lb - predicted_lb

for i, row in df.iterrows():
    print(f'{row["id"]}: Residual = {residuals[i]:.5f} ({"better" if residuals[i] < 0 else "worse"} than predicted)')

print(f'\nBest residual: {df.iloc[residuals.argmin()]["id"]} ({residuals.min():.5f})')
print(f'Worst residual: {df.iloc[residuals.argmax()]["id"]} ({residuals.max():.5f})')

In [None]:
# What's special about experiments with negative residuals (better than predicted)?
print('=== Experiments with Negative Residuals (Better Generalization) ===')
for i, row in df.iterrows():
    if residuals[i] < 0:
        print(f'{row["id"]}: CV={row["cv"]:.6f}, LB={row["lb"]:.5f}, Residual={residuals[i]:.5f}')

print('\n=== Experiments with Positive Residuals (Worse Generalization) ===')
for i, row in df.iterrows():
    if residuals[i] > 0:
        print(f'{row["id"]}: CV={row["cv"]:.6f}, LB={row["lb"]:.5f}, Residual={residuals[i]:.5f}')

In [None]:
# What approaches were used in experiments with better generalization?
print('=== Approach Analysis ===')
approaches = {
    'exp_000': 'MLP [128,128,64], Spange only, HuberLoss, 3 models',
    'exp_001': 'LightGBM, Spange only',
    'exp_003': 'MLP [256,128,64], Spange+DRFP, HuberLoss, 5 models',
    'exp_005': 'MLP [256,128,64], Spange+DRFP, HuberLoss, 15 models',
    'exp_006': 'MLP [64,32], Spange+DRFP, HuberLoss, 5 models',
    'exp_007': 'MLP [32,16], Spange+DRFP, HuberLoss, 5 models',
    'exp_009': 'Ridge Regression, Spange+DRFP',
    'exp_012': 'MLP [32,16] + LightGBM ensemble, Spange+DRFP',
    'exp_024': 'MLP [32,16] + LightGBM, Spange+DRFP+ACS_PCA',
    'exp_026': 'MLP [32,16] + LightGBM, Spange+DRFP+ACS_PCA, Weighted Loss [1,1,2]',
}

print('\nBetter generalization (negative residuals):')
for i, row in df.iterrows():
    if residuals[i] < 0:
        print(f'  {row["id"]}: {approaches[row["id"]]}')

print('\nWorse generalization (positive residuals):')
for i, row in df.iterrows():
    if residuals[i] > 0:
        print(f'  {row["id"]}: {approaches[row["id"]]}')

In [None]:
# Key insight: What's the pattern?
print('=== Pattern Analysis ===')
print('\nObservations:')
print('1. exp_000 (Spange only, fewer features) has negative residual')
print('2. exp_003, exp_005 (larger models) have negative residuals')
print('3. exp_009 (Ridge Regression) has positive residual')
print('4. exp_007 (simpler MLP) has positive residual')
print('5. exp_024, exp_026 (best CV) have negative residuals')

print('\nConclusion:')
print('- No clear pattern between model complexity and generalization')
print('- The residuals are small (RMSE ~0.001) - the linear fit is very tight')
print('- The CV-LB gap is STRUCTURAL, not due to specific model choices')
print('- The gap is likely due to evaluation procedure differences')
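
The RMSE figure quoted in the conclusion is not computed in the cells above, so a quick check may be useful. The sketch below refits the line with `np.polyfit` (a stand-in for the `stats.linregress` call used earlier) on the same 10 submissions and compares the residual RMSE with the spread of LB scores.

```python
import numpy as np

# CV and LB scores of the 10 submissions listed above
cv = np.array([0.011081, 0.012297, 0.010501, 0.01043, 0.009749,
               0.009262, 0.009192, 0.009004, 0.008689, 0.008465])
lb = np.array([0.09816, 0.10649, 0.09719, 0.09691, 0.09457,
               0.09316, 0.09364, 0.09134, 0.08929, 0.08875])

# Ordinary least-squares line; np.polyfit returns [slope, intercept] for deg=1
slope, intercept = np.polyfit(cv, lb, deg=1)
residuals = lb - (slope * cv + intercept)

# Root-mean-square residual, compared against the range of LB scores
rmse = np.sqrt(np.mean(residuals ** 2))
lb_spread = lb.max() - lb.min()
print(f'Residual RMSE: {rmse:.5f}')
print(f'LB range:      {lb_spread:.5f}')
print(f'RMSE / range:  {rmse / lb_spread:.1%}')
```

If the RMSE is a small fraction of the LB range, the scatter around the line is small relative to the differences between submissions. That is what the "tight fit" claim rests on.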

In [None]:
# What if the evaluation uses a different CV scheme?
print('=== Hypothesis: Different CV Scheme ===')
print('\nOur CV scheme:')
print('- Single solvents: Leave-one-solvent-out (24 folds)')
print('- Mixtures: Leave-one-ramp-out (13 folds)')
print('- Total: 37 folds')

print('\nPossible LB CV scheme:')
print('- GroupKFold (5 folds) as seen in "mixall" kernel')
print('- Different random seed')
print('- Different data ordering')

print('\nKey insight from "mixall" kernel:')
print('- Uses GroupKFold(n_splits=5) instead of Leave-One-Out')
print('- This is a DIFFERENT CV scheme!')
print('- Our local CV may not match the LB evaluation')
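
To see how the two schemes differ in practice, the sketch below counts folds and held-out groups for leave-one-group-out versus `GroupKFold(n_splits=5)`. The data is a synthetic stand-in: 24 groups match the single-solvent fold count above, but the 10 rows per group and the placeholder features are assumptions. It does not establish which scheme the LB actually uses.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Synthetic stand-in: 24 solvent groups with 10 rows each (row count is made up)
n_groups, rows_per_group = 24, 10
groups = np.repeat(np.arange(n_groups), rows_per_group)
X = np.zeros((groups.size, 1))  # placeholder features; only the splits matter here


def summarize(splitter, name):
    """Print fold count and mean number of held-out groups per fold; return both."""
    held_out = [np.unique(groups[test_idx]).size
                for _, test_idx in splitter.split(X, groups=groups)]
    print(f'{name}: {len(held_out)} folds, '
          f'{np.mean(held_out):.1f} groups held out per fold')
    return len(held_out), float(np.mean(held_out))


logo_folds, logo_groups = summarize(LeaveOneGroupOut(), 'Leave-one-solvent-out')
gkf_folds, gkf_groups = summarize(GroupKFold(n_splits=5), 'GroupKFold(5)')
```

With 5 folds, each model is trained with several solvents withheld at once rather than one. That is a harder extrapolation task and could plausibly produce higher errors than the local leave-one-out estimate.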

In [None]:
# What approaches remain unexplored?
print('=== UNEXPLORED Approaches ===')
unexplored = [
    ('XGBoost/CatBoost ensemble', 'Different tree algorithms may generalize differently'),
    ('Stacking meta-learner', 'Train a meta-model on base predictions'),
    ('Higher SM weights [1,1,3] or [1,1,4]', 'SM is still the bottleneck'),
    ('Learned loss weights (homoscedastic)', 'Kendall et al. uncertainty weighting'),
    ('Consistency constraint (SM+P2+P3≈1)', 'Physical constraint for regularization'),
    ('Different CV scheme (GroupKFold)', 'May match LB evaluation better'),
    ('Adversarial validation', 'Identify features causing distribution shift'),
    ('Domain adaptation techniques', 'Handle distribution shift explicitly'),
]

for approach, rationale in unexplored:
    print(f'\n{approach}:')
    print(f'  Rationale: {rationale}')
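
The consistency constraint in the list could take the form of a penalty on how far each row's three predicted yields are from summing to 1. This is a minimal sketch, not the pipeline's code: `mass_balance_penalty` is a hypothetical helper, and the example predictions are made up.

```python
import numpy as np


def mass_balance_penalty(preds):
    """Mean squared deviation of each row's predicted yields from summing to 1.

    `preds` has shape (n_samples, 3), one column per target.
    """
    preds = np.asarray(preds, dtype=float)
    return float(np.mean((preds.sum(axis=1) - 1.0) ** 2))


# Hypothetical predictions: first row balances exactly, second overshoots by 0.2
example = [[0.2, 0.3, 0.5],
           [0.4, 0.4, 0.4]]
penalty = mass_balance_penalty(example)
print(f'Mass-balance penalty: {penalty:.4f}')
```

In training, a term like this would be added to the main loss with a small coefficient. If side products or losses mean the yields legitimately sum to less than 1, a one-sided penalty on sums above 1 may be more appropriate.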

In [None]:
# Priority ranking based on potential impact
print('=== Priority Ranking ===')
print('\n1. HIGHEST PRIORITY: XGBoost/CatBoost Ensemble')
print('   - We only have MLP + LightGBM')
print('   - XGBoost and CatBoost are different algorithms')
print('   - May capture different patterns')
print('   - Easy to implement')

print('\n2. HIGH PRIORITY: Higher SM Weights [1,1,3]')
print('   - SM is still the hardest target')
print('   - Weighted loss [1,1,2] improved all targets')
print('   - More aggressive weighting may help further')

print('\n3. MEDIUM PRIORITY: Stacking Meta-Learner')
print('   - Train a simple model on base predictions')
print('   - Can learn optimal combination weights')
print('   - May improve generalization')

print('\n4. LOWER PRIORITY: Consistency Constraint')
print('   - SM + P2 + P3 ≈ 1 (mass balance)')
print('   - Physical constraint for regularization')
print('   - May improve predictions near boundaries')
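
For the SM-weighting item, a per-target weighted Huber loss could look like the NumPy sketch below. The column order (SM last), the normalization by the weight sum, and `delta=1.0` are assumptions for illustration. The loss actually used in exp_026 may differ.

```python
import numpy as np


def weighted_huber(y_true, y_pred, weights, delta=1.0):
    """Huber loss averaged over samples, with a per-target weight on each column."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    abs_err = np.abs(err)
    # Quadratic inside |err| <= delta, linear outside
    per_elem = np.where(abs_err <= delta,
                        0.5 * err ** 2,
                        delta * (abs_err - 0.5 * delta))
    w = np.asarray(weights, dtype=float)
    # Weighted mean over targets (assumed normalization), then mean over samples
    return float(np.mean(per_elem @ w / w.sum()))


# Hypothetical single sample with an error of 0.1 on the last (SM) column only
y_true = [[0.0, 0.0, 0.0]]
y_pred = [[0.0, 0.0, 0.1]]
loss_112 = weighted_huber(y_true, y_pred, weights=[1, 1, 2])
loss_113 = weighted_huber(y_true, y_pred, weights=[1, 1, 3])
print(f'[1,1,2]: {loss_112:.4f}   [1,1,3]: {loss_113:.4f}')
```

Raising the last weight increases the share of the loss attributed to SM errors, at the cost of down-weighting the other two targets after normalization.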

In [None]:
# Final recommendation
print('=== FINAL RECOMMENDATION ===')
print('\nGiven:')
print('- 3 submissions remaining')
print('- CV-LB gap is ~10x (structural, not model-specific)')
print('- Target 0.01727 is ~5x lower than best LB 0.08875')

print('\nStrategy:')
print('1. Try XGBoost/CatBoost ensemble (new model diversity)')
print('2. Try higher SM weights [1,1,3] (target the bottleneck)')
print('3. Try stacking meta-learner (optimal combination)')

print('\nKey insight:')
print('The CV-LB gap is the fundamental problem.')
print('We need approaches that GENERALIZE better, not just improve CV.')
print('New model types (XGBoost, CatBoost) may have different generalization properties.')