# Loop 39 Analysis: Understanding the CV-LB Gap

**Key Discovery**: The "mixall" kernel uses GroupKFold(5) instead of leave-one-out CV.

**Questions to answer:**
1. What is the CV score with GroupKFold(5) vs leave-one-out?
2. Does GroupKFold(5) better predict the LB score?
3. What is the actual evaluation procedure on Kaggle?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import GroupKFold

# Load data
DATA_PATH = '/home/data'

df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
df_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print(f'Single solvent: {len(df_single)} samples, {df_single["SOLVENT NAME"].nunique()} solvents')
print(f'Full data: {len(df_full)} samples, {len(df_full[["SOLVENT A NAME", "SOLVENT B NAME"]].drop_duplicates())} ramps')

In [None]:
# Analyze the CV-LB relationship from all submissions
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
]

df_sub = pd.DataFrame(submissions)

# Linear fit
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(df_sub['cv'], df_sub['lb'])

print(f'CV-LB Relationship: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'\nTarget: 0.0347')
print(f'Intercept: {intercept:.4f}')
print(f'Gap: intercept ({intercept:.4f}) > target ({0.0347:.4f}) by {(intercept - 0.0347):.4f}')

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(df_sub['cv'], df_sub['lb'], s=100, alpha=0.7)
plt.plot([0, 0.015], [intercept, slope*0.015 + intercept], 'r--', label=f'LB = {slope:.2f}*CV + {intercept:.4f}')
plt.axhline(y=0.0347, color='g', linestyle=':', label='Target (0.0347)')
plt.xlabel('CV Score')
plt.ylabel('LB Score')
plt.title('CV-LB Relationship Across All Submissions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('/home/code/exploration/cv_lb_relationship.png', dpi=150)
plt.show()
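
One way to gauge how much the extrapolated intercept depends on any single submission is a leave-one-submission-out refit. The sketch below is an illustrative addition, not part of the original analysis: it re-enters the CV and LB values from the table above, uses `np.polyfit` as a stand-in for `stats.linregress`, and the names `jackknife_fits` and `TARGET_LB` are hypothetical.

```python
import numpy as np

# CV and LB scores copied from the submissions table above.
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877])
TARGET_LB = 0.0347  # target score quoted in this notebook (name is hypothetical)


def jackknife_fits(x, y):
    """Refit y = slope * x + intercept with each point left out once."""
    fits = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        # For deg=1, polyfit returns [slope, intercept].
        slope_i, intercept_i = np.polyfit(x[keep], y[keep], deg=1)
        fits.append((slope_i, intercept_i))
    return np.array(fits)


fits = jackknife_fits(cv, lb)
slopes, intercepts = fits[:, 0], fits[:, 1]
print(f'Slope range:     {slopes.min():.2f} to {slopes.max():.2f}')
print(f'Intercept range: {intercepts.min():.4f} to {intercepts.max():.4f}')
print(f'Refits with intercept above target: '
      f'{(intercepts > TARGET_LB).sum()} of {len(intercepts)}')
```

If every refit keeps the intercept above the target, the conclusion does not hinge on one unusual submission, although it still rests on extrapolating the line well below the observed CV range.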

In [None]:
# Key insight: the fitted intercept (0.0527) is higher than the target (0.0347).
# Extrapolating the line to CV=0 therefore still predicts an LB of about 0.0527.
# Caveat: CV=0 lies well below the observed CV range (0.0083 to 0.0123),
# so this is an extrapolation of the fit, not a measured bound.

# What would CV need to be to reach target?
cv_for_target = (0.0347 - intercept) / slope
print(f'To reach target LB=0.0347, CV would need to be: {cv_for_target:.6f}')
if cv_for_target < 0:
    print('This is NEGATIVE, which is impossible for a non-negative error score!')

# What is the lowest LB the fitted line predicts for the current approach?
print(f'\nExtrapolated LB at CV=0: {intercept:.4f}')
print(f'Best CV so far: 0.008199')
print(f'Predicted LB at best CV: {slope * 0.008199 + intercept:.4f}')
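
The required CV follows from inverting the fitted line. As a worked check, write the fit as LB = a · CV + b, where a and b are shorthand introduced here for the slope and intercept, and plug in the rounded coefficients quoted in the summary at the end of this notebook (a = 4.27, b = 0.0527). These may differ slightly from the values `linregress` returns.

$$
\text{CV}^{*} = \frac{\text{LB}_{\text{target}} - b}{a}
= \frac{0.0347 - 0.0527}{4.27}
= \frac{-0.0180}{4.27}
\approx -0.0042
$$

Because the numerator is negative whenever b exceeds the target, no non-negative CV satisfies the equation under this fit.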

In [None]:
# Hypothesis: The CV-LB gap might be due to:
# 1. Different CV scheme (GroupKFold vs leave-one-out)
# 2. Additional test data not in training set
# 3. Model variance between runs

# Let's analyze the structure of the data to understand the gap

# Single solvent: 24 solvents, 656 samples
# Full data: 13 ramps, 1227 samples

print('=== Single Solvent Data ===')
print(f'Total samples: {len(df_single)}')
print(f'Unique solvents: {df_single["SOLVENT NAME"].nunique()}')
print(f'Samples per solvent: {len(df_single) / df_single["SOLVENT NAME"].nunique():.1f}')

print('\n=== Full Data ===')
print(f'Total samples: {len(df_full)}')
ramps = df_full[['SOLVENT A NAME', 'SOLVENT B NAME']].drop_duplicates()
print(f'Unique ramps: {len(ramps)}')
print(f'Samples per ramp: {len(df_full) / len(ramps):.1f}')

In [None]:
# Compare leave-one-out vs GroupKFold(5) for single solvent

# Leave-one-out: 24 folds, each fold tests on 1 solvent (~27 samples)
# GroupKFold(5): 5 folds, each fold tests on ~5 solvents (~131 samples)

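# Derive the fold count from the data instead of hard-coding 24
n_solvents = df_single["SOLVENT NAME"].nunique()
print('=== Leave-One-Out CV ===')
print(f'Number of folds: {n_solvents}')
print(f'Test samples per fold: ~{len(df_single) / n_solvents:.0f}')
print(f'Train samples per fold: ~{len(df_single) * (n_solvents - 1) / n_solvents:.0f}')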

print('\n=== GroupKFold(5) CV ===')
print(f'Number of folds: 5')
print(f'Test samples per fold: ~{len(df_single) / 5:.0f}')
print(f'Train samples per fold: ~{len(df_single) * 4 / 5:.0f}')

print('\n=== Key Difference ===')
print('Leave-one-out tests on 1 solvent at a time (more granular)')
print('GroupKFold(5) tests on ~5 solvents at a time (less granular)')
print('\nIf Kaggle evaluation uses GroupKFold, each model trains on fewer solvents, so our leave-one-out CV might be overly optimistic')
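
The first question at the top of this notebook (CV score under GroupKFold(5) versus leave-one-out) can be answered by running the same model through both splitters. The sketch below shows the mechanics only: the data are synthetic, `Ridge` is a placeholder for whichever model is being evaluated, and `grouped_cv_mse` is a hypothetical helper name. The printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

rng = np.random.default_rng(0)

# Synthetic stand-in for the single-solvent table: 24 groups ("solvents"),
# 27 rows each, with a per-group offset so held-out groups are harder.
n_groups, rows_per_group = 24, 27
groups = np.repeat(np.arange(n_groups), rows_per_group)
X = rng.normal(size=(groups.size, 3))
group_offset = rng.normal(scale=0.5, size=n_groups)[groups]
noise = rng.normal(scale=0.1, size=groups.size)
y = X @ np.array([0.5, -0.2, 0.1]) + group_offset + noise


def grouped_cv_mse(splitter):
    """Return the per-fold test MSE for a group-aware splitter."""
    fold_mse = []
    for train_idx, test_idx in splitter.split(X, y, groups):
        # A group must never appear on both sides of a split.
        assert not set(groups[train_idx]) & set(groups[test_idx])
        model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
        fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    return np.array(fold_mse)


logo_mse = grouped_cv_mse(LeaveOneGroupOut())
gkf_mse = grouped_cv_mse(GroupKFold(n_splits=5))
print(f'LeaveOneGroupOut: {len(logo_mse)} folds, mean MSE = {logo_mse.mean():.4f}')
print(f'GroupKFold(5):    {len(gkf_mse)} folds, mean MSE = {gkf_mse.mean():.4f}')
```

On the real data, `groups` would be `df_single['SOLVENT NAME']`, and `X`, `y`, and the model would come from the experiment under test.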

In [None]:
# Analyze the submission file structure
import os

submission_path = '/home/submission/submission.csv'
if os.path.exists(submission_path):
    sub = pd.read_csv(submission_path)
    print(f'Submission shape: {sub.shape}')
    print(f'\nColumns: {sub.columns.tolist()}')
    print(f'\nTask distribution:')
    print(sub['task'].value_counts())
    print(f'\nFold distribution for task 0 (single solvent):')
    print(sub[sub['task'] == 0]['fold'].value_counts().sort_index())
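    print(f'\nFold distribution for task 1 (full data):')
    print(sub[sub['task'] == 1]['fold'].value_counts().sort_index())
else:
    # Make a missing file visible instead of silently printing nothing
    print(f'Submission file not found: {submission_path}')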
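
The fold counts can also be checked programmatically instead of by eye. The following is a minimal sketch under stated assumptions: it uses the `task` and `fold` columns from the cell above, takes the expected counts (24 solvents, 13 ramps) from the dataset summary in this notebook, and runs on a tiny synthetic frame, not the real submission. `EXPECTED_FOLDS`, `folds_per_task`, and the `row` column are hypothetical names.

```python
import pandas as pd

# Expected distinct folds per task: 24 solvents (task 0), 13 ramps (task 1).
EXPECTED_FOLDS = {0: 24, 1: 13}


def folds_per_task(sub):
    """Count distinct fold ids for each task in a submission frame."""
    return sub.groupby('task')['fold'].nunique().to_dict()


# Synthetic submission with two rows per fold, standing in for the real file.
rows = [
    {'task': task, 'fold': fold, 'row': row}
    for task, n_folds in EXPECTED_FOLDS.items()
    for fold in range(n_folds)
    for row in range(2)
]
demo_sub = pd.DataFrame(rows)

counts = folds_per_task(demo_sub)
for task, expected in EXPECTED_FOLDS.items():
    status = 'OK' if counts.get(task) == expected else 'MISMATCH'
    print(f'task {task}: {counts.get(task)} folds (expected {expected}) -> {status}')
```

Passing the real `sub` frame to `folds_per_task` would give the same check against the actual file.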

In [None]:
# The submission file structure shows:
# - Task 0: Single solvent, 24 folds (leave-one-out)
# - Task 1: Full data, 13 folds (leave-one-ramp-out)

# This confirms that the evaluation uses leave-one-out CV, NOT GroupKFold!
# So the "mixall" kernel's approach of using GroupKFold is just for faster local CV,
# but the actual submission still uses leave-one-out.

print('=== CRITICAL INSIGHT ===')
print('The submission file structure shows 24 folds for single solvent and 13 folds for full data.')
print('This confirms that the Kaggle evaluation uses leave-one-out CV, NOT GroupKFold!')
print('\nThe "mixall" kernel uses GroupKFold for faster local CV, but the submission')
print('still generates predictions for all 24+13 folds using leave-one-out.')
print('\nThis means the CV-LB gap is NOT due to CV scheme mismatch.')

In [None]:
# So what causes the CV-LB gap?
# Possibilities:
# 1. Model variance between runs (different random seeds)
# 2. Additional test data not in training set
# 3. Different evaluation metric (e.g., weighted MSE)
# 4. Overfitting to local CV

# Let's check if there's any pattern in the residuals

residuals = df_sub['lb'] - (slope * df_sub['cv'] + intercept)
print('Residuals from linear fit:')
for i, row in df_sub.iterrows():
    predicted = slope * row['cv'] + intercept
    print(f'{row["exp"]}: CV={row["cv"]:.4f}, LB={row["lb"]:.4f}, '
          f'Predicted={predicted:.4f}, Residual={residuals[i]:.4f}')

print(f'\nMean absolute residual: {np.abs(residuals).mean():.4f}')
print(f'Max absolute residual: {np.abs(residuals).max():.4f}')

In [None]:
# The residuals are small relative to the LB scores (at most about 0.002 on LB
# values near 0.09 for the submissions listed above), confirming the linear relationship is strong.
# The CV-LB gap is STRUCTURAL, not random.

# Key insight: The intercept (0.0527) represents a "baseline error" that exists
# regardless of model quality. This could be due to:
# 1. Systematic bias in predictions
# 2. Additional test data with different distribution
# 3. Evaluation on a different metric

# Let's analyze the target distribution to understand potential biases

print('=== Target Distribution ===')
print('\nSingle Solvent:')
for col in ['Product 2', 'Product 3', 'SM']:
    print(f'{col}: mean={df_single[col].mean():.4f}, std={df_single[col].std():.4f}, min={df_single[col].min():.4f}, max={df_single[col].max():.4f}')

print('\nFull Data:')
for col in ['Product 2', 'Product 3', 'SM']:
    print(f'{col}: mean={df_full[col].mean():.4f}, std={df_full[col].std():.4f}, min={df_full[col].min():.4f}, max={df_full[col].max():.4f}')

In [None]:
# Summary of findings:
# 1. CV-LB relationship is highly linear: LB = 4.27 * CV + 0.0527 (R² = 0.967)
# 2. The intercept (0.0527) > target (0.0347), so the fitted line never reaches the target
#    for any CV >= 0; if the fit extrapolates, the target is IMPOSSIBLE with the current approach
# 3. The submission file structure confirms leave-one-out CV is used for evaluation
# 4. The "mixall" kernel's GroupKFold approach is just for faster local CV, not for submission
# 5. The CV-LB gap is STRUCTURAL, not due to CV scheme mismatch

# What can we do?
# 1. Try to reduce the intercept by changing the model architecture
# 2. Try to reduce the slope by improving model generalization
# 3. Try a fundamentally different approach that might have a different CV-LB relationship

print('=== STRATEGIC OPTIONS ===')
print('\n1. SUBMIT BEST CV MODEL (exp_041 with CV 0.008199)')
print('   - Expected LB: ~0.0877 (same as exp_030)')
print('   - Purpose: Verify CV-LB relationship still holds')
print('\n2. TRY XGBoost IN ENSEMBLE')
print('   - The "mixall" kernel uses MLP + XGBoost + RF + LightGBM')
print('   - We have MLP + GP + LGBM, but no XGBoost')
print('   - Adding XGBoost might provide different inductive bias')
print('\n3. TRY DIFFERENT LOSS FUNCTION')
print('   - Current: MSE loss')
print('   - Try: MAE loss, Huber loss, or custom loss')
print('   - Different loss might change the CV-LB relationship')
print('\n4. TRY DOMAIN ADAPTATION')
print('   - The CV-LB gap might be due to distribution shift')
print('   - Domain adaptation techniques might help')
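
For option 3, the difference between the candidate losses is easiest to see on a small example. The sketch below is a plain NumPy illustration, not the training code used in these experiments: `huber_loss` is a hypothetical helper, the yields are made-up numbers with one badly mispredicted sample, and `delta=0.1` is an arbitrary choice.

```python
import numpy as np


def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |residual| <= delta, linear beyond it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * residual ** 2
    linear = delta * (residual - 0.5 * delta)
    return float(np.mean(np.where(residual <= delta, quadratic, linear)))


# Hypothetical yields; the last prediction is far off.
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_pred = np.array([0.12, 0.18, 0.31, 0.90])

mse = float(np.mean((y_true - y_pred) ** 2))
mae = float(np.mean(np.abs(y_true - y_pred)))
huber = huber_loss(y_true, y_pred, delta=0.1)
print(f'MSE:   {mse:.4f}')
print(f'MAE:   {mae:.4f}')
print(f'Huber: {huber:.4f} (delta=0.1)')
```

MSE squares the large residual, so that one sample dominates the total, while MAE and Huber grow only linearly with it. Whether that changes the CV-LB relationship here is an open question that only a submission can answer.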