# Loop 58 Analysis: Critical Strategic Assessment

**Key Finding from exp_059**: Physical constraint normalization provided only 0.05% improvement.
- Only ~4% of predictions violated the constraint (sum > 1)
- The base model CV (0.009611) was about 17.3% worse than the best CV (0.008194)
- This approach is NOT the solution to the CV-LB gap

**Critical Situation**:
- Best CV: 0.008194 (exp_035)
- Best LB: 0.0877 (exp_030)
- Target: 0.0347
- CV-LB relationship: LB = 4.23×CV + 0.0533 (R²=0.98)
- **Intercept (0.0533) > Target (0.0347)**: even CV = 0 would predict LB ≈ 0.0533, so CV minimization alone cannot reach the target!
- Only 3 submissions remaining
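
The intercept argument can be written out explicitly. This is a sketch that assumes the fitted line keeps holding well below the observed CV range (an extrapolation) and that CV cannot be negative:

```latex
\[
\mathrm{LB} = 4.23\,\mathrm{CV} + 0.0533 \;\ge\; 0.0533 \;>\; 0.0347
\quad \text{for all } \mathrm{CV} \ge 0,
\]
\[
\mathrm{LB} \le 0.0347
\iff \mathrm{CV} \le \frac{0.0347 - 0.0533}{4.23} \approx -0.0044 .
\]
```

Under those assumptions the target would need a negative CV, which is why only a change in the CV-LB relationship itself (a lower intercept) could close the gap.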

In [None]:
import numpy as np
import pandas as pd
from scipy import stats

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.011081, 'lb': 0.09816},
    {'exp': 'exp_001', 'cv': 0.012297, 'lb': 0.10649},
    {'exp': 'exp_003', 'cv': 0.010501, 'lb': 0.09719},
    {'exp': 'exp_005', 'cv': 0.01043, 'lb': 0.09691},
    {'exp': 'exp_006', 'cv': 0.009749, 'lb': 0.09457},
    {'exp': 'exp_007', 'cv': 0.009262, 'lb': 0.09316},
    {'exp': 'exp_009', 'cv': 0.009192, 'lb': 0.09364},
    {'exp': 'exp_012', 'cv': 0.009004, 'lb': 0.09134},
    {'exp': 'exp_024', 'cv': 0.008689, 'lb': 0.08929},
    {'exp': 'exp_026', 'cv': 0.008465, 'lb': 0.08875},
    {'exp': 'exp_030', 'cv': 0.008298, 'lb': 0.08772},
    {'exp': 'exp_041', 'cv': 0.009002, 'lb': 0.09321},
    {'exp': 'exp_042', 'cv': 0.014503, 'lb': 0.11465},
]

df = pd.DataFrame(submissions)
print(f'Total submissions: {len(df)}')
print(df)

In [None]:
# Fit linear regression: LB = slope * CV + intercept
TARGET_LB = 0.0347
cv = df['cv'].values
lb = df['lb'].values

slope, intercept, r_value, p_value, std_err = stats.linregress(cv, lb)

print('\n=== CV-LB Relationship ===')
print(f'LB = {slope:.4f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'\nTarget LB: {TARGET_LB}')
print(f'Intercept: {intercept:.4f}')
# Only flag the problem when the fitted intercept really exceeds the target
if intercept > TARGET_LB:
    print(f'\n*** CRITICAL: Intercept ({intercept:.4f}) > Target ({TARGET_LB}) ***')
    print(f'Even with CV=0, predicted LB would be {intercept:.4f}')
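
Because the conclusion rests on a line fitted to 13 points, one of which (exp_042) sits far from the rest, it may be worth checking how much the fit depends on any single submission. The following is a minimal sketch of a leave-one-out refit using `np.polyfit`; the helper names `fit_line` and `leave_one_out_fits` are illustrative and not part of the original analysis.

```python
import numpy as np

# CV and LB scores copied from the submission history above.
cv = np.array([0.011081, 0.012297, 0.010501, 0.01043, 0.009749, 0.009262, 0.009192,
               0.009004, 0.008689, 0.008465, 0.008298, 0.009002, 0.014503])
lb = np.array([0.09816, 0.10649, 0.09719, 0.09691, 0.09457, 0.09316, 0.09364,
               0.09134, 0.08929, 0.08875, 0.08772, 0.09321, 0.11465])
TARGET_LB = 0.0347


def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)


def leave_one_out_fits(x, y):
    """Refit the line once per point, each time with that point removed."""
    fits = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        fits.append(fit_line(x[keep], y[keep]))
    return np.array(fits)  # shape (n, 2): columns are slope, intercept


full_slope, full_intercept = fit_line(cv, lb)
loo = leave_one_out_fits(cv, lb)

print(f'Full fit: LB = {full_slope:.3f} * CV + {full_intercept:.4f}')
print(f'Leave-one-out slope range:     {loo[:, 0].min():.3f} to {loo[:, 0].max():.3f}')
print(f'Leave-one-out intercept range: {loo[:, 1].min():.4f} to {loo[:, 1].max():.4f}')
print(f'Intercept above target in every refit: {bool((loo[:, 1] > TARGET_LB).all())}')
```

If the intercept range stays above the target in every refit, the "intercept > target" conclusion does not hinge on one submission.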

In [None]:
# Analyze residuals - which experiments generalize best?
df['predicted_lb'] = slope * df['cv'] + intercept
df['residual'] = df['lb'] - df['predicted_lb']

print('=== RESIDUAL ANALYSIS ===')
print('Negative residual = better generalization than expected')
print('Positive residual = worse generalization than expected')
print()
print(df[['exp', 'cv', 'lb', 'predicted_lb', 'residual']].sort_values('residual'))

In [None]:
# What would it take to reach the target?
target_lb = 0.0347
best_cv = 0.008194  # exp_035

print('=== WHAT WOULD IT TAKE TO REACH TARGET? ===')
print()

# Option 1: Reduce intercept, keeping the fitted slope and the best CV
required_intercept = target_lb - slope * best_cv
print(f'Option 1: Reduce intercept (with best CV {best_cv})')
print(f'  Required intercept: {required_intercept:.6f}')
print(f'  Current intercept: {intercept:.6f}')
print(f'  Reduction needed: {(intercept - required_intercept) / intercept * 100:.1f}%')
print()

# Option 2: Reduce slope, keeping the fitted intercept and the best CV
required_slope = (target_lb - intercept) / best_cv
note = ' (NEGATIVE - impossible!)' if required_slope < 0 else ''
print('Option 2: Reduce slope (with current intercept)')
print(f'  Required slope: {required_slope:.4f}{note}')
print()

# Option 3: Both. Pick an example CV and solve for the intercept that hits the target
example_cv = 0.004
example_intercept = target_lb - slope * example_cv
print('Option 3: Need BOTH lower CV AND lower intercept')
print(f'  Example: CV={example_cv}, intercept={example_intercept:.4f} -> '
      f'LB = {slope:.2f}*{example_cv} + {example_intercept:.4f} = {target_lb}')
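
The three options above all come from rearranging LB = slope × CV + intercept. A small reusable sketch of that arithmetic follows; the function names `cv_needed_for_target` and `intercept_needed_for_target` are hypothetical helpers, and the constants are the rounded values quoted in the summary at the top of this notebook.

```python
def cv_needed_for_target(target_lb, slope, intercept):
    """CV at which the line LB = slope * CV + intercept hits target_lb.

    Returns None when no non-negative CV can reach the target.
    """
    if slope <= 0:
        raise ValueError('slope must be positive')
    cv_needed = (target_lb - intercept) / slope
    return cv_needed if cv_needed >= 0 else None


def intercept_needed_for_target(target_lb, slope, cv):
    """Largest intercept that still reaches target_lb at the given CV."""
    return target_lb - slope * cv


# Rounded values quoted in the summary at the top of this notebook.
TARGET_LB, SLOPE, INTERCEPT, BEST_CV = 0.0347, 4.23, 0.0533, 0.008194

# None means the target is unreachable by lowering CV alone.
print(cv_needed_for_target(TARGET_LB, SLOPE, INTERCEPT))
# A target above the intercept, by contrast, maps to a concrete CV.
print(cv_needed_for_target(0.0900, SLOPE, INTERCEPT))
print(f'{intercept_needed_for_target(TARGET_LB, SLOPE, BEST_CV):.6f}')
```

Returning `None` instead of a negative number keeps an infeasible target from being mistaken for a usable CV goal.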

In [None]:
# Analyze what's different about best-generalizing experiments
print('=== BEST GENERALIZING EXPERIMENTS ===')
best_gen = df.nsmallest(3, 'residual')
for _, row in best_gen.iterrows():
    print(f"  {row['exp']}: CV={row['cv']:.6f}, LB={row['lb']:.5f}, Residual={row['residual']:.5f}")

print('\n=== WORST GENERALIZING EXPERIMENTS ===')
worst_gen = df.nlargest(3, 'residual')
for _, row in worst_gen.iterrows():
    print(f"  {row['exp']}: CV={row['cv']:.6f}, LB={row['lb']:.5f}, Residual={row['residual']:.5f}")

In [None]:
# Key insight: exp_000 (Spange only, 18 features) has the most negative residual,
# i.e. the best generalization, despite a worse CV than later experiments
res = df.set_index('exp')['residual']

print('=== KEY INSIGHT ===')
print(f"exp_000 (Spange only, 18 features) residual: {res['exp_000']:+.4f} "
      f"(most negative overall: {res.idxmin()})")
print(f"exp_030 (best LB) residual: {res['exp_030']:+.4f}")
print()
print('This suggests:')
print('1. Simpler features may generalize better')
print('2. Adding more features (DRFP, ACS PCA) improved CV more than it improved LB')
print('3. The CV-LB gap is partly due to overfitting to training distribution')

In [None]:
# What approaches have NOT been tried that could change the CV-LB relationship?
print('=== UNTRIED APPROACHES THAT COULD CHANGE CV-LB RELATIONSHIP ===')
print()
print('1. DOMAIN ADAPTATION / TRANSFER LEARNING')
print('   - Train on single solvent, fine-tune on mixtures')
print('   - Or vice versa')
print('   - Could reduce distribution shift')
print()
print('2. ADVERSARIAL VALIDATION')
print('   - Identify features that distinguish train/test distributions')
print('   - Remove or down-weight these features')
print('   - Could reduce overfitting to training distribution')
print()
print('3. DIFFERENT ENSEMBLE STRATEGY')
print('   - Current: Fixed weights (GP 0.15, MLP 0.55, LGBM 0.30)')
print('   - Try: Per-sample adaptive weights based on uncertainty')
print('   - Or: Blending with out-of-fold predictions')
print()
print('4. SIMPLER MODEL WITH BETTER GENERALIZATION')
print('   - exp_000 (Spange only) had best generalization')
print('   - Try: GP only with Spange features')
print('   - Or: Ridge regression with Spange features')
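
One way the "per-sample adaptive weights based on uncertainty" idea could look is inverse-variance weighting. This is a sketch under assumptions, not the ensemble used in any experiment here: it assumes every model can supply a per-sample standard deviation (for the MLP and LGBM that might come from a seed ensemble), and all numbers below are made up for illustration.

```python
import numpy as np


def inverse_variance_blend(means, stds, eps=1e-8):
    """Blend per-model predictions with per-sample inverse-variance weights.

    means, stds: arrays of shape (n_models, n_samples).
    Returns (blended predictions, weights); weights sum to 1 for each sample.
    """
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    precision = 1.0 / (stds ** 2 + eps)          # high uncertainty -> low weight
    weights = precision / precision.sum(axis=0)  # normalize across models
    return (weights * means).sum(axis=0), weights


# Hypothetical numbers for three models (e.g. GP, MLP, LGBM) on four samples.
means = np.array([[0.30, 0.52, 0.10, 0.80],
                  [0.34, 0.50, 0.14, 0.70],
                  [0.32, 0.55, 0.12, 0.75]])
stds = np.array([[0.02, 0.10, 0.05, 0.20],
                 [0.05, 0.02, 0.05, 0.05],
                 [0.04, 0.05, 0.05, 0.10]])

blend, weights = inverse_variance_blend(means, stds)
print(np.round(weights, 3))
print(np.round(blend, 4))
```

Unlike the fixed 0.15 / 0.55 / 0.30 split, the weights here vary per sample, so a model only dominates where it reports low uncertainty. Whether that helps depends on how well calibrated those uncertainties are.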

In [None]:
# The REAL question: What is the server doing differently?
print('=== THE REAL QUESTION: WHAT IS THE SERVER DOING DIFFERENTLY? ===')
print()
print('The competition says submissions are evaluated using CV on the server.')
print('If the same CV procedure is used, local CV should match LB.')
print('The 10x gap suggests something is different.')
print()
print('Possibilities:')
print('1. Different random seeds -> different train/test splits')
print('2. Different data preprocessing')
print('3. Different CV scheme (GroupKFold vs Leave-One-Out)')
print('4. Additional held-out data not in public dataset')
print('5. Different weighting of single vs full data')
print()
print('The template notebook shows:')
print('- Leave-One-Solvent-Out for single solvent (24 folds)')
print('- Leave-One-Ramp-Out for full data (13 folds)')
print('- This matches our local CV')
print()
print('So the gap is likely due to:')
print('- Model variance (different random seeds)')
print('- Or additional held-out data')

In [None]:
# STRATEGIC DECISION: What to do with 3 remaining submissions?
print('=== STRATEGIC DECISION: 3 SUBMISSIONS REMAINING ===')
print()
print('Current best LB: 0.0877 (exp_030)')
print('Target: 0.0347')
print(f'Gap: best LB is {(0.0877 - 0.0347) / 0.0347 * 100:.1f}% above target')
print()
print('Options:')
print()
print('OPTION A: Submit best CV model (exp_035, CV 0.008194)')
print(f'  - Predicted LB: {slope:.2f} * 0.008194 + {intercept:.4f} = {slope * 0.008194 + intercept:.4f}')
print('  - Likely similar to exp_030 (LB 0.0877)')
print('  - Low risk, low reward')
print()
print('OPTION B: Submit simpler model (Spange only)')
print('  - exp_000 had best generalization residual')
print('  - Could have lower intercept')
print('  - Medium risk, medium reward')
print()
print('OPTION C: Try fundamentally different approach')
print('  - Domain adaptation, adversarial validation, etc.')
print('  - High risk, high reward')
print('  - But we have limited time to implement')
print()
print('RECOMMENDATION: Focus on approaches that could CHANGE the CV-LB relationship')
print('Not just minimize CV, but improve generalization')

In [None]:
# Final analysis: What's the most promising path forward?
print('=== MOST PROMISING PATH FORWARD ===')
print()
print('Given the constraints (3 submissions, limited time), the best strategy is:')
print()
print('1. ACCEPT that we cannot reach target (0.0347) with current approach')
print('   - Intercept (0.0533) > Target (0.0347)')
print('   - Even perfect CV would give LB ~0.0533')
print()
print('2. FOCUS on approaches that could change the CV-LB relationship:')
print('   a. Simpler features (Spange only) - exp_000 had best generalization')
print('   b. Different model family (pure GP, pure Ridge)')
print('   c. Uncertainty-weighted ensemble')
print()
print('3. DO NOT waste submissions on:')
print('   - Marginal CV improvements')
print('   - GNN approaches (consistently fail)')
print('   - Complex ensembles (add variance, not generalization)')
print()
print('4. SAVE at least 1 submission for final attempt')
print('   - Use 1-2 submissions to test different approaches')
print('   - Final submission should be the best-generalizing model')