# Loop 38 Analysis: k-NN Failed - What Next?

**Current State:**
- Best CV: 0.008194 (exp_032)
- Best LB: 0.0877 (exp_030)
- Target: 0.0347
- CV-LB relationship: LB = 4.27×CV + 0.0527 (R²=0.967)

**Latest Experiment (exp_040):**
- k-NN with k=5, distance-weighted
- CV: 0.026414 (222% WORSE than best)
- k-NN is NOT suitable for this problem

**Key Insight from Public Kernels:**
The 'mixall' kernel uses GroupKFold(5) instead of leave-one-out CV. If the official evaluation uses a similar scheme, the mismatch with our leave-one-out CV could explain part of the CV-LB gap.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# All submissions data
submissions = [
    {'exp': 'exp_000', 'cv': 0.011081, 'lb': 0.09816},
    {'exp': 'exp_001', 'cv': 0.012297, 'lb': 0.10649},
    {'exp': 'exp_003', 'cv': 0.010501, 'lb': 0.09719},
    {'exp': 'exp_005', 'cv': 0.010430, 'lb': 0.09691},
    {'exp': 'exp_006', 'cv': 0.009749, 'lb': 0.09457},
    {'exp': 'exp_007', 'cv': 0.009262, 'lb': 0.09316},
    {'exp': 'exp_009', 'cv': 0.009192, 'lb': 0.09364},
    {'exp': 'exp_012', 'cv': 0.009004, 'lb': 0.09134},
    {'exp': 'exp_024', 'cv': 0.008689, 'lb': 0.08929},
    {'exp': 'exp_026', 'cv': 0.008465, 'lb': 0.08875},
    {'exp': 'exp_030', 'cv': 0.008298, 'lb': 0.08772},
]

df = pd.DataFrame(submissions)
print(f'Total submissions: {len(df)}')
print(df)

In [None]:
# Fit linear regression of LB on CV
TARGET_LB = 0.0347
slope, intercept, r_value, p_value, std_err = stats.linregress(df['cv'], df['lb'])
print('\nCV-LB Relationship:')
print(f'  LB = {slope:.2f} × CV + {intercept:.4f}')
print(f'  R² = {r_value**2:.4f}')
print(f'  Intercept = {intercept:.4f}')
print(f'  Target = {TARGET_LB}')
print(f'  Intercept / Target = {intercept / TARGET_LB:.2f}x')

# What CV would we need to reach target?
cv_needed = (TARGET_LB - intercept) / slope
print(f'\nTo reach target LB = {TARGET_LB}:')
print(f'  CV needed = {cv_needed:.6f}')
if cv_needed < 0:
    print('  IMPOSSIBLE with current approach (would need negative CV)')
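
The "impossible" conclusion rests on a point estimate of the intercept, extrapolated well below the observed CV range. A minimal sketch of a 95% confidence interval for that intercept follows, using the standard simple-linear-regression formula. It assumes roughly normal, independent residuals, which 11 submissions cannot really confirm, so treat the interval as a rough guide.

```python
import numpy as np
from scipy import stats

# CV and LB scores copied from the submissions table above
cv = np.array([0.011081, 0.012297, 0.010501, 0.010430, 0.009749, 0.009262,
               0.009192, 0.009004, 0.008689, 0.008465, 0.008298])
lb = np.array([0.09816, 0.10649, 0.09719, 0.09691, 0.09457, 0.09316,
               0.09364, 0.09134, 0.08929, 0.08875, 0.08772])
TARGET_LB = 0.0347

n = len(cv)
slope, intercept = np.polyfit(cv, lb, 1)  # coefficients, highest degree first

# Residual standard deviation with n - 2 degrees of freedom
residuals = lb - (slope * cv + intercept)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Standard error of the intercept for simple linear regression
sxx = np.sum((cv - cv.mean()) ** 2)
se_intercept = s * np.sqrt(1.0 / n + cv.mean() ** 2 / sxx)

# Two-sided 95% interval from the t distribution
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_low = intercept - t_crit * se_intercept
ci_high = intercept + t_crit * se_intercept

print(f'Intercept 95% CI: [{ci_low:.4f}, {ci_high:.4f}]')
print(f'Target below the whole interval: {TARGET_LB < ci_low}')
```

If the target sits below the lower bound, the conclusion does not hinge on noise in the fit, although it still depends on the line remaining valid outside the observed CV range.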

In [None]:
# Key insight: The 'mixall' kernel uses GroupKFold(5) instead of leave-one-out
# This might explain the CV-LB gap

print('=== KEY INSIGHT FROM PUBLIC KERNELS ===')
print()
print('The "mixall" kernel uses GroupKFold(5) instead of leave-one-out CV.')
print('This is a significant change to the validation strategy.')
print()
print('Possible explanations for CV-LB gap:')
print('1. The evaluation uses a different CV procedure than leave-one-out')
print('2. There is additional test data not in our training set')
print('3. The evaluation uses a different random seed')
print()
print('If the evaluation uses GroupKFold(5), our leave-one-out CV might be')
print('overly optimistic (each fold holds out far less data; LB is ~10x CV).')
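
A quick way to see how much the validation scheme alone moves the score is to run the same model under both splitters. The sketch below does that on synthetic stand-in data. The group structure, the `Ridge` model, and all sizes are assumptions for illustration, and it assumes "leave-one-out" here means leaving out one group (solvent) at a time.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 hypothetical "solvents" with 30 rows each
n_groups, rows_per_group, n_features = 20, 30, 8
groups = np.repeat(np.arange(n_groups), rows_per_group)
group_effect = rng.normal(size=(n_groups, n_features))
X = group_effect[groups] + 0.3 * rng.normal(size=(len(groups), n_features))
y = X @ rng.normal(size=n_features) + 0.1 * rng.normal(size=len(groups))


def mean_cv_mse(splitter):
    """Return (mean held-out MSE, number of folds) for a group-aware splitter."""
    scores = cross_val_score(
        Ridge(alpha=1.0), X, y, groups=groups, cv=splitter,
        scoring='neg_mean_squared_error',
    )
    return -scores.mean(), len(scores)


logo_mse, logo_folds = mean_cv_mse(LeaveOneGroupOut())
gkf_mse, gkf_folds = mean_cv_mse(GroupKFold(n_splits=5))

print(f'LeaveOneGroupOut: {logo_folds} folds, MSE = {logo_mse:.5f}')
print(f'GroupKFold(5):    {gkf_folds} folds, MSE = {gkf_mse:.5f}')
```

On the real data, replace `X`, `y`, and `groups` with the competition features, targets, and solvent identifiers, and keep the model fixed so that only the splitter changes.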

In [None]:
# Analyze what approaches have been tried
print('=== APPROACHES TRIED ===')
print()
approaches = [
    ('MLP (baseline)', 'exp_000', 0.011081, 'Works well'),
    ('LightGBM', 'exp_001', 0.012297, 'Slightly worse than MLP'),
    ('DRFP + PCA', 'exp_002', 0.016948, 'Much worse'),
    ('Spange + DRFP combined', 'exp_003', 0.010501, 'Better than baseline'),
    ('Deep Residual MLP', 'exp_004', 0.051912, 'FAILED - too complex'),
    ('Large Ensemble (15 models)', 'exp_005', 0.010430, 'Marginal improvement'),
    ('Simpler MLP [64, 32]', 'exp_006', 0.009749, 'BETTER - simpler is better'),
    ('Even Simpler [32, 16]', 'exp_008', 0.009262, 'BETTER'),
    ('Ridge Regression', 'exp_009', 0.009192, 'Comparable to MLP'),
    ('MLP + LGBM ensemble', 'exp_012', 0.009004, 'BETTER'),
    ('ACS PCA features', 'exp_024', 0.008689, 'BETTER'),
    ('Weighted loss', 'exp_026', 0.008465, 'BETTER'),
    ('GP + MLP + LGBM', 'exp_030', 0.008298, 'BEST LB'),
    ('Higher GP weight', 'exp_031', 0.009174, 'WORSE'),
    ('Pure GP', 'exp_032', 0.008194, 'BEST CV'),
    ('Feature selection', 'exp_036', 0.009573, 'WORSE'),
    ('k-NN', 'exp_040', 0.026414, 'MUCH WORSE'),
]

for name, exp, cv, result in approaches:
    print(f'{exp}: {name}')
    print(f'  CV={cv:.6f} - {result}')
    print()

In [None]:
# What's the best path forward?
print('=== STRATEGIC OPTIONS ===')
print()
print('OPTION 1: Try GroupKFold(5) locally')
print('  - If CV scores change significantly, this might explain the CV-LB gap')
print('  - Could reveal that our leave-one-out CV is overly optimistic')
print()
print('OPTION 2: Submit exp_032 (best CV)')
print('  - CV: 0.008194 (best)')
print('  - Predicted LB: 4.27 × 0.008194 + 0.0527 = 0.0877')
print('  - This is the same as exp_030 LB, so unlikely to improve')
print()
print('OPTION 3: Try a completely different approach')
print('  - Solvent clustering + per-cluster models')
print('  - Adversarial validation to identify distribution shift')
print('  - Meta-learning / MAML')
print()
print('OPTION 4: Focus on reducing the intercept')
print('  - The intercept (0.0527) is the bottleneck')
print('  - Need to find an approach that has a lower intercept')
print('  - This requires understanding WHY the intercept exists')
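
Adversarial validation from Option 3 can be sketched as follows: train a classifier to separate training rows from test rows and read its cross-validated AUC. An AUC near 0.5 suggests similar distributions, while an AUC near 1.0 suggests a shift. The helper name `adversarial_auc`, the logistic-regression classifier, and the synthetic arrays are illustrative assumptions. The sketch also assumes test features are available, which this notebook does not confirm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def adversarial_auc(X_train, X_test, seed=0):
    """AUC of a classifier that tries to tell training rows from test rows."""
    X = np.vstack([X_train, X_test])
    is_test = np.concatenate([
        np.zeros(len(X_train), dtype=int),
        np.ones(len(X_test), dtype=int),
    ])
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # Out-of-fold probabilities, so the AUC is not inflated by overfitting
    proba = cross_val_predict(
        LogisticRegression(max_iter=1000), X, is_test,
        cv=folds, method='predict_proba',
    )[:, 1]
    return roc_auc_score(is_test, proba)


rng = np.random.default_rng(0)
train_features = rng.normal(size=(300, 5))          # hypothetical training features
same_dist = rng.normal(size=(300, 5))               # test set with no shift
shifted = rng.normal(loc=1.0, size=(300, 5))        # test set with a mean shift

auc_same = adversarial_auc(train_features, same_dist)
auc_shifted = adversarial_auc(train_features, shifted)
print(f'AUC without shift: {auc_same:.3f}')
print(f'AUC with shift:    {auc_shifted:.3f}')
```

A linear classifier only detects shifts that are linearly separable. A tree-based classifier may be a better choice for real solvent descriptors.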

In [None]:
# The key question: What causes the intercept?
print('=== WHAT CAUSES THE INTERCEPT? ===')
print()
print('The intercept (0.0527) is the extrapolated LB score at CV = 0.')
print('CV = 0 is unreachable in practice, but the intercept tells us something important:')
print()
print('Possible causes:')
print('1. Distribution shift between train and test solvents')
print('   - The test solvents are fundamentally different from training solvents')
print("   - Our models learn patterns that don't generalize")
print()
print('2. Different CV procedure in evaluation')
print('   - If evaluation uses GroupKFold(5), our leave-one-out CV is different')
print('   - The intercept might be an artifact of this mismatch')
print()
print('3. Additional test data not in our training set')
print("   - The evaluation might include solvents we haven't seen")
print("   - Our models can't extrapolate to these new solvents")
print()
print('4. Overfitting to the training distribution')
print('   - Our models are too specialized to the training solvents')
print('   - Need more regularization or simpler models')

In [None]:
# Recommendation
print('=== RECOMMENDATION ===')
print()
print('Given the current state:')
print('- Best CV: 0.008194 (exp_032)')
print('- Best LB: 0.0877 (exp_030)')
print('- Target: 0.0347')
print('- Gap: 2.53x')
print('- Submissions remaining: 5')
print()
print('The CV-LB relationship is highly linear (R² = 0.967).')
print('All approaches follow the same pattern.')
print('If this trend holds, the intercept (0.0527) > target (0.0347) means we CANNOT')
print('reach target with the current approach, no matter how much we improve CV.')
print()
print('PRIORITY 1: Try GroupKFold(5) locally')
print('  - This is a quick experiment that could reveal the CV-LB gap cause')
print('  - If CV scores change significantly, we might have found the issue')
print()
print('PRIORITY 2: Try a fundamentally different approach')
print('  - Solvent clustering + per-cluster models')
print('  - Domain adaptation techniques')
print('  - Meta-learning / MAML')
print()
print('PRIORITY 3: Submit exp_032 (best CV) to verify CV-LB relationship')
print('  - This uses 1 submission but gives us more data points')
print('  - Could reveal if the relationship has changed')
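
To judge whether a new submission shows that the relationship "has changed", it helps to know how far a single LB score can stray from the fitted line by chance. The sketch below computes a 95% prediction interval for a new LB score at a given CV, under the usual linear-regression assumptions. The helper name `lb_prediction_interval` is hypothetical.

```python
import numpy as np
from scipy import stats

# CV and LB scores copied from the submissions table above
cv = np.array([0.011081, 0.012297, 0.010501, 0.010430, 0.009749, 0.009262,
               0.009192, 0.009004, 0.008689, 0.008465, 0.008298])
lb = np.array([0.09816, 0.10649, 0.09719, 0.09691, 0.09457, 0.09316,
               0.09364, 0.09134, 0.08929, 0.08875, 0.08772])


def lb_prediction_interval(cv_new, level=0.95):
    """Return (low, center, high) for the LB score of one new submission."""
    n = len(cv)
    slope, intercept = np.polyfit(cv, lb, 1)
    residuals = lb - (slope * cv + intercept)
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))
    sxx = np.sum((cv - cv.mean()) ** 2)
    # The leading 1 accounts for noise in the new observation itself
    se_pred = s * np.sqrt(1.0 + 1.0 / n + (cv_new - cv.mean()) ** 2 / sxx)
    t_crit = stats.t.ppf(0.5 + level / 2.0, df=n - 2)
    center = slope * cv_new + intercept
    return center - t_crit * se_pred, center, center + t_crit * se_pred


low, center, high = lb_prediction_interval(0.008194)  # exp_032 CV
print(f'Predicted LB for exp_032: {center:.4f} (95% PI: [{low:.4f}, {high:.4f}])')
```

An LB score for exp_032 that lands outside this interval would be evidence that the submission does not follow the existing trend. A score inside it adds a data point but little new information.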