# Loop 30 Analysis: Critical Strategy Review

**Current State:**
- Best CV: 0.008465 (exp_026)
- Best LB: 0.0887 (exp_026)
- Target: 0.01727
- Gap to target: 5.14x
- Submissions remaining: 3

**Key Insight from exp_029:**
The normalization constraint (SM+P2+P3=1) is WRONG. Actual target sums (SM+P2+P3):
- Single Solvent: mean=0.7955, range [0.0288, 1.0000]
- Full Data: mean=0.8035, range [0.0112, 1.1233]
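
A check of this kind can be sketched as follows. This is a minimal illustration on made-up rows, not the competition data: the column names `SM`, `P2`, `P3` follow the constraint quoted above, while the values and the `tolerance` threshold are assumptions chosen only to show the mechanics.

```python
import pandas as pd

# Synthetic rows for illustration only; the column names follow the
# SM+P2+P3 constraint quoted above, the values are made up.
targets = pd.DataFrame({
    "SM": [0.90, 0.40, 0.02, 0.60],
    "P2": [0.05, 0.30, 0.01, 0.35],
    "P3": [0.03, 0.10, 0.00, 0.17],
})

total = targets.sum(axis=1)   # per-row SM + P2 + P3
tolerance = 0.05              # hypothetical threshold for "sums to 1"
frac_off = (total.sub(1.0).abs() > tolerance).mean()

print(total.agg(["mean", "min", "max"]).round(4))
print(f"Rows with |sum - 1| > {tolerance}: {frac_off:.0%}")
```

If a large share of rows falls outside the tolerance, renormalizing predictions to sum to 1 pushes them away from the true targets rather than toward them.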

**Critical Discovery from Kernel Analysis:**
The `mixall` kernel uses **GroupKFold (5 splits)** instead of Leave-One-Out CV!
This could explain the CV-LB gap - our CV scheme may not match the LB evaluation.
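
To make the difference between the two schemes concrete, here is a small sketch on synthetic data. It assumes that our leave-one-out scheme holds out one group (for example, one solvent) at a time, which corresponds to scikit-learn's `LeaveOneGroupOut`; the group count, row count, and features are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Synthetic stand-in: 24 hypothetical groups with 5 rows each.
rng = np.random.default_rng(0)
n_groups, rows_per_group = 24, 5
groups = np.repeat(np.arange(n_groups), rows_per_group)
X = rng.normal(size=(groups.size, 3))
y = rng.normal(size=groups.size)


def describe(splitter, name):
    """Print the fold count and the number of groups held out per fold."""
    held_out = [np.unique(groups[test_idx]).size
                for _, test_idx in splitter.split(X, y, groups)]
    print(f"{name}: {len(held_out)} folds, "
          f"{min(held_out)}-{max(held_out)} groups held out per fold")
    return held_out


logo_sizes = describe(LeaveOneGroupOut(), "LeaveOneGroupOut")
gkf_sizes = describe(GroupKFold(n_splits=5), "GroupKFold(5)")
```

With `GroupKFold(5)` each model trains on roughly 80% of the groups and must predict several unseen groups at once, so its scores are typically more pessimistic than leave-one-group-out scores on the same data.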

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.011081, 'lb': 0.09816},
    {'exp': 'exp_001', 'cv': 0.012297, 'lb': 0.10649},
    {'exp': 'exp_003', 'cv': 0.010501, 'lb': 0.09719},
    {'exp': 'exp_005', 'cv': 0.010430, 'lb': 0.09691},
    {'exp': 'exp_006', 'cv': 0.009749, 'lb': 0.09457},
    {'exp': 'exp_007', 'cv': 0.009262, 'lb': 0.09316},
    {'exp': 'exp_009', 'cv': 0.009192, 'lb': 0.09364},
    {'exp': 'exp_012', 'cv': 0.009004, 'lb': 0.09134},
    {'exp': 'exp_024', 'cv': 0.008689, 'lb': 0.08929},
    {'exp': 'exp_026', 'cv': 0.008465, 'lb': 0.08870},
]

df = pd.DataFrame(submissions)
print('=== Submission History ===')
print(df.to_string(index=False))
print(f'\nTarget LB: 0.01727')
print(f'Best CV: {df["cv"].min():.6f}')
print(f'Best LB: {df["lb"].min():.5f}')
print(f'Gap to target: {df["lb"].min() / 0.01727:.2f}x')

=== Submission History ===
    exp       cv      lb
exp_000 0.011081 0.09816
exp_001 0.012297 0.10649
exp_003 0.010501 0.09719
exp_005 0.010430 0.09691
exp_006 0.009749 0.09457
exp_007 0.009262 0.09316
exp_009 0.009192 0.09364
exp_012 0.009004 0.09134
exp_024 0.008689 0.08929
exp_026 0.008465 0.08870

Target LB: 0.01727
Best CV: 0.008465
Best LB: 0.08870
Gap to target: 5.14x


In [2]:
# Analyze CV-LB relationship
cv = df['cv'].values
lb = df['lb'].values

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(cv, lb)
print(f'=== CV-LB Linear Fit ===')
print(f'LB = {slope:.4f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
target_lb = 0.01727
print(f'\nIntercept: {intercept:.4f}')
print(f'Target: {target_lb}')
# Only report the conclusion when the fitted numbers support it
if intercept > target_lb:
    print(f'\nCRITICAL: Intercept ({intercept:.4f}) > Target ({target_lb})')
    print('This means even with CV=0, predicted LB would be above target!')

# What CV would we need?
required_cv = (target_lb - intercept) / slope
print(f'\nRequired CV to hit target: {required_cv:.6f}')
if required_cv < 0:
    print('This is NEGATIVE - mathematically impossible with current relationship!')

=== CV-LB Linear Fit ===
LB = 4.2222 * CV + 0.0533
R² = 0.9618

Intercept: 0.0533
Target: 0.01727

CRITICAL: Intercept (0.0533) > Target (0.01727)
This means even with CV=0, predicted LB would be above target!

Required CV to hit target: -0.008530
This is NEGATIVE - mathematically impossible with current relationship!
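
The intercept is an extrapolation to CV=0, well outside the observed CV range, and it rests on only 10 points. A rough way to gauge how much to trust it is a confidence interval on the intercept. The sketch below reuses the submission history from above and assumes the usual linear-regression error model (independent, roughly normal residuals), which may not hold here; `intercept_stderr` requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Submission history copied from the table above
cv = np.array([0.011081, 0.012297, 0.010501, 0.010430, 0.009749,
               0.009262, 0.009192, 0.009004, 0.008689, 0.008465])
lb = np.array([0.09816, 0.10649, 0.09719, 0.09691, 0.09457,
               0.09316, 0.09364, 0.09134, 0.08929, 0.08870])
target_lb = 0.01727

res = stats.linregress(cv, lb)

# Two-sided 95% interval for the intercept, n - 2 degrees of freedom
t_crit = stats.t.ppf(0.975, df=cv.size - 2)
half_width = t_crit * res.intercept_stderr
ci_low, ci_high = res.intercept - half_width, res.intercept + half_width

print(f"Intercept: {res.intercept:.4f} (95% CI {ci_low:.4f} to {ci_high:.4f})")
print(f"Observed CV range: {cv.min():.6f} to {cv.max():.6f}")
print(f"Target below CI lower bound: {target_lb < ci_low}")
```

If the target sits below the lower bound, the conclusion does not hinge on noise in the fit, although it still depends on the linear relationship holding all the way down to CV=0.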


In [3]:
# Analyze the gap pattern
df['gap_ratio'] = df['lb'] / df['cv']
df['gap_additive'] = df['lb'] - df['cv']

print('=== Gap Analysis ===')
print(df[['exp', 'cv', 'lb', 'gap_ratio', 'gap_additive']].to_string(index=False))
print(f'\nMean gap ratio: {df["gap_ratio"].mean():.2f}x')
print(f'Mean additive gap: {df["gap_additive"].mean():.4f}')
print(f'\nThe additive gap is relatively stable (~0.085, range 0.080-0.094)')
print('This suggests a systematic bias, not random variance')

=== Gap Analysis ===
    exp       cv      lb  gap_ratio  gap_additive
exp_000 0.011081 0.09816   8.858406      0.087079
exp_001 0.012297 0.10649   8.659836      0.094193
exp_003 0.010501 0.09719   9.255309      0.086689
exp_005 0.010430 0.09691   9.291467      0.086480
exp_006 0.009749 0.09457   9.700482      0.084821
exp_007 0.009262 0.09316  10.058303      0.083898
exp_009 0.009192 0.09364  10.187119      0.084448
exp_012 0.009004 0.09134  10.144380      0.082336
exp_024 0.008689 0.08929  10.276211      0.080601
exp_026 0.008465 0.08870  10.478441      0.080235

Mean gap ratio: 9.69x
Mean additive gap: 0.0851

The additive gap is relatively stable (~0.085, range 0.080-0.094)
This suggests a systematic bias, not random variance


In [4]:
# Key insight: Our CV is BETTER than the top LB!
print('=== CRITICAL INSIGHT ===')
print(f'Our best CV: 0.008465')
print(f'Top LB (target): 0.01727')
print(f'\nOur CV is {0.01727 / 0.008465:.2f}x BETTER than the target LB!')
print('\nThis means:')
print('1. Our model is actually very good at the CV task')
print('2. The CV-LB gap is the problem, not model quality')
print('3. We need to change something fundamental about our approach')
print('\nPossible explanations:')
print('1. Different CV scheme (GroupKFold vs Leave-One-Out)')
print('2. Different evaluation metric on LB')
print('3. Distribution shift between CV and LB data')
print('4. Our model is overfitting to the specific CV fold structure')

=== CRITICAL INSIGHT ===
Our best CV: 0.008465
Top LB (target): 0.01727

Our CV is 2.04x BETTER than the target LB!

This means:
1. Our model is actually very good at the CV task
2. The CV-LB gap is the problem, not model quality
3. We need to change something fundamental about our approach

Possible explanations:
1. Different CV scheme (GroupKFold vs Leave-One-Out)
2. Different evaluation metric on LB
3. Distribution shift between CV and LB data
4. Our model is overfitting to the specific CV fold structure


In [5]:
# What approaches have we tried?
print('=== APPROACHES TRIED ===')
approaches = [
    ('Baseline MLP', 'exp_000', 0.011081, 'Arrhenius kinetics + Spange'),
    ('LightGBM', 'exp_001', 0.012297, 'Tree-based alternative'),
    ('DRFP features', 'exp_002', 0.016948, 'Molecular fingerprints'),
    ('Combined features', 'exp_003', 0.010501, 'Spange + DRFP'),
    ('Deep Residual MLP', 'exp_004', 0.051912, 'FAILED - too complex'),
    ('Large ensemble (15)', 'exp_005', 0.010430, 'More models'),
    ('Simpler model [64,32]', 'exp_006', 0.009749, 'Reduced complexity'),
    ('Even simpler [32,16]', 'exp_007', 0.009262, 'Further reduction'),
    ('Ridge regression', 'exp_009', 0.009192, 'Linear baseline'),
    ('Simple ensemble', 'exp_012', 0.009004, 'MLP + LGBM'),
    ('ACS PCA features', 'exp_022', 0.008601, 'New feature set'),
    ('Weighted loss', 'exp_026', 0.008465, 'SM weight 2x'),
    ('Four-model ensemble', 'exp_028', 0.008674, 'Added XGB, CatBoost'),
    ('Normalization', 'exp_029', 0.016180, 'FAILED - wrong constraint'),
]

for name, exp, cv, notes in approaches:
    print(f'{exp}: {name} | CV: {cv:.6f} | {notes}')

=== APPROACHES TRIED ===
exp_000: Baseline MLP | CV: 0.011081 | Arrhenius kinetics + Spange
exp_001: LightGBM | CV: 0.012297 | Tree-based alternative
exp_002: DRFP features | CV: 0.016948 | Molecular fingerprints
exp_003: Combined features | CV: 0.010501 | Spange + DRFP
exp_004: Deep Residual MLP | CV: 0.051912 | FAILED - too complex
exp_005: Large ensemble (15) | CV: 0.010430 | More models
exp_006: Simpler model [64,32] | CV: 0.009749 | Reduced complexity
exp_007: Even simpler [32,16] | CV: 0.009262 | Further reduction
exp_009: Ridge regression | CV: 0.009192 | Linear baseline
exp_012: Simple ensemble | CV: 0.009004 | MLP + LGBM
exp_022: ACS PCA features | CV: 0.008601 | New feature set
exp_026: Weighted loss | CV: 0.008465 | SM weight 2x
exp_028: Four-model ensemble | CV: 0.008674 | Added XGB, CatBoost
exp_029: Normalization | CV: 0.016180 | FAILED - wrong constraint


In [6]:
# What HASN'T been tried?
print('=== UNEXPLORED APPROACHES ===')
print('\n1. Gaussian Process Regression (GP)')
print('   - Mentioned in competition description')
print('   - Works well with small datasets')
print('   - Different inductive bias than NNs')
print('   - May have different CV-LB relationship')
print('\n2. Different CV scheme (GroupKFold)')
print('   - The "mixall" kernel uses GroupKFold(5) instead of LOO')
print('   - This may better match the LB evaluation')
print('   - Could explain the CV-LB gap')
print('\n3. Aggressive feature selection')
print('   - Current: 145 features')
print('   - Try: Top 20-30 by importance')
print('   - May reduce overfitting')
print('\n4. Simpler linear models')
print('   - Ridge/Lasso with fewer features')
print('   - May generalize better')
print('\n5. Multi-task GP')
print('   - Explicitly mentioned in competition description')
print('   - "imputing any missing values using a multi-task GP"')

=== UNEXPLORED APPROACHES ===

1. Gaussian Process Regression (GP)
   - Mentioned in competition description
   - Works well with small datasets
   - Different inductive bias than NNs
   - May have different CV-LB relationship

2. Different CV scheme (GroupKFold)
   - The "mixall" kernel uses GroupKFold(5) instead of LOO
   - This may better match the LB evaluation
   - Could explain the CV-LB gap

3. Aggressive feature selection
   - Current: 145 features
   - Try: Top 20-30 by importance
   - May reduce overfitting

4. Simpler linear models
   - Ridge/Lasso with fewer features
   - May generalize better

5. Multi-task GP
   - Explicitly mentioned in competition description
   - "imputing any missing values using a multi-task GP"


In [7]:
# Strategy recommendation
print('=== RECOMMENDED STRATEGY ===')
print('\nPRIORITY 1: Try Gaussian Process Regression')
print('- Competition explicitly mentions GPs')
print('- Different inductive bias may break CV-LB pattern')
print('- Works well with small datasets')
print('- Implementation: sklearn.gaussian_process.GaussianProcessRegressor')
print('\nPRIORITY 2: Try simpler features + Ridge regression')
print('- Reduce from 145 to ~20-30 features')
print('- Use feature importance from LightGBM')
print('- May reduce overfitting and improve generalization')
print('\nPRIORITY 3: Ensemble GP + MLP + LGBM')
print('- Combine different model types')
print('- GP provides different predictions')
print('- May improve diversity')
print('\nDO NOT TRY:')
print("- Normalization constraints (targets don't sum to 1)")
print('- More complex architectures (already failed)')
print('- More models in ensemble (diminishing returns)')

=== RECOMMENDED STRATEGY ===

PRIORITY 1: Try Gaussian Process Regression
- Competition explicitly mentions GPs
- Different inductive bias may break CV-LB pattern
- Works well with small datasets
- Implementation: sklearn.gaussian_process.GaussianProcessRegressor

PRIORITY 2: Try simpler features + Ridge regression
- Reduce from 145 to ~20-30 features
- Use feature importance from LightGBM
- May reduce overfitting and improve generalization

PRIORITY 3: Ensemble GP + MLP + LGBM
- Combine different model types
- GP provides different predictions
- May improve diversity

DO NOT TRY:
- Normalization constraints (targets don't sum to 1)
- More complex architectures (already failed)
- More models in ensemble (diminishing returns)
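
As a starting point for Priority 1, here is a minimal `GaussianProcessRegressor` sketch. Everything about the data is an assumption: the features and target are synthetic, and the kernel (constant × anisotropic RBF + white noise) is one common default rather than a tuned choice for this competition.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small tabular dataset (60 rows, 4 features)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(60, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=60)
X_train, X_test, y_train, y_test = X[:50], X[50:], y[:50], y[50:]

# GPs are sensitive to feature scale, so standardize using training rows only
scaler = StandardScaler().fit(X_train)

# One length scale per feature; WhiteKernel absorbs observation noise
kernel = (ConstantKernel(1.0) * RBF(length_scale=np.ones(X.shape[1]))
          + WhiteKernel(noise_level=0.01))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                              n_restarts_optimizer=2, random_state=0)
gp.fit(scaler.transform(X_train), y_train)

# Predictive mean plus a per-row uncertainty estimate
mean, std = gp.predict(scaler.transform(X_test), return_std=True)
rmse = float(np.sqrt(np.mean((mean - y_test) ** 2)))
print(f"Test RMSE: {rmse:.4f}, mean predictive std: {std.mean():.4f}")
```

The predictive standard deviation is the main thing a GP offers that the MLP and LightGBM models do not: large values on held-out groups would flag rows where the model is extrapolating.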


In [8]:
# Final summary
print('=== LOOP 30 SUMMARY ===')
print('\nexp_029 (Normalization) FAILED: CV 91% worse than exp_026')
print('Key insight: Targets do NOT sum to 1.0 (mean ~0.80)')
print('\nCurrent best: exp_026 (CV 0.008465, LB 0.0887)')
print('Target: 0.01727 (5.14x gap)')
print('\nThe CV-LB gap is the fundamental problem.')
print('Our CV is already 2x better than the target LB!')
print('\nNext steps:')
print('1. Try Gaussian Process Regression (different inductive bias)')
print('2. Try aggressive feature selection (reduce overfitting)')
print('3. Consider different CV scheme (GroupKFold vs LOO)')
print('\nSubmissions remaining: 3')
print('Only submit if we see fundamentally different behavior.')

=== LOOP 30 SUMMARY ===

exp_029 (Normalization) FAILED: CV 91% worse than exp_026
Key insight: Targets do NOT sum to 1.0 (mean ~0.80)

Current best: exp_026 (CV 0.008465, LB 0.0887)
Target: 0.01727 (5.14x gap)

The CV-LB gap is the fundamental problem.
Our CV is already 2x better than the target LB!

Next steps:
1. Try Gaussian Process Regression (different inductive bias)
2. Try aggressive feature selection (reduce overfitting)
3. Consider different CV scheme (GroupKFold vs LOO)

Submissions remaining: 3
Only submit if we see fundamentally different behavior.
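
Next steps 2 and 3 can be combined in one experiment. The sketch below is a rough outline on synthetic data, with two stated substitutions: `SelectKBest(f_regression)` stands in for LightGBM feature importance, and mean squared error stands in for the competition metric. The group labels are hypothetical. Selection sits inside the pipeline so that it is refit on each training fold and never sees the held-out groups.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 145 features, only the first 5 carry signal
rng = np.random.default_rng(0)
n_rows, n_features = 120, 145
X = rng.normal(size=(n_rows, n_features))
weights = np.array([1.0, -0.5, 0.3, 0.2, -0.1])
y = X[:, :5] @ weights + rng.normal(scale=0.1, size=n_rows)
groups = np.repeat(np.arange(24), 5)  # hypothetical group labels

# Scaling and selection are refit per fold because they live in the pipeline
model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_regression, k=25),
    Ridge(alpha=1.0),
)

scores = cross_val_score(model, X, y, groups=groups,
                         cv=GroupKFold(n_splits=5),
                         scoring="neg_mean_squared_error")
mse = -scores
print(f"GroupKFold(5) MSE per fold: {np.round(mse, 4)}")
print(f"Mean MSE: {mse.mean():.4f}")
```

Selecting features on the full dataset before cross-validating would leak information from the held-out folds and make CV look better than it is, which is the opposite of what we need given the current CV-LB gap.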
