# Loop 35 Analysis: Critical Decision Point

**Situation:**
- 35 experiments completed
- 11 submissions made, 2 remaining
- Best LB: 0.0877 (exp_030)
- Target: 0.0347
- Gap: 2.53x (153% worse)

**Latest Experiment (exp_037):**
- Similarity weighting: 220.92% WORSE (CV 0.026296 vs baseline 0.008194)
- FAILED due to implementation bug (unnormalized features, wrong sigma)

**Key Question:** What can we do with 2 submissions remaining?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# All submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
]

df = pd.DataFrame(submissions)
print('All submissions:')
print(df.to_string(index=False))

All submissions:
    exp     cv     lb
exp_000 0.0111 0.0982
exp_001 0.0123 0.1065
exp_003 0.0105 0.0972
exp_005 0.0104 0.0969
exp_006 0.0097 0.0946
exp_007 0.0093 0.0932
exp_009 0.0092 0.0936
exp_012 0.0090 0.0913
exp_024 0.0087 0.0893
exp_026 0.0085 0.0887
exp_030 0.0083 0.0877


In [2]:
# Linear regression on CV-LB relationship
cv_vals = df['cv'].values
lb_vals = df['lb'].values

# Fit LB as a linear function of CV across all submissions
slope, intercept, r_value, p_value, std_err = stats.linregress(cv_vals, lb_vals)

print('\n=== CV-LB Linear Fit ===')
print(f'LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'Slope: {slope:.2f}')
print(f'Intercept: {intercept:.4f}')

# What CV would we need to reach target?
target = 0.0347
required_cv = (target - intercept) / slope
print('\n=== Target Analysis ===')
print(f'Target LB: {target}')
print(f'Required CV to reach target: {required_cv:.4f}')

# A negative required CV means the fitted line never reaches the target
if required_cv < 0:
    print('\n⚠️ IMPOSSIBLE with current CV-LB relationship!')
    print(f'Intercept ({intercept:.4f}) > Target ({target})')
    print(f'Even with CV=0, predicted LB would be {intercept:.4f}')


=== CV-LB Linear Fit ===
LB = 4.30 * CV + 0.0524
R² = 0.9675
Slope: 4.30
Intercept: 0.0524

=== Target Analysis ===
Target LB: 0.0347
Required CV to reach target: -0.0041

⚠️ IMPOSSIBLE with current CV-LB relationship!
Intercept (0.0524) > Target (0.0347)
Even with CV=0, predicted LB would be 0.0524
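
The intercept is an extrapolation: the submitted CV values span only 0.0083 to 0.0123, so CV = 0 lies well outside the fitted range. The sketch below is a rough robustness check that is not part of the original analysis. It assumes a leave-one-out jackknife is an acceptable way to probe this, refits the line with each submission dropped in turn, and reports how far the intercept moves. The helper `ols_fit` is a hypothetical name introduced here, and the data are copied from the submission table above.

```python
# Jackknife check on the CV-LB intercept (illustrative; not from the original analysis).
cv = [0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093, 0.0092, 0.0090, 0.0087, 0.0085, 0.0083]
lb = [0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932, 0.0936, 0.0913, 0.0893, 0.0887, 0.0877]
target = 0.0347


def ols_fit(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    fitted_slope = sxy / sxx
    return fitted_slope, my - fitted_slope * mx


full_slope, full_intercept = ols_fit(cv, lb)

# Refit with each submission left out in turn and collect the intercepts.
jack_intercepts = []
for i in range(len(cv)):
    _, b = ols_fit(cv[:i] + cv[i + 1:], lb[:i] + lb[i + 1:])
    jack_intercepts.append(b)

print(f'Full-data intercept: {full_intercept:.4f}')
print(f'Jackknife intercept range: {min(jack_intercepts):.4f} to {max(jack_intercepts):.4f}')
print(f'Lowest jackknife intercept above target: {min(jack_intercepts) > target}')
```

If the lowest jackknife intercept stays above the target, the conclusion does not hinge on any single submission, although it still rests on the line remaining valid far below the observed CV range.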


In [3]:
# Calculate residuals and identify outliers
df['predicted_lb'] = slope * df['cv'] + intercept
df['residual'] = df['lb'] - df['predicted_lb']
df['residual_pct'] = df['residual'] / df['predicted_lb'] * 100

print('\n=== Residuals Analysis ===')
print(df[['exp', 'cv', 'lb', 'predicted_lb', 'residual', 'residual_pct']].to_string(index=False))

print(f'\nMean residual: {df["residual"].mean():.4f}')
print(f'Std residual: {df["residual"].std():.4f}')
# idxmin/idxmax return the row label, so look up the experiment name with .loc
print(f'\nBest residual (most negative): {df.loc[df["residual"].idxmin(), "exp"]} ({df["residual"].min():.4f})')
print(f'Worst residual (most positive): {df.loc[df["residual"].idxmax(), "exp"]} ({df["residual"].max():.4f})')


=== Residuals Analysis ===
    exp     cv     lb  predicted_lb  residual  residual_pct
exp_000 0.0111 0.0982      0.100199 -0.001999     -1.995362
exp_001 0.0123 0.1065      0.105364  0.001136      1.077855
exp_003 0.0105 0.0972      0.097617 -0.000417     -0.427023
exp_005 0.0104 0.0969      0.097186 -0.000286     -0.294724
exp_006 0.0097 0.0946      0.094174  0.000426      0.452863
exp_007 0.0093 0.0932      0.092452  0.000748      0.809220
exp_009 0.0092 0.0936      0.092021  0.001579      1.715420
exp_012 0.0090 0.0913      0.091161  0.000139      0.152901
exp_024 0.0087 0.0893      0.089869 -0.000569     -0.633551
exp_026 0.0085 0.0887      0.089009 -0.000309     -0.346638
exp_030 0.0083 0.0877      0.088148 -0.000448     -0.507905

Mean residual: -0.0000
Std residual: 0.0010

Best residual (most negative): exp_000 (-0.0020)
Worst residual (most positive): exp_009 (0.0016)
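
The residual spread puts the gap in perspective: the residual standard deviation is about 0.0010, while the distance from the best LB to the target is 0.0530. A minimal sketch of that comparison follows, using the residuals copied from the table above. The standardized residual (residual divided by the sample standard deviation) is an addition made here for illustration, not something the original analysis computes.

```python
# Compare the residual spread with the gap to target (values copied from the output above).
import statistics

residuals = {
    'exp_000': -0.001999, 'exp_001': 0.001136, 'exp_003': -0.000417,
    'exp_005': -0.000286, 'exp_006': 0.000426, 'exp_007': 0.000748,
    'exp_009': 0.001579, 'exp_012': 0.000139, 'exp_024': -0.000569,
    'exp_026': -0.000309, 'exp_030': -0.000448,
}
best_lb, target = 0.0877, 0.0347

# Sample standard deviation (ddof=1), the same convention as pandas' .std().
resid_std = statistics.stdev(list(residuals.values()))
gap = best_lb - target

# Standardized residuals: how unusual is each submission relative to the fit?
z_scores = {exp: r / resid_std for exp, r in residuals.items()}
most_extreme = max(z_scores, key=lambda e: abs(z_scores[e]))

print(f'Residual std: {resid_std:.4f}')
print(f'Gap to target: {gap:.4f} (about {gap / resid_std:.0f} residual stds)')
print(f'Most extreme residual: {most_extreme} (z = {z_scores[most_extreme]:+.2f})')
```

Because the gap is roughly fifty times the residual spread, choosing among existing approaches by their residuals cannot plausibly close it; the line itself has to move.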


In [4]:
# What approaches have we tried?
approaches = {
    'MLP architectures': ['exp_000 (baseline)', 'exp_004 (deep residual - FAILED)', 'exp_006 (simpler [64,32])', 'exp_007 (even simpler [32,16])'],
    'Feature engineering': ['exp_002 (DRFP with PCA - WORSE)', 'exp_003 (Spange+DRFP combined)', 'exp_027 (simple features - WORSE)'],
    'Ensemble methods': ['exp_005 (15 models)', 'exp_011 (diverse ensemble)', 'exp_012 (simple ensemble)', 'exp_028 (4-model ensemble - WORSE)'],
    'Loss functions': ['exp_026 (weighted loss)', 'exp_029 (normalization - FAILED)'],
    'GP models': ['exp_030 (GP+MLP+LGBM)', 'exp_031 (higher GP weight)', 'exp_032 (pure GP)', 'exp_035 (lower GP weight)'],
    'Distribution shift': ['exp_037 (similarity weighting - FAILED)'],
}

print('\n=== Approaches Tried ===')
for category, exps in approaches.items():
    print(f'\n{category}:')
    for exp in exps:
        print(f'  - {exp}')


=== Approaches Tried ===

MLP architectures:
  - exp_000 (baseline)
  - exp_004 (deep residual - FAILED)
  - exp_006 (simpler [64,32])
  - exp_007 (even simpler [32,16])

Feature engineering:
  - exp_002 (DRFP with PCA - WORSE)
  - exp_003 (Spange+DRFP combined)
  - exp_027 (simple features - WORSE)

Ensemble methods:
  - exp_005 (15 models)
  - exp_011 (diverse ensemble)
  - exp_012 (simple ensemble)
  - exp_028 (4-model ensemble - WORSE)

Loss functions:
  - exp_026 (weighted loss)
  - exp_029 (normalization - FAILED)

GP models:
  - exp_030 (GP+MLP+LGBM)
  - exp_031 (higher GP weight)
  - exp_032 (pure GP)
  - exp_035 (lower GP weight)

Distribution shift:
  - exp_037 (similarity weighting - FAILED)


In [5]:
# Key insight from Kaggle kernel: GroupKFold instead of Leave-One-Out
print('\n=== NEW INSIGHT FROM KAGGLE KERNELS ===')
print('''
The "mixall" kernel (8 votes) uses GroupKFold (5 splits) instead of Leave-One-Out CV.
This is a fundamentally different approach that might change the CV-LB relationship.

Key differences:
1. Leave-One-Out: 24 folds for single solvent, 13 folds for full data
2. GroupKFold(5): 5 folds for each, groups by solvent

Why this might help:
- Leave-One-Out CV is very sensitive to individual solvents
- GroupKFold provides more stable CV estimates
- May reduce overfitting to specific solvents

However, the competition rules state:
"Submissions will be evaluated according to a cross-validation procedure"
"The submission must have the same last three cells as in the notebook template"

This means we CANNOT change the CV scheme - it's fixed by the competition.
''')


=== NEW INSIGHT FROM KAGGLE KERNELS ===

The "mixall" kernel (8 votes) uses GroupKFold (5 splits) instead of Leave-One-Out CV.
This is a fundamentally different approach that might change the CV-LB relationship.

Key differences:
1. Leave-One-Out: 24 folds for single solvent, 13 folds for full data
2. GroupKFold(5): 5 folds for each, groups by solvent

Why this might help:
- Leave-One-Out CV is very sensitive to individual solvents
- GroupKFold provides more stable CV estimates
- May reduce overfitting to specific solvents

However, the competition rules state:
"Submissions will be evaluated according to a cross-validation procedure"
"The submission must have the same last three cells as in the notebook template"

This means we CANNOT change the CV scheme - it's fixed by the competition.
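
To make the difference between the two schemes concrete, the sketch below counts folds and held-out solvents for 24 solvents, matching the single-solvent fold count quoted above. It is illustrative only: the solvent names are placeholders, and `grouped_kfold` is a simplified round-robin stand-in written for this note, not scikit-learn's `GroupKFold`, which balances folds by sample count.

```python
# Contrast leave-one-solvent-out with a grouped 5-fold split (illustrative only).
# Solvent names are placeholders; 24 matches the single-solvent fold count quoted above.
solvents = [f'solvent_{i:02d}' for i in range(24)]


def leave_one_group_out(groups):
    """One fold per group: each fold holds out exactly one solvent."""
    return [[g] for g in groups]


def grouped_kfold(groups, n_splits=5):
    """Round-robin assignment of whole groups to folds.

    A simplified stand-in: scikit-learn's GroupKFold balances folds by
    sample count, which this sketch does not attempt.
    """
    folds = [[] for _ in range(n_splits)]
    for i, g in enumerate(groups):
        folds[i % n_splits].append(g)
    return folds


logo_folds = leave_one_group_out(solvents)
gkf_folds = grouped_kfold(solvents, n_splits=5)

print(f'Leave-one-out: {len(logo_folds)} folds, {len(logo_folds[0])} solvent held out per fold')
print(f'Grouped 5-fold: {len(gkf_folds)} folds, '
      f'held-out solvents per fold: {[len(f) for f in gkf_folds]}')
```

With grouped folds, each model trains on fewer solvents and is scored on several unseen ones at once, which is why its CV estimate can behave differently from leave-one-out even though the leaderboard procedure is fixed.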



In [6]:
# What's the gap between our best and the target?
print('\n=== GAP ANALYSIS ===')
best_lb = 0.0877
target = 0.0347

print(f'Best LB: {best_lb}')
print(f'Target: {target}')
print(f'Gap: {best_lb - target:.4f} ({(best_lb - target)/target*100:.1f}% worse)')
print(f'Ratio: {best_lb/target:.2f}x')

print('\n=== What would it take to reach target? ===')
print(f'Current CV-LB relationship: LB = {slope:.2f}*CV + {intercept:.4f}')
print(f'Intercept ({intercept:.4f}) > Target ({target})')
print('\nTo reach target, we need to:')
print('1. CHANGE the CV-LB relationship (reduce intercept)')
print('2. OR find an approach with fundamentally different CV-LB dynamics')
print('\nApproaches that might change the relationship:')
print('- Simpler models (reduce overfitting)')
print('- Different features (reduce distribution shift)')
print('- Domain adaptation (learn invariant representations)')
print('- Adversarial validation (identify and remove shift-causing features)')


=== GAP ANALYSIS ===
Best LB: 0.0877
Target: 0.0347
Gap: 0.0530 (152.7% worse)
Ratio: 2.53x

=== What would it take to reach target? ===
Current CV-LB relationship: LB = 4.30*CV + 0.0524
Intercept (0.0524) > Target (0.0347)

To reach target, we need to:
1. CHANGE the CV-LB relationship (reduce intercept)
2. OR find an approach with fundamentally different CV-LB dynamics

Approaches that might change the relationship:
- Simpler models (reduce overfitting)
- Different features (reduce distribution shift)
- Domain adaptation (learn invariant representations)
- Adversarial validation (identify and remove shift-causing features)
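
A short worked calculation shows how far the line would have to move. It uses the rounded fit reported above (slope 4.30, intercept 0.0524) and the best submitted CV of 0.0083; the variable names are introduced here and deliberately differ from `slope` and `intercept` so the fitted values in the notebook are not overwritten.

```python
# How far would the CV-LB line have to move? (coefficients rounded from the fit above)
slope_rounded, intercept_rounded = 4.30, 0.0524
target_lb = 0.0347
best_cv = 0.0083  # exp_030

# Contribution of the slope term alone at the current best CV.
slope_term = slope_rounded * best_cv

# Intercept needed to hit the target if slope and CV stay where they are.
required_intercept = target_lb - slope_term

# CV needed to hit the target if the intercept were driven all the way to zero.
required_cv_zero_intercept = target_lb / slope_rounded

print(f'Slope term at CV={best_cv}: {slope_term:.4f}')
print(f'Required intercept at current slope and CV: {required_intercept:.4f}')
print(f'Required CV with zero intercept: {required_cv_zero_intercept:.4f}')
```

At the current best CV the slope term alone slightly exceeds the target, so the required intercept comes out negative. Reducing the intercept is therefore necessary but not sufficient at this slope: the CV or the slope would have to improve as well.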


In [7]:
# What experiments have the best residuals (LB better than predicted)?
print('\n=== EXPERIMENTS WITH BEST RESIDUALS ===')
df_sorted = df.sort_values('residual')
print(df_sorted[['exp', 'cv', 'lb', 'predicted_lb', 'residual']].head(5).to_string(index=False))

print('\n=== EXPERIMENTS WITH WORST RESIDUALS ===')
print(df_sorted[['exp', 'cv', 'lb', 'predicted_lb', 'residual']].tail(5).to_string(index=False))


=== EXPERIMENTS WITH BEST RESIDUALS ===
    exp     cv     lb  predicted_lb  residual
exp_000 0.0111 0.0982      0.100199 -0.001999
exp_024 0.0087 0.0893      0.089869 -0.000569
exp_030 0.0083 0.0877      0.088148 -0.000448
exp_003 0.0105 0.0972      0.097617 -0.000417
exp_026 0.0085 0.0887      0.089009 -0.000309

=== EXPERIMENTS WITH WORST RESIDUALS ===
    exp     cv     lb  predicted_lb  residual
exp_012 0.0090 0.0913      0.091161  0.000139
exp_006 0.0097 0.0946      0.094174  0.000426
exp_007 0.0093 0.0932      0.092452  0.000748
exp_001 0.0123 0.1065      0.105364  0.001136
exp_009 0.0092 0.0936      0.092021  0.001579


In [8]:
# Final recommendation
print('\n' + '='*70)
print('FINAL RECOMMENDATION')
print('='*70)

print('''
**SITUATION:**
- 2 submissions remaining
- Best LB: 0.0877 (exp_030)
- Target: 0.0347 (2.53x gap)
- CV-LB relationship: LB = 4.30*CV + 0.0524 (R²=0.97)
- Intercept (0.0524) > Target (0.0347)

**CRITICAL INSIGHT:**
The CV-LB relationship has an intercept of 0.0524, which is LARGER than the target (0.0347).
This means even with perfect CV (0), we would still get LB ~0.0524.

**WHAT THIS MEANS:**
We cannot reach the target by improving CV alone. We need to fundamentally change
the CV-LB relationship.

**APPROACHES THAT MIGHT WORK:**
1. AGGRESSIVE SIMPLIFICATION: Use minimal features (only kinetics + 2-3 Spange features)
   - Rationale: Reduce overfitting to solvent-specific patterns
   - Risk: May sacrifice CV for better LB

2. FIXED SIMILARITY WEIGHTING: Normalize features, tune sigma properly
   - Rationale: The implementation was buggy, the concept might still work
   - Risk: May still not help if the problem is overfitting, not distribution shift

3. PURE RIDGE REGRESSION: Simplest possible model
   - Rationale: Maximum regularization, minimum overfitting
   - Risk: May be too simple to capture the signal

**RECOMMENDATION:**
With only 2 submissions remaining, we should:
1. Run ONE more experiment with aggressive simplification
2. Submit the new experiment if it looks promising (exp_030 is already submitted and remains the fallback)
3. Keep the final submission for the best model found

**DO NOT:**
- Submit exp_037 (similarity weighting) - it's 220% worse
- Try complex approaches - they increase overfitting
- Give up - the target may still be reachable if the CV-LB relationship changes
''')


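======================================================================
FINAL RECOMMENDATION
======================================================================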

**SITUATION:**
- 2 submissions remaining
- Best LB: 0.0877 (exp_030)
- Target: 0.0347 (2.53x gap)
- CV-LB relationship: LB = 4.30*CV + 0.0524 (R²=0.97)
- Intercept (0.0524) > Target (0.0347)

**CRITICAL INSIGHT:**
The CV-LB relationship has an intercept of 0.0524, which is LARGER than the target (0.0347).
This means even with perfect CV (0), we would still get LB ~0.0524.

**WHAT THIS MEANS:**
We cannot reach the target by improving CV alone. We need to fundamentally change
the CV-LB relationship.

**APPROACHES THAT MIGHT WORK:**
1. AGGRESSIVE SIMPLIFICATION: Use minimal features (only kinetics + 2-3 Spange features)
   - Rationale: Reduce overfitting to solvent-specific patterns
   - Risk: May sacrifice CV for better LB

2. FIXED SIMILARITY WEIGHTING: Normalize features, tune sigma properly
   - Rationale: The implementation was buggy, the concept might still work
   - Risk: May still not help if the problem is overfitting, not distribution shift

3. PURE RIDGE 
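
The notes attribute the exp_037 failure to unnormalized features and a wrong sigma. The sketch below shows one way those two issues could be addressed; it is not the exp_037 implementation. It assumes a Gaussian kernel on z-scored features with sigma set to the median distance (a common default, chosen here as an assumption). The function names and the toy feature values are hypothetical and do not come from the competition data.

```python
# Similarity weights on standardized features (illustrative sketch with toy numbers).
import math
import statistics


def standardize(rows):
    """Z-score each feature column so no single feature dominates the distance."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds


def similarity_weights(train_rows, test_row):
    """Gaussian weights w_i = exp(-d_i^2 / (2 * sigma^2)), normalized to sum to 1."""
    z_train, means, stds = standardize(train_rows)
    z_test = [(v - m) / s for v, m, s in zip(test_row, means, stds)]
    dists = [math.dist(r, z_test) for r in z_train]
    sigma = statistics.median(dists) or 1.0  # median heuristic (an assumption)
    raw = [math.exp(-d ** 2 / (2 * sigma ** 2)) for d in dists]
    total = sum(raw)
    return [w / total for w in raw]


# Toy descriptors on very different scales (placeholder values, not real solvent data).
train = [[0.1, 1000.0], [0.2, 1100.0], [0.9, 1050.0], [0.8, 5000.0]]
test = [0.15, 1020.0]

weights = similarity_weights(train, test)
for row, w in zip(train, weights):
    print(f'{row}: weight {w:.3f}')
```

Without the standardization step, the second feature's raw scale would dominate every distance, and a sigma chosen without regard to the distance scale can push all weights toward zero or toward uniform.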

In [9]:
# What's the best experiment we haven't submitted?
print('\n=== UNSUBMITTED EXPERIMENTS ===')

# Best-CV experiments that have not been submitted
best_cv_experiments = [
    ('exp_032', 0.008194, 'GP 0.15 + MLP 0.55 + LGBM 0.3'),
    ('exp_035', 0.008194, 'GP 0.15 + MLP 0.55 + LGBM 0.3 (same as exp_032)'),
    ('exp_036', 0.008194, 'No GP (MLP 0.65 + LGBM 0.35)'),
]

print('Best unsubmitted experiments:')
for exp, cv, desc in best_cv_experiments:
    predicted_lb = slope * cv + intercept
    print(f'{exp}: CV {cv:.6f}, Predicted LB {predicted_lb:.4f} - {desc}')

print('\nNote: exp_030 (submitted) has CV 0.0083, LB 0.0877')
print(f'exp_032/035/036 have CV 0.008194, predicted LB {slope * 0.008194 + intercept:.4f}')
print('\nThese are essentially the same as exp_030, so submitting them would not help.')


=== UNSUBMITTED EXPERIMENTS ===
Best unsubmitted experiments:
exp_032: CV 0.008194, Predicted LB 0.0877 - GP 0.15 + MLP 0.55 + LGBM 0.3
exp_035: CV 0.008194, Predicted LB 0.0877 - GP 0.15 + MLP 0.55 + LGBM 0.3 (same as exp_032)
exp_036: CV 0.008194, Predicted LB 0.0877 - No GP (MLP 0.65 + LGBM 0.35)

Note: exp_030 (submitted) has CV 0.0083, LB 0.0877
exp_032/035/036 have CV 0.008194, predicted LB 0.0877

These are essentially the same as exp_030, so submitting them would not help.
