# Loop 67 Analysis: Strategic Assessment

## Key Problem
The CV-LB relationship is: LB = 4.24 × CV + 0.0532 (R² = 0.98)

**CRITICAL**: Intercept (0.0532) > Target (0.0347)

This means that even at CV = 0 (unattainable in practice), the fitted line predicts LB = 0.0532, which is still above the target.
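
Solving the fitted line for the CV that would hit the target makes the same point numerically. Using the coefficients quoted above:

$$
\mathrm{CV}^{*} = \frac{\mathrm{LB}_{\text{target}} - \text{intercept}}{\text{slope}} = \frac{0.0347 - 0.0532}{4.24} \approx -0.0044
$$

A negative CV error is not achievable, so no point on this line reaches the target.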

## Questions to Investigate
1. What approaches have been tried that might change the CV-LB relationship?
2. Are there any experiments with unusually good CV-LB relationships?
3. What fundamentally different approaches remain untried?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Submission history
submissions = [
    ('exp_000', 0.0111, 0.0982),
    ('exp_001', 0.0123, 0.1065),
    ('exp_003', 0.0105, 0.0972),
    ('exp_005', 0.0104, 0.0969),
    ('exp_006', 0.0097, 0.0946),
    ('exp_007', 0.0093, 0.0932),
    ('exp_009', 0.0092, 0.0936),
    ('exp_012', 0.0090, 0.0913),
    ('exp_024', 0.0087, 0.0893),
    ('exp_026', 0.0085, 0.0887),
    ('exp_030', 0.0083, 0.0877),
    ('exp_041', 0.0090, 0.0932),
    ('exp_042', 0.0145, 0.1147),
    ('exp_032', 0.0082, 0.0873),
]

exp_ids = [s[0] for s in submissions]
cvs = np.array([s[1] for s in submissions])
lbs = np.array([s[2] for s in submissions])

target = 0.0347  # target LB score

print('=== CV-LB Relationship Analysis ===')
# Ordinary least-squares fit of LB on CV across all submitted experiments
slope, intercept, r_value, p_value, std_err = stats.linregress(cvs, lbs)
print(f'LB = {slope:.2f} × CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'Intercept: {intercept:.4f}')
print(f'Target: {target}')
print(f'Gap: {intercept - target:.4f}')
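
Every submitted CV lies between 0.0082 and 0.0145, so the intercept is an extrapolation well outside the observed range. The following is a minimal sketch, not part of the original analysis, of one way to check how stable that intercept is: refit the line with each submission left out in turn. It re-declares the submission history so that it runs standalone; `cv_lb` and `loo_intercepts` are names introduced here for illustration.

```python
import numpy as np

# (CV, LB) pairs copied from the submission history above
cv_lb = np.array([
    (0.0111, 0.0982), (0.0123, 0.1065), (0.0105, 0.0972), (0.0104, 0.0969),
    (0.0097, 0.0946), (0.0093, 0.0932), (0.0092, 0.0936), (0.0090, 0.0913),
    (0.0087, 0.0893), (0.0085, 0.0887), (0.0083, 0.0877), (0.0090, 0.0932),
    (0.0145, 0.1147), (0.0082, 0.0873),
])
cv, lb = cv_lb[:, 0], cv_lb[:, 1]
target = 0.0347

# Refit LB = slope * CV + intercept with one submission removed each time
loo_intercepts = []
for i in range(len(cv)):
    keep = np.arange(len(cv)) != i
    slope_i, intercept_i = np.polyfit(cv[keep], lb[keep], 1)  # [slope, intercept]
    loo_intercepts.append(intercept_i)
loo_intercepts = np.array(loo_intercepts)

print(f'Leave-one-out intercept range: {loo_intercepts.min():.4f} to {loo_intercepts.max():.4f}')
print(f'Fits with intercept above target: {(loo_intercepts > target).sum()} of {len(cv)}')
```

If the intercept stays above the target whichever submission is dropped, the conclusion does not hinge on a single high-leverage point such as exp_042.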

In [None]:
# Calculate residuals from the CV-LB relationship
predicted_lbs = slope * cvs + intercept
residuals = lbs - predicted_lbs

print('=== Residuals from CV-LB Relationship ===')
for exp_id, cv, lb, pred, res in zip(exp_ids, cvs, lbs, predicted_lbs, residuals):
    # Flag submissions whose LB is more than 0.001 away from the fitted line
    status = '✓ BETTER' if res < -0.001 else ('✗ WORSE' if res > 0.001 else '~ EXPECTED')
    print(f'{exp_id}: CV={cv:.4f}, LB={lb:.4f}, Predicted={pred:.4f}, Residual={res:+.4f} {status}')

print(f'\nBest residual: {exp_ids[np.argmin(residuals)]} with {residuals.min():.4f}')
print(f'Worst residual: {exp_ids[np.argmax(residuals)]} with {residuals.max():.4f}')

In [None]:
# What CV would be needed to hit the target?
target = 0.0347
required_cv = (target - intercept) / slope
print('\n=== Required CV to Hit Target ===')
print(f'Target LB: {target}')
print(f'Required CV: {required_cv:.6f}')
# Only report impossibility when the fitted line actually requires a negative CV
if required_cv < 0:
    print('This is IMPOSSIBLE (negative CV)')

# What if we could reduce the intercept?
print(f'\n=== What if we could reduce the intercept? ===')
for new_intercept in [0.04, 0.03, 0.02, 0.01, 0.0]:
    new_required_cv = (target - new_intercept) / slope
    print(f'Intercept={new_intercept:.2f}: Required CV = {new_required_cv:.6f}')

In [None]:
# Analyze what approaches have been tried
print('=== Approaches Tried (22 of 67 experiments shown) ===')
approaches = [
    ('MLP with Arrhenius kinetics', 'exp_000', 0.0111, 'Baseline'),
    ('LightGBM', 'exp_001', 0.0123, 'Worse than MLP'),
    ('DRFP features', 'exp_002', 0.0169, 'Worse'),
    ('Combined Spange+DRFP', 'exp_003', 0.0105, 'Better'),
    ('Deep Residual MLP', 'exp_004', 0.0519, 'FAILED'),
    ('Large Ensemble (15 models)', 'exp_005', 0.0104, 'Marginal improvement'),
    ('Simpler Model [64,32]', 'exp_006', 0.0097, 'Better'),
    ('Even Simpler [32,16]', 'exp_007', 0.0093, 'Better'),
    ('Ridge Regression', 'exp_009', 0.0092, 'Similar'),
    ('GP + MLP + LGBM Ensemble', 'exp_030', 0.0083, 'Near-best CV'),
    ('Higher GP Weight', 'exp_031', 0.0085, 'Similar'),
    ('Pure GP', 'exp_032', 0.0082, 'BEST CV'),
    ('Aggressive Regularization', 'exp_041', 0.0090, 'Worse LB than expected'),
    ('GroupKFold CV', 'exp_042', 0.0145, 'Much worse'),
    ('TabNet', 'exp_059', 0.0366, 'FAILED'),
    ('CQR', 'exp_060', 0.0099, 'Worse'),
    ('Importance Weighting', 'exp_061', 0.0104, 'Worse'),
    ('Mixup Augmentation', 'exp_062', 0.0094, 'Similar'),
    ('Uncertainty Weighting', 'exp_063', 0.0102, 'Worse'),
    ('Isotonic Calibration', 'exp_064', 0.0094, 'Similar'),
    ('Multi-Seed Ensemble', 'exp_065', 0.0096, 'Worse'),
    ('Per-Target Analysis', 'exp_066', 0.0093, 'Similar'),
]

for name, exp_id, cv, result in approaches:
    print(f'{exp_id}: {name} - CV={cv:.4f} - {result}')

In [None]:
# Key insight: The intercept problem
print('=== KEY INSIGHT: The Intercept Problem ===')
print(f'''
The CV-LB relationship has:
- Slope: {slope:.2f} (each 0.001 CV improvement → {slope*0.001:.4f} LB improvement)
- Intercept: {intercept:.4f} (baseline LB when CV=0)

The intercept ({intercept:.4f}) is HIGHER than the target ({target}).

This means:
1. Even with perfect CV=0, the expected LB would be {intercept:.4f}
2. The current approach CANNOT reach the target by minimizing CV alone
3. We need an approach that CHANGES the CV-LB relationship itself

What might change the relationship:
1. A model that generalizes better to OOD solvents (lower intercept)
2. A training strategy that optimizes for OOD performance
3. A fundamentally different approach to the problem

The intercept represents the "irreducible" error when extrapolating to new solvents.
To reduce it, we need to either:
- Find features that better capture solvent similarity
- Use a model that is more conservative on OOD samples
- Find a way to leverage the test distribution information
''')

In [None]:
# What approaches might change the CV-LB relationship?
print('=== Approaches That Might Change the CV-LB Relationship ===')
print('''
1. PREDICTION SHRINKAGE TOWARD TRAINING MEAN
   - For each test sample, compute its "distance" from training distribution
   - Shrink predictions toward the training mean proportionally to distance
   - Rationale: OOD samples should have more conservative predictions
   - STATUS: Tried in exp_064 (isotonic calibration) - didn't help

2. SIMILARITY-WEIGHTED PREDICTIONS
   - For each test solvent, compute similarity to all training solvents
   - Weight predictions by similarity to training solvents
   - Rationale: Models that work well on similar solvents should be weighted higher
   - STATUS: Tried in exp_037, exp_038 - didn't help

3. UNCERTAINTY-BASED CONSERVATIVE PREDICTIONS
   - Use GP uncertainty estimates
   - For high-uncertainty predictions, shrink toward a safe default
   - Rationale: When uncertain, be conservative
   - STATUS: Tried in exp_063 - didn't help

4. BIAS CORRECTION
   - The per-target analysis (exp_066) shows positive mean error (over-prediction)
   - Apply a simple correction: subtract mean training error from predictions
   - STATUS: Not explicitly tried

5. DOMAIN ADAPTATION / TRANSFER LEARNING
   - Use pre-trained molecular embeddings
   - STATUS: Tried ChemBERTa (exp_050) - FAILED

6. GRAPH NEURAL NETWORKS
   - The benchmark paper achieved CV 0.0039 with GAT + DRFP
   - STATUS: Tried simple GNN (exp_049, exp_054) - FAILED (OOM or worse CV)

7. ENSEMBLE SELECTION BASED ON OOD PERFORMANCE
   - Select models that generalize better, not just those with best CV
   - STATUS: Not explicitly tried
''')
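
Approach 4 (bias correction) is simple enough to sketch. The version below is an illustration under assumptions, not the notebook's implementation: it estimates the mean signed error (prediction minus truth) per target from out-of-fold predictions rather than in-sample fits, and it runs on synthetic stand-in data. `bias_correct`, `oof_pred`, and the array shapes are hypothetical.

```python
import numpy as np

def bias_correct(test_pred, oof_pred, y_train):
    """Subtract the per-target mean signed error measured on held-out training folds."""
    mean_error = np.mean(oof_pred - y_train, axis=0)  # positive value = over-prediction
    return test_pred - mean_error, mean_error

# Synthetic stand-in data (hypothetical, NOT taken from the experiments): 3 targets
rng = np.random.default_rng(0)
y_train = rng.uniform(0.0, 1.0, size=(200, 3))
# Out-of-fold predictions with a deliberate +0.05 over-prediction plus noise
oof_pred = y_train + 0.05 + rng.normal(0.0, 0.02, size=y_train.shape)
test_pred = rng.uniform(0.0, 1.0, size=(10, 3))

corrected, mean_error = bias_correct(test_pred, oof_pred, y_train)
print('Estimated per-target bias:', np.round(mean_error, 3))
```

A constant offset estimated on in-distribution folds shifts every prediction equally, so it can only help if the over-prediction seen in training also holds for unseen solvents.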

In [None]:
# Summary and recommendations
print('=== SUMMARY AND RECOMMENDATIONS ===')
print(f'''
Current Status:
- Best CV: 0.008194 (exp_032)
- Best LB: 0.0873 (exp_032)
- Target: 0.0347
- Gap: {0.0873 - 0.0347:.4f} ({(0.0873 - 0.0347) / 0.0873:.1%} reduction in LB needed)
- Remaining submissions: 4

The CV-LB relationship is:
- LB = {slope:.2f} × CV + {intercept:.4f}
- Intercept ({intercept:.4f}) > Target (0.0347)

Key Insight:
The intercept problem is the fundamental obstacle. All 14 submitted experiments lie close
to the same CV-LB line (R² ≈ 0.98). No approach submitted so far has changed the intercept.

Recommended Approaches (in order of priority):

1. BIAS CORRECTION (NOT YET TRIED)
   - The per-target analysis (exp_066) showed positive mean error (over-prediction)
   - Apply a simple correction: subtract mean training error from predictions
   - This might reduce the intercept

2. SOLVENT-SPECIFIC BIAS CORRECTION
   - Compute per-solvent bias on training data
   - Apply correction based on similarity to training solvents
   - This is a form of "calibration" that might help OOD generalization

3. CONSERVATIVE PREDICTION FOR DISSIMILAR SOLVENTS
   - Compute distance from each test sample to training distribution
   - For distant samples, blend predictions with training mean
   - This is a form of "shrinkage" based on OOD distance

4. ENSEMBLE WITH DIFFERENT FEATURE SETS
   - Train models with different feature sets (Spange-only, DRFP-only, combined)
   - Ensemble by selecting the most conservative prediction for each sample
   - Different models might have different failure modes on OOD samples

5. SMALLER, MORE REGULARIZED MODELS
   - The simpler models (exp_006, exp_007) had better CV
   - Try even simpler models with stronger regularization
   - Simpler models might generalize better to OOD samples

DO NOT TRY:
- Multi-seed averaging (exp_065) - made CV worse
- Isotonic calibration (exp_064) - didn't help
- Prediction shrinkage (uniform) - didn't help
- Importance weighting (exp_061) - didn't help
- GNN/GAT (exp_049, exp_054) - failed
- ChemBERTa (exp_050) - failed
- TabNet (exp_059) - failed
''')
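
Recommendation 3 (conservative prediction for dissimilar solvents) can be sketched as distance-dependent shrinkage. This is a minimal illustration under assumptions: distance is taken as the Euclidean distance to the nearest training row in standardized feature space, and the blend weight is `d / (d + scale)`. The function name, the `scale` parameter, and the synthetic data are all hypothetical choices, not something the experiments above used.

```python
import numpy as np

def shrink_toward_mean(pred, X_test, X_train, y_train_mean, scale=1.0):
    """Blend each prediction with the training mean, more strongly for rows far from training data."""
    # Standardize with training statistics so no single feature dominates the distance
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0) + 1e-12
    Z_train, Z_test = (X_train - mu) / sigma, (X_test - mu) / sigma
    # Euclidean distance from each test row to its nearest training row
    dists = np.linalg.norm(Z_test[:, None, :] - Z_train[None, :, :], axis=2).min(axis=1)
    # Weight in [0, 1): 0 at zero distance, approaching 1 for very distant rows
    w = dists / (dists + scale)
    return (1.0 - w)[:, None] * pred + w[:, None] * y_train_mean, w

# Synthetic stand-in data (hypothetical): 4 features, 2 targets
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 4))
y_train_mean = np.array([0.3, 0.5])
X_test = np.vstack([X_train[0], X_train[0] + 100.0])  # one seen row, one far-OOD row
pred = np.array([[0.9, 0.1], [0.9, 0.1]])

shrunk, w = shrink_toward_mean(pred, X_test, X_train, y_train_mean)
print('Shrinkage weights:', np.round(w, 3))
print('Shrunk predictions:\n', np.round(shrunk, 3))
```

The row that coincides with a training point is left unchanged, while the distant row is pulled almost entirely to the training mean. Choosing `scale` would itself need validation, for example on held-out solvents.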

In [None]:
# Final recommendation
print('=== FINAL RECOMMENDATION ===')
print('''
The target (0.0347) IS reachable. The solution exists.

The key insight is that the CV-LB relationship has an intercept (0.0532) that exceeds
the target (0.0347). This means we need an approach that CHANGES the relationship,
not just minimizes CV.

The most promising unexplored approach is BIAS CORRECTION:
1. The per-target analysis showed positive mean error (over-prediction)
2. Apply a simple correction: subtract mean training error from predictions
3. This might reduce the intercept in the CV-LB relationship

Alternatively, try SOLVENT-SPECIFIC BIAS CORRECTION:
1. Compute per-solvent bias on training data
2. Apply correction based on similarity to training solvents
3. This is a form of "calibration" that might help OOD generalization

With 4 submissions remaining, we should:
1. Try bias correction approaches
2. Submit only candidates expected to land below the current CV-LB line, not merely those with better CV
3. Focus on approaches that might change the CV-LB relationship

DO NOT GIVE UP. The target IS reachable.
''')
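
The solvent-specific variant could look roughly like the sketch below. It assumes each training solvent has a descriptor vector and a measured mean signed error, and it estimates the bias for a new solvent as an RBF-similarity-weighted average of those errors. The kernel choice, `length_scale`, and all values are hypothetical placeholders for illustration only.

```python
import numpy as np

def similarity_weighted_bias(test_desc, train_desc, train_bias, length_scale=1.0):
    """Estimate the bias for one test solvent from per-solvent training biases."""
    sq_dist = np.sum((train_desc - test_desc) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * length_scale ** 2))  # RBF similarity
    total = weights.sum()
    if total < 1e-12:
        # No training solvent is similar: fall back to the global mean bias
        return float(np.mean(train_bias))
    return float(weights @ train_bias / total)

# Hypothetical descriptors and per-solvent mean signed errors (not real data)
train_desc = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
train_bias = np.array([0.02, -0.01, 0.04])

bias_seen = similarity_weighted_bias(np.array([0.0, 0.0]), train_desc, train_bias)
bias_between = similarity_weighted_bias(np.array([5.0, 5.0]), train_desc, train_bias)
print(f'Bias for a solvent matching a training solvent: {bias_seen:.4f}')
print(f'Bias for a solvent equidistant from all three:  {bias_between:.4f}')
```

The estimated bias would then be subtracted from that solvent's predictions. For a solvent unlike anything in training, the estimate falls back to the global mean bias, which is the plain bias correction described above.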