# Loop 65 Analysis: Strategic Assessment

## Key Problem
- Best CV: 0.008194 (exp_032)
- Best LB: 0.0877 (exp_030)
- CV-LB relationship: LB = 4.23 × CV + 0.0533 (R² = 0.98)
- Target: 0.0347
- **CRITICAL**: Intercept (0.0533) > Target (0.0347)

## What This Means
Even at CV = 0 (unattainable in practice), the fitted line predicts LB = 0.0533, which is above the target.
If this linear relationship holds, lowering CV alone CANNOT reach the target; the intercept itself has to come down.
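
The argument can be written out explicitly. This is a short derivation, taking the slope $a = 4.23$ and intercept $b = 0.0533$ from the regression output printed by cell 2, writing $T = 0.0347$ for the target, and assuming the linear fit still holds when extrapolated down to CV = 0 (well below the observed CV range of 0.0083 to 0.0145):

$$
\mathrm{LB} = a \cdot \mathrm{CV} + b, \qquad a > 0,\ \mathrm{CV} \ge 0 \;\Rightarrow\; \mathrm{LB} \ge b = 0.0533 > T = 0.0347
$$

$$
\mathrm{CV}^{*} = \frac{T - b}{a} = \frac{0.0347 - 0.0533}{4.23} \approx -0.0044 < 0
$$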

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
    {'exp': 'exp_041', 'cv': 0.0090, 'lb': 0.0932},
    {'exp': 'exp_042', 'cv': 0.0145, 'lb': 0.1147},
]

df = pd.DataFrame(submissions)
print('Submission History:')
print(df.to_string())

Submission History:
        exp      cv      lb
0   exp_000  0.0111  0.0982
1   exp_001  0.0123  0.1065
2   exp_003  0.0105  0.0972
3   exp_005  0.0104  0.0969
4   exp_006  0.0097  0.0946
5   exp_007  0.0093  0.0932
6   exp_009  0.0092  0.0936
7   exp_012  0.0090  0.0913
8   exp_024  0.0087  0.0893
9   exp_026  0.0085  0.0887
10  exp_030  0.0083  0.0877
11  exp_041  0.0090  0.0932
12  exp_042  0.0145  0.1147


In [2]:
# Fit CV-LB relationship
from scipy import stats

cv = df['cv'].values
lb = df['lb'].values

slope, intercept, r_value, p_value, std_err = stats.linregress(cv, lb)
print(f'\nCV-LB Relationship:')
print(f'LB = {slope:.2f} × CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'\nIntercept: {intercept:.4f}')
print(f'Target: 0.0347')
print(f'Gap: {intercept - 0.0347:.4f}')


CV-LB Relationship:
LB = 4.23 × CV + 0.0533
R² = 0.9807

Intercept: 0.0533
Target: 0.0347
Gap: 0.0186
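
The conclusion rests on the intercept being above the target, so it may be worth checking how uncertain that intercept is with only 13 points. The sketch below is not part of the original analysis: it recomputes the fit on the same submission history and builds a 95% confidence interval for the intercept. It assumes SciPy 1.6 or newer, where the `linregress` result object exposes `intercept_stderr`, and it assumes the usual linear-regression error model (independent, roughly normal residuals), which is a strong assumption for leaderboard scores.

```python
import numpy as np
from scipy import stats

# CV and LB scores copied from the submission history above.
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093, 0.0092,
               0.0090, 0.0087, 0.0085, 0.0083, 0.0090, 0.0145])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932, 0.0936,
               0.0913, 0.0893, 0.0887, 0.0877, 0.0932, 0.1147])
target = 0.0347

# Keep the result object: intercept_stderr is only available as an attribute,
# not through 5-value tuple unpacking.
res = stats.linregress(cv, lb)

# Two-sided 95% interval from the t distribution with n - 2 degrees of freedom.
t_crit = stats.t.ppf(0.975, len(cv) - 2)
ci_low = res.intercept - t_crit * res.intercept_stderr
ci_high = res.intercept + t_crit * res.intercept_stderr

print(f'Intercept: {res.intercept:.4f}')
print(f'95% CI for intercept: [{ci_low:.4f}, {ci_high:.4f}]')
print(f'Target below CI lower bound: {target < ci_low}')
```

If the lower bound stays above the target, the "intercept > target" finding is unlikely to be an artifact of fitting noise, although it still depends on extrapolating the line far below the observed CV range.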


In [3]:
# What CV would we need to hit target?
target = 0.0347
required_cv = (target - intercept) / slope
print(f'\nRequired CV to hit target: {required_cv:.6f}')
if required_cv < 0:
    print('IMPOSSIBLE: Required CV is NEGATIVE!')
    print('The current approach CANNOT reach the target.')


Required CV to hit target: -0.004396
IMPOSSIBLE: Required CV is NEGATIVE!
The current approach CANNOT reach the target.


In [4]:
# Analyze residuals - which submissions beat the trend?
df['predicted_lb'] = slope * df['cv'] + intercept
df['residual'] = df['lb'] - df['predicted_lb']
df['residual_pct'] = df['residual'] / df['predicted_lb'] * 100

print('\nResidual Analysis (negative = better than expected):')
print(df[['exp', 'cv', 'lb', 'predicted_lb', 'residual', 'residual_pct']].to_string())

print(f'\nBest residual: {df["residual"].min():.4f} ({df.loc[df["residual"].idxmin(), "exp"]})')
print(f'Worst residual: {df["residual"].max():.4f} ({df.loc[df["residual"].idxmax(), "exp"]})')


Residual Analysis (negative = better than expected):
        exp      cv      lb  predicted_lb  residual  residual_pct
0   exp_000  0.0111  0.0982      0.100269 -0.002069     -2.062964
1   exp_001  0.0123  0.1065      0.105346  0.001154      1.095494
2   exp_003  0.0105  0.0972      0.097730 -0.000530     -0.542091
3   exp_005  0.0104  0.0969      0.097307 -0.000407     -0.417920
4   exp_006  0.0097  0.0946      0.094345  0.000255      0.270471
5   exp_007  0.0093  0.0932      0.092652  0.000548      0.591085
6   exp_009  0.0092  0.0936      0.092229  0.001371      1.486269
7   exp_012  0.0090  0.0913      0.091383 -0.000083     -0.090811
8   exp_024  0.0087  0.0893      0.090114 -0.000814     -0.902889
9   exp_026  0.0085  0.0887      0.089267 -0.000567     -0.635603
10  exp_030  0.0083  0.0877      0.088421 -0.000721     -0.815582
11  exp_041  0.0090  0.0932      0.091383  0.001817      1.988351
12  exp_042  0.0145  0.1147      0.114655  0.000045      0.039615

Best residual: -0.0021 (exp_000)
Worst residual: 0.0018 (exp_041)
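
Residuals show which submissions sit off the line, but not how much any single submission drives the fitted intercept. A complementary check, sketched here as an addition to the original analysis, is a leave-one-out refit: drop each submission in turn and record the intercept of the remaining 12 points. This is of interest for exp_042, whose CV (0.0145) is far from the rest and therefore has high leverage.

```python
import numpy as np
from scipy import stats

# Submission history copied from the table above.
exps = ['exp_000', 'exp_001', 'exp_003', 'exp_005', 'exp_006', 'exp_007',
        'exp_009', 'exp_012', 'exp_024', 'exp_026', 'exp_030', 'exp_041',
        'exp_042']
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093, 0.0092,
               0.0090, 0.0087, 0.0085, 0.0083, 0.0090, 0.0145])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932, 0.0936,
               0.0913, 0.0893, 0.0887, 0.0877, 0.0932, 0.1147])
target = 0.0347

# Refit the line once per submission, each time leaving that submission out.
loo_intercepts = {}
for i, name in enumerate(exps):
    keep = np.arange(len(cv)) != i
    fit = stats.linregress(cv[keep], lb[keep])
    loo_intercepts[name] = fit.intercept

for name, b in loo_intercepts.items():
    print(f'drop {name}: intercept = {b:.4f}')

lowest = min(loo_intercepts.values())
highest = max(loo_intercepts.values())
print(f'\nIntercept range across refits: [{lowest:.4f}, {highest:.4f}]')
print(f'All refits above target: {lowest > target}')
```

A narrow range means no single submission is responsible for the high intercept.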

In [5]:
# What approaches have been tried?
approaches_tried = [
    'MLP with Spange features',
    'LightGBM',
    'DRFP features (PCA)',
    'Combined Spange + DRFP + ACS PCA',
    'Deep Residual MLP (FAILED)',
    'Large Ensemble (15 models)',
    'Simpler models [64, 32]',
    'Ridge Regression',
    'Diverse Ensemble (MLP + LightGBM)',
    'GP + MLP + LGBM ensemble',
    'Per-target optimization',
    'Per-solvent-type models (FAILED)',
    'GNN/GAT (FAILED)',
    'ChemBERTa (FAILED)',
    'TabNet (FAILED)',
    'Importance weighting',
    'Mixup augmentation',
    'Uncertainty weighting',
    'Isotonic calibration',
    'Prediction shrinkage',
    'GroupKFold CV',
    'Aggressive regularization',
    'Physical constraints (mass balance)',
    'Conformalized Quantile Regression',
]

print('Approaches Tried (66 experiments):')
for i, approach in enumerate(approaches_tried, 1):
    print(f'{i}. {approach}')

Approaches Tried (66 experiments):
1. MLP with Spange features
2. LightGBM
3. DRFP features (PCA)
4. Combined Spange + DRFP + ACS PCA
5. Deep Residual MLP (FAILED)
6. Large Ensemble (15 models)
7. Simpler models [64, 32]
8. Ridge Regression
9. Diverse Ensemble (MLP + LightGBM)
10. GP + MLP + LGBM ensemble
11. Per-target optimization
12. Per-solvent-type models (FAILED)
13. GNN/GAT (FAILED)
14. ChemBERTa (FAILED)
15. TabNet (FAILED)
16. Importance weighting
17. Mixup augmentation
18. Uncertainty weighting
19. Isotonic calibration
20. Prediction shrinkage
21. GroupKFold CV
22. Aggressive regularization
23. Physical constraints (mass balance)
24. Conformalized Quantile Regression


In [6]:
# What HASN'T been tried?
untried_approaches = [
    '1. Multi-task learning with auxiliary targets (e.g., predict solvent properties)',
    '2. Domain adaptation techniques (e.g., DANN)',
    '3. Meta-learning (MAML) for few-shot adaptation to new solvents',
    '4. Bayesian Neural Networks for uncertainty quantification',
    '5. Neural Process for conditional predictions',
    '6. Prototype networks for solvent similarity',
    '7. Contrastive learning for solvent representations',
    '8. Self-supervised pre-training on solvent data',
    '9. Transfer learning from related chemistry tasks',
    '10. Ensemble of fundamentally different architectures (not just weights)',
]

print('\nPotentially Untried Approaches:')
for approach in untried_approaches:
    print(approach)


Potentially Untried Approaches:
1. Multi-task learning with auxiliary targets (e.g., predict solvent properties)
2. Domain adaptation techniques (e.g., DANN)
3. Meta-learning (MAML) for few-shot adaptation to new solvents
4. Bayesian Neural Networks for uncertainty quantification
5. Neural Process for conditional predictions
6. Prototype networks for solvent similarity
7. Contrastive learning for solvent representations
8. Self-supervised pre-training on solvent data
9. Transfer learning from related chemistry tasks
10. Ensemble of fundamentally different architectures (not just weights)


In [7]:
# Key insight: The problem is OOD generalization
# The test set likely contains solvents NOT in training
# Our CV (leave-one-solvent-out) simulates this but the gap suggests
# the test solvents are MORE different than any single training solvent

print('\n=== KEY INSIGHT ===')
print('The CV-LB gap suggests the test solvents are MORE different from training')
print('than any single training solvent is from the rest.')
print('')
print('This is an EXTRAPOLATION problem, not an INTERPOLATION problem.')
print('')
print('Possible reasons for the gap:')
print('1. Test solvents have properties outside the training range')
print('2. Test solvents belong to different chemical families')
print('3. The model overfits to training solvent patterns')
print('')
print('What might help:')
print('1. Features that generalize better across chemical space')
print('2. Models that are more robust to distribution shift')
print('3. Regularization that prevents overfitting to training solvents')


=== KEY INSIGHT ===
The CV-LB gap suggests the test solvents are MORE different from training
than any single training solvent is from the rest.

This is an EXTRAPOLATION problem, not an INTERPOLATION problem.

Possible reasons for the gap:
1. Test solvents have properties outside the training range
2. Test solvents belong to different chemical families
3. The model overfits to training solvent patterns

What might help:
1. Features that generalize better across chemical space
2. Models that are more robust to distribution shift
3. Regularization that prevents overfitting to training solvents
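
Possible reason 1 above (test solvents with properties outside the training range) can be checked directly whenever descriptors for a candidate solvent are available. The sketch below is illustrative only: the descriptor matrix is SYNTHETIC, and the helper names `out_of_range_fraction` and `nearest_train_distance` are hypothetical, not part of the existing pipeline. It reports, per candidate, the fraction of descriptors outside the training min/max and the distance to the nearest training solvent in z-scored descriptor space.

```python
import numpy as np

rng = np.random.default_rng(0)

# SYNTHETIC stand-in for a descriptor table: 20 training solvents x 4 descriptors.
train = rng.normal(size=(20, 4))

# Two hypothetical candidates: one at the training centroid, one pushed
# three standard deviations past the training maximum on every descriptor.
inside = train.mean(axis=0)
outside = train.max(axis=0) + 3.0 * train.std(axis=0)
candidates = np.vstack([inside, outside])


def out_of_range_fraction(train, x):
    """Fraction of descriptors lying outside the [min, max] seen in training."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    return np.mean((x < lo) | (x > hi), axis=-1)


def nearest_train_distance(train, x):
    """Euclidean distance to the closest training row after z-scoring."""
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    z_train = (train - mu) / sigma
    z_x = (x - mu) / sigma
    # Pairwise distances, shape (n_candidates, n_train).
    d = np.linalg.norm(z_x[:, None, :] - z_train[None, :, :], axis=-1)
    return d.min(axis=1)


frac = out_of_range_fraction(train, candidates)
dist = nearest_train_distance(train, candidates)

for label, f, d in zip(['inside', 'outside'], frac, dist):
    print(f'{label:8s} out-of-range fraction = {f:.2f}, nearest distance = {d:.2f}')
```

Run per held-out solvent in leave-one-solvent-out CV, the same two numbers would give a rough baseline for how far a typical CV fold extrapolates.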


In [None]:
# Based on research, what approaches might change the CV-LB relationship?

print('=== RESEARCH INSIGHTS ===')
print('')
print('1. BOOM benchmark (2025): "No existing model achieves strong OOD generalization"')
print('   - Even top models have 3x higher OOD error than ID error')
print('   - Roughly in line with the slope of our fitted CV-LB relationship (~4.2x)')
print('')
print('2. QMex + ILR (Nature 2024): "QM descriptors + interaction terms"')
print('   - State-of-the-art extrapolative performance')
print('   - We have Spange descriptors (physicochemical) but not QM descriptors')
print('')
print('3. Meta-learning (Kim 2025): "Leverage unlabeled data to interpolate ID-OOD"')
print('   - Not applicable - we don\'t have unlabeled test solvents')
print('')
print('4. Key insight: "Cluster-based splitting poses hardest challenge (r~0.4)"')
print('   - Our leave-one-solvent-out is similar to cluster-based splitting')
print('   - This may explain the large CV-LB gap, even though our own CV-LB correlation is strong (R² = 0.98)')
print('')
print('=== WHAT WE HAVEN\'T TRIED ===')
print('')
print('1. Interaction terms between features and categorical solvent info')
print('2. Ensemble of models trained on different feature subsets')
print('3. Stacking with meta-learner that learns from OOF predictions')
print('4. Prediction averaging across multiple random seeds (variance reduction)')
print('5. Feature engineering based on solvent similarity to training set')
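
Item 3 in the list above (stacking with a meta-learner on OOF predictions) can be sketched as follows. This is a minimal illustration on SYNTHETIC data, not the project's pipeline: the closed-form ridge helper `ridge_fit_predict`, the feature subsets, and the regularization strengths are all hypothetical stand-ins for the real base models. Folds are grouped by solvent so that each out-of-fold prediction comes from models that never saw that solvent, mirroring leave-one-solvent-out CV.

```python
import numpy as np

rng = np.random.default_rng(0)

# SYNTHETIC data: 6 solvents x 30 rows each, 5 features, one target.
n_groups, n_per, n_feat = 6, 30, 5
groups = np.repeat(np.arange(n_groups), n_per)
X = rng.normal(size=(n_groups * n_per, n_feat))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 0.0]) + 0.1 * rng.normal(size=len(groups))


def ridge_fit_predict(X_tr, y_tr, X_te, alpha):
    """Closed-form ridge regression without intercept (stand-in base model)."""
    A = X_tr.T @ X_tr + alpha * np.eye(X_tr.shape[1])
    w = np.linalg.solve(A, X_tr.T @ y_tr)
    return X_te @ w


# Two base models that differ in feature subset and regularization strength.
base_models = [(slice(0, 3), 1.0), (slice(2, 5), 10.0)]

# Out-of-fold predictions: hold out one solvent at a time.
oof = np.zeros((len(y), len(base_models)))
for g in np.unique(groups):
    te = groups == g
    tr = ~te
    for m, (cols, alpha) in enumerate(base_models):
        oof[te, m] = ridge_fit_predict(X[tr][:, cols], y[tr], X[te][:, cols], alpha)

# Meta-learner: least-squares blend weights fitted on the OOF predictions only.
weights, *_ = np.linalg.lstsq(oof, y, rcond=None)
stacked = oof @ weights


def mse(pred):
    return float(np.mean((pred - y) ** 2))


print('Base OOF MSE:', [round(mse(oof[:, m]), 4) for m in range(oof.shape[1])])
print('Stacked MSE: ', round(mse(stacked), 4))
print('Blend weights:', np.round(weights, 3))
```

The stacked MSE here is measured on the same rows used to fit the blend weights, so it is optimistic; an honest estimate would fit the meta-learner inside a second, outer grouped split.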