# Loop 93 Analysis: Strategic Assessment

## Key Findings

1. **CV-LB Relationship**: LB = 4.31 * CV + 0.0525 (R² = 0.95)
2. **Intercept (0.0525) > Target (0.0347)** - along this line even a perfect CV of 0 maps to LB 0.0525, so the target is unreachable with current approaches
3. **93 experiments** have been tried; all 12 with LB scores fall on the same CV-LB line
4. **Best CV**: 0.0083 (exp_030) → **Best LB**: 0.0877
5. **Gap to target**: 152.7%

## Critical Problem

The conservative extrapolation approach (exp_092) CANNOT be validated with CV:
- CV with blending (0.3): 0.014120 (70% worse than baseline)
- CV without blending: 0.010097 (22% worse than baseline)
- Baseline: 0.008298

This is because CV tests on held-out solvents that are SIMILAR to training solvents, while the LB tests on TRULY UNSEEN solvents.

```python
import numpy as np
from scipy import stats

TARGET_LB = 0.0347

# CV-LB data from submissions: (experiment, CV score, LB score)
submissions = [
    ("exp_000", 0.0111, 0.0982),
    ("exp_001", 0.0123, 0.1065),
    ("exp_003", 0.0105, 0.0972),
    ("exp_005", 0.0104, 0.0969),
    ("exp_006", 0.0097, 0.0946),
    ("exp_007", 0.0093, 0.0932),
    ("exp_009", 0.0092, 0.0936),
    ("exp_012", 0.0090, 0.0913),
    ("exp_024", 0.0087, 0.0893),
    ("exp_026", 0.0085, 0.0887),
    ("exp_030", 0.0083, 0.0877),
    ("exp_035", 0.0098, 0.0970),
]

cvs = np.array([cv for _, cv, _ in submissions])
lbs = np.array([lb for _, _, lb in submissions])

# Ordinary least squares fit of LB against CV
slope, intercept, r_value, p_value, std_err = stats.linregress(cvs, lbs)

# Best submission = lowest LB score
best_name, best_cv, best_lb = min(submissions, key=lambda s: s[2])

print("CV-LB Relationship Analysis")
print("=" * 50)
print(f"Linear fit: LB = {slope:.4f} * CV + {intercept:.4f}")
print(f"R² = {r_value**2:.4f}")
print(f"Intercept = {intercept:.4f}")
print(f"Target LB = {TARGET_LB}")
print(f"Required CV for target = ({TARGET_LB} - {intercept:.4f}) / {slope:.4f} = {(TARGET_LB - intercept) / slope:.4f}")
print()
print(f"CRITICAL: Intercept ({intercept:.4f}) > Target ({TARGET_LB})")
print("The target is MATHEMATICALLY UNREACHABLE with current approaches!")
print()
print(f"Best CV: {best_cv:.4f} ({best_name})")
print(f"Best LB: {best_lb:.4f} ({best_name})")
print(f"Gap to target: {(best_lb - TARGET_LB) / TARGET_LB * 100:.1f}%")
```

```
CV-LB Relationship Analysis
==================================================
Linear fit: LB = 4.3147 * CV + 0.0525
R² = 0.9505
Intercept = 0.0525
Target LB = 0.0347
Required CV for target = (0.0347 - 0.0525) / 4.3147 = -0.0041

CRITICAL: Intercept (0.0525) > Target (0.0347)
The target is MATHEMATICALLY UNREACHABLE with current approaches!

Best CV: 0.0083 (exp_030)
Best LB: 0.0877 (exp_030)
Gap to target: 152.7%
```
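
The intercept is an extrapolation to CV = 0, well outside the observed CV range (0.0083 to 0.0123), so it is worth checking how uncertain it is. The sketch below refits the same 12 points with a plain least-squares formula and attaches a 95% confidence interval to the intercept. It assumes independent, roughly Gaussian residuals, which 12 points cannot really confirm; the names `intercept_with_stderr` and `T_CRIT_95` are illustrative.

```python
import math

# (CV, LB) pairs copied from the submissions list above.
points = [
    (0.0111, 0.0982), (0.0123, 0.1065), (0.0105, 0.0972), (0.0104, 0.0969),
    (0.0097, 0.0946), (0.0093, 0.0932), (0.0092, 0.0936), (0.0090, 0.0913),
    (0.0087, 0.0893), (0.0085, 0.0887), (0.0083, 0.0877), (0.0098, 0.0970),
]
TARGET_LB = 0.0347
T_CRIT_95 = 2.228  # two-sided 95% Student-t quantile for n - 2 = 10 degrees of freedom


def intercept_with_stderr(pairs):
    """Least-squares fit; returns (slope, intercept, standard error of the intercept)."""
    n = len(pairs)
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    sxx = sum((x - x_mean) ** 2 for x, _ in pairs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)
    residual_var = sse / (n - 2)
    stderr = math.sqrt(residual_var * (1.0 / n + x_mean ** 2 / sxx))
    return slope, intercept, stderr


slope, intercept, stderr = intercept_with_stderr(points)
lower = intercept - T_CRIT_95 * stderr
upper = intercept + T_CRIT_95 * stderr
print(f"Intercept = {intercept:.4f} +/- {stderr:.4f} (standard error)")
print(f"95% CI for intercept: [{lower:.4f}, {upper:.4f}]")
print(f"Lower bound above target: {lower > TARGET_LB}")
```

Under these assumptions the lower bound of the interval still sits above the target, so the conclusion does not hinge on the exact fitted value of the intercept.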


## What We've Learned

### 1. All Tabular Models Fall on the Same Line
- MLP, LightGBM, XGBoost, CatBoost, GP, Ridge - ALL produce the same CV-LB relationship
- This means the problem is NOT the model - it's DISTRIBUTION SHIFT

### 2. GNN/ChemBERTa Attempts Failed
- 5+ GNN experiments achieved CV 0.018-0.026 (2-3x worse than baseline)
- The benchmark paper achieved MSE 0.0039 with GNNs - roughly 5-7x better than our GNNs (0.018/0.0039 ≈ 4.6, 0.026/0.0039 ≈ 6.7)
- Possible reasons: implementation issues, no pre-training, wrong architecture

### 3. Conservative Extrapolation Cannot Be Validated
- The approach is designed to help on TRULY UNSEEN solvents
- CV tests on held-out solvents that are SIMILAR to training
- An approach tuned for truly unseen solvents is therefore likely to HURT CV, as exp_092 did

### 4. The Validation Paradox
- We cannot validate intercept-reduction strategies with CV
- With only 4 submissions remaining, we cannot afford to "guess"
- The only path forward is approaches that improve BOTH CV and (hopefully) intercept

## Key Insights from Public Kernels

### 1. MixAll Kernel (lishellliang)
- Uses **GroupKFold (5 splits)** instead of Leave-One-Out
- Claims "good CV/LB correlation"
- Ensemble of MLP + XGBoost + RF + LightGBM
- Runtime: only 2m 15s
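
A group-aware split keeps every row of a given group in the same fold, so no group is ever on both the training and the validation side. scikit-learn ships this as `GroupKFold`; the dependency-free sketch below only illustrates the idea. Grouping by solvent is an assumption here (these notes do not record which key the kernel groups on), and the solvent labels are made up.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs in which no group appears on both sides.

    Groups are assigned greedily, largest first, to the currently smallest fold.
    """
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    if len(members) < n_splits:
        raise ValueError("need at least as many groups as folds")

    folds = [[] for _ in range(n_splits)]
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_splits), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[group])

    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(groups)) if i not in held_out]
        yield train_idx, sorted(test_idx)


# Hypothetical solvent labels, one per training row.
solvents = ["water", "water", "ethanol", "ethanol", "ethanol", "thf",
            "dmf", "dmf", "toluene", "acetone", "acetone", "hexane"]

splits = list(group_k_fold(solvents, n_splits=5))
for train_idx, test_idx in splits:
    print("held-out solvents:", sorted({solvents[i] for i in test_idx}))
```

Compared with Leave-One-Out, each fold holds out several solvents at once, which makes validation cheaper but does not by itself make the held-out solvents any less similar to the training ones.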

### 2. Ens Model Kernel (matthewmaree)
- Uses **CatBoost + XGBoost** ensemble
- Different ensemble weights for the single-solvent data (7:6) vs the full data (1:2)
- Combines ALL feature sources (spange, acs_pca, drfps, fragprints, smiles)
- Correlation-based feature filtering with priority
- Yield renormalization (clip to 0, normalize sum ≤ 1)
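
The yield renormalization step can be sketched as below. This is a minimal reading of "clip to 0, normalize sum ≤ 1", assuming yields are fractions in [0, 1] and that rows already summing to at most 1 are left alone; the kernel's exact procedure is not reproduced here and `renormalize_yields` is an illustrative name.

```python
def renormalize_yields(yields):
    """Clip negative predictions to 0, then rescale only if the total exceeds 1."""
    clipped = [max(0.0, y) for y in yields]
    total = sum(clipped)
    if total > 1.0:
        clipped = [y / total for y in clipped]
    return clipped


# Hypothetical predicted yields for the products of one reaction.
over_budget = renormalize_yields([0.70, 0.45, -0.05])
within_budget = renormalize_yields([0.30, 0.20, 0.10])
print(over_budget)
print(within_budget)
```

Because the correction is applied after prediction, it enforces a mass-balance style constraint regardless of which model produced the raw outputs.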

### 3. System Malfunction V1 (omarafik)
- Simple MLP baseline
- Uses standard Leave-One-Out CV
- 29 votes - popular but basic

## Strategic Assessment

### The Fundamental Problem
The CV-LB intercept (0.0525) is ABOVE the target (0.0347). This means:
1. Even with perfect CV (0.0), expected LB would be 0.0525
2. No amount of model tuning can fix this
3. We need to CHANGE THE RELATIONSHIP, not improve CV
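
The three points above follow directly from the fitted coefficients. The short check below uses the rounded values from the fit output and also asks what intercept would be needed at the current best CV if the slope stayed the same; the constant and function names are illustrative.

```python
SLOPE = 4.3147      # fitted slope from the CV-LB regression
INTERCEPT = 0.0525  # fitted intercept
TARGET_LB = 0.0347
BEST_CV = 0.0083    # exp_030


def predicted_lb(cv, slope=SLOPE, intercept=INTERCEPT):
    """LB implied by the fitted line for a given CV score."""
    return slope * cv + intercept


# 1. A perfect CV still maps to the intercept.
lb_at_perfect_cv = predicted_lb(0.0)

# 2. The CV needed to hit the target on this line is negative, i.e. infeasible.
required_cv = (TARGET_LB - INTERCEPT) / SLOPE

# 3. Intercept needed at the current best CV if the slope were unchanged.
required_intercept = TARGET_LB - SLOPE * BEST_CV

print(f"LB at CV = 0:            {lb_at_perfect_cv:.4f}")
print(f"CV required for target:  {required_cv:.4f}")
print(f"Intercept required at best CV with same slope: {required_intercept:.4f}")
```

The last value is also negative: at a slope of about 4.3, the slope term alone at the best CV already exceeds the target, so lowering the intercept would not be enough without a flatter slope or a much lower CV.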

### What Could Change the Relationship?
1. **Pre-trained molecular models** - ChemBERTa, MolBERT, ChemProp
   - These have learned chemistry from millions of molecules
   - May generalize better to unseen solvents
   
2. **Proper GNN implementation** - The benchmark paper achieved 0.0039
   - Our GNN attempts were roughly 5-7x worse
   - Need to investigate why
   
3. **Domain constraints** - Physics-based rules that hold even on unseen data
   - Mass balance constraints
   - Thermodynamic constraints
   
4. **Pseudo-labeling** - Use confident test predictions to augment training
   - Adapt to test distribution
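
For the pseudo-labeling option, one possible way to pick "confident" test predictions is to keep only the rows on which several models agree. The sketch below assumes that low ensemble spread is an acceptable proxy for confidence, which is a strong assumption under distribution shift, since similar models can agree and still be wrong. The function name, the threshold, and the example predictions are all hypothetical.

```python
from statistics import mean, pstdev


def select_pseudo_labels(ensemble_preds, max_std):
    """Return (row_index, mean_prediction) for rows where ensemble members agree.

    ensemble_preds[m][i] is model m's prediction for unlabeled row i.
    """
    n_rows = len(ensemble_preds[0])
    selected = []
    for i in range(n_rows):
        row = [preds[i] for preds in ensemble_preds]
        if pstdev(row) <= max_std:
            selected.append((i, mean(row)))
    return selected


# Hypothetical predictions from three models on four unlabeled test rows.
preds = [
    [0.10, 0.50, 0.80, 0.30],
    [0.11, 0.35, 0.79, 0.31],
    [0.09, 0.65, 0.81, 0.29],
]
pseudo = select_pseudo_labels(preds, max_std=0.02)
print(pseudo)
```

The selected rows and their mean predictions would then be appended to the training set before refitting; row 1 is dropped in this example because the three models disagree on it.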

### What WON'T Change the Relationship?
- More MLP/LGBM/XGB variants (92 experiments exhausted this)
- More feature engineering without changing prediction strategy
- Multi-seed ensembles for variance reduction
- Conservative blending (cannot be validated)

## Recommendations for Next Experiment

### PRIORITY 1: Investigate GNN Failures
The benchmark paper achieved MSE 0.0039 with GNNs. Our GNN attempts achieved CV ~0.018-0.026. This roughly 5-7x gap is suspicious.

Questions to investigate:
1. Did the GNN submission cells use the SAME model class as CV computation?
2. What specific architecture did the benchmark paper use? (GAT with DRFP and learned mixture encodings)
3. Did we use pre-trained molecular embeddings?

### PRIORITY 2: Try Pre-trained Molecular Models
Options that may generalize better to unseen solvents:
- **ChemProp**: directed message-passing network (D-MPNN) for molecular graphs; it trains from scratch by default, so pre-trained weights would have to be sourced separately
- **MolBERT/ChemBERTa**: Pre-trained molecular transformers
- Use these as FEATURE EXTRACTORS, not end-to-end models

### PRIORITY 3: Implement Proper GAT Architecture
Based on the benchmark paper:
- Graph Attention Networks (GAT) for molecular graphs
- DRFP features for reaction encoding
- Learned mixture-aware solvent encodings (not just linear interpolation)
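
For reference, the linear-interpolation baseline that a learned mixture encoding would replace can be written in a few lines. This sketch assumes each solvent is described by a fixed-length descriptor vector and that the mixture is parameterized by the fraction of the second solvent; the descriptor values are made up.

```python
def interpolate_mixture(features_a, features_b, fraction_b):
    """Linearly blend two solvent descriptor vectors by the fraction of solvent B."""
    if not 0.0 <= fraction_b <= 1.0:
        raise ValueError("fraction_b must be in [0, 1]")
    if len(features_a) != len(features_b):
        raise ValueError("descriptor vectors must have the same length")
    return [(1.0 - fraction_b) * a + fraction_b * b
            for a, b in zip(features_a, features_b)]


# Hypothetical 3-dimensional descriptors for two solvents.
solvent_a = [1.0, 0.2, 5.0]
solvent_b = [3.0, 0.8, 1.0]
half_mix = interpolate_mixture(solvent_a, solvent_b, 0.5)
print(half_mix)
```

A blend like this cannot represent non-additive mixture effects, because the encoding of a 50/50 mixture is always the midpoint of the two pure solvents; a learned encoding is meant to lift that restriction.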

### DO NOT DO:
- ❌ More tabular model variants
- ❌ More conservative blending variants (cannot be validated)
- ❌ Experiments whose CV is worse than the baseline (0.008298)