# Loop 76 Analysis: Critical Insights from Benchmark Paper

## Key Discovery

The benchmark paper (arXiv:2512.19530) reports an **MSE of 0.0039** using a hybrid GNN architecture.
Our target is **0.0347**, roughly 9x higher (worse) than what the benchmark achieved!

## What the benchmark did differently:
1. **Graph Attention Networks (GATs)** - not simple GCN
2. **DRFP features** - we have this
3. **Learned mixture-aware solvent encodings** - THIS IS THE KEY
4. **Continuous solvent representation** - not categorical

## Our CV-LB Analysis
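- All 12 submissions fall close to the line LB = 4.29 * CV + 0.0528
- Intercept (0.0528) > Target (0.0347), so even CV = 0 would map to an LB above the target on this line
- BUT the benchmark achieved 0.0039, so the target IS reachable with an approach that gets off this line!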

In [None]:
import numpy as np
import pandas as pd
from scipy import stats

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
    {'exp': 'exp_035', 'cv': 0.0098, 'lb': 0.0970},
]

df = pd.DataFrame(submissions)
cv = df['cv'].values
lb = df['lb'].values

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(cv, lb)
print(f'CV-LB Relationship: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'\nIntercept: {intercept:.4f}')
print(f'Target: 0.0347')
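# The intercept exceeds the target, so this value comes out negative:
# no non-negative CV on this line reaches the target.
print(f'\nRequired CV to hit target: {(0.0347 - intercept) / slope:.4f}')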
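
The intercept argument rests on only 12 points, so it can help to check how tightly the intercept is estimated. The block below is a minimal sketch, assuming the usual least-squares conditions hold (independent, roughly normal residuals). The 95% level and the names `TARGET`, `ci_low`, and `ci_high` are illustrative choices, not part of the original analysis.

```python
import numpy as np
from scipy import stats

# CV and LB scores copied from the submission history above
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])
TARGET = 0.0347

res = stats.linregress(cv, lb)

# Two-sided 95% t-interval for the intercept (n - 2 degrees of freedom)
dof = len(cv) - 2
t_crit = stats.t.ppf(0.975, dof)
half_width = t_crit * res.intercept_stderr
ci_low = res.intercept - half_width
ci_high = res.intercept + half_width

print(f'Intercept: {res.intercept:.4f}, 95% CI: [{ci_low:.4f}, {ci_high:.4f}]')
print(f'Target below CI lower bound: {TARGET < ci_low}')
```

If the target sits below the lower bound, the conclusion that this model family cannot reach the target is unlikely to be a fitting artifact.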

In [None]:
# Key insight: the benchmark achieved MSE 0.0039,
# so the target of 0.0347 is well within reach.
BENCHMARK_MSE = 0.0039
TARGET = 0.0347
BEST_LB = 0.0877  # exp_030

print('='*60)
print('BENCHMARK COMPARISON')
print('='*60)
print(f'Benchmark MSE: {BENCHMARK_MSE}')
print(f'Our Target: {TARGET}')
print(f'Our Best LB: {BEST_LB}')
print(f'\nBenchmark is {TARGET/BENCHMARK_MSE:.1f}x better than target')
print(f'Our best is {BEST_LB/TARGET:.1f}x worse than target')
gap = BEST_LB - TARGET
print(f'\nGap to close: {gap:.4f} ({gap/TARGET*100:.1f}% of target)')

In [None]:
# What the benchmark paper used:
print('='*60)
print('BENCHMARK ARCHITECTURE (arXiv:2512.19530)')
print('='*60)
print('''
1. Graph Attention Networks (GATs) - NOT simple GCN
   - Attention mechanism learns which molecular features matter
   - Better than GCN for capturing complex relationships

2. Differential Reaction Fingerprints (DRFP)
   - We have this! Already using DRFP features

3. LEARNED MIXTURE-AWARE SOLVENT ENCODINGS (KEY!)
   - Not just linear interpolation of solvent features
   - Learn a representation that captures mixture effects
   - This is what we're missing!

4. Continuous solvent representation
   - Treat solvent as continuous, not categorical
   - Enable interpolation/extrapolation
''')
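
Item 3 above contrasts linear interpolation of solvent features with a learned mixture representation. The block below is a minimal PyTorch sketch of one way to do that for binary mixtures. It is an assumption for illustration, not the paper's architecture, which these notes do not specify. `MixtureEncoder`, its arguments, and the `w * (1 - w)` interaction term are all hypothetical choices.

```python
import torch
import torch.nn as nn

class MixtureEncoder(nn.Module):
    """Hypothetical encoder for a binary solvent mixture (A, B, fraction of B)."""

    def __init__(self, n_solvents, dim=16):
        super().__init__()
        self.embed = nn.Embedding(n_solvents, dim)  # learned per-solvent vector
        self.interact = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, idx_a, idx_b, frac_b):
        ea, eb = self.embed(idx_a), self.embed(idx_b)
        w = frac_b.unsqueeze(-1)
        # Baseline: plain linear interpolation between the two embeddings
        linear = (1 - w) * ea + w * eb
        # Learned non-linear mixture effect; it vanishes for pure solvents
        # (w = 0 or 1) and is symmetric under swapping A and B
        excess = w * (1 - w) * self.interact(ea * eb)
        return linear + excess

torch.manual_seed(0)
enc = MixtureEncoder(n_solvents=5, dim=8)
idx_a = torch.tensor([0, 1])
idx_b = torch.tensor([2, 3])
frac_b = torch.tensor([0.3, 0.0])   # second sample is pure solvent A
out = enc(idx_a, idx_b, frac_b)     # shape [2, 8]
```

Because `frac_b` enters as a continuous input, the same module also covers item 4: any mixing ratio can be encoded, not just the ratios seen in training.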

In [None]:
# Why our GNN attempts failed:
print('='*60)
print('WHY OUR GNN ATTEMPTS FAILED')
print('='*60)
print('''
exp_040 (GNN): CV = 0.0256 (3x worse than our best CV of 0.0083)
exp_070 (GNN clean): CV = 0.0256 (same)
exp_071 (ChemBERTa): CV = 0.0225 (2.7x worse)

PROBLEMS:
1. Used simple GCN, not GAT with attention
2. No mixture-aware encoding - just linear interpolation
3. No learned solvent representations
4. Possibly wrong architecture for the problem

SOLUTION:
- Implement GAT with attention mechanism
- Add learned mixture-aware encoding layer
- Use DRFP + learned embeddings together
''')
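
To make the GCN vs GAT distinction above concrete, here is a minimal single-head graph attention layer written in plain PyTorch on a dense adjacency matrix. It is a simplified stand-in for illustration, not the PyTorch Geometric `GATConv` layer and not the benchmark's model. `DenseGATLayer` and the toy 3-atom graph are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseGATLayer(nn.Module):
    """Single-head graph attention on a dense adjacency matrix (illustrative)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)
        self.att_dst = nn.Parameter(torch.randn(out_dim) * 0.1)
        self.att_src = nn.Parameter(torch.randn(out_dim) * 0.1)

    def forward(self, x, adj):
        n = x.size(0)
        # Add self-loops so every node attends to at least itself
        mask = adj.bool() | torch.eye(n, dtype=torch.bool, device=x.device)
        h = self.lin(x)                                   # [N, out_dim]
        # e[i, j] = LeakyReLU(a_dst . h_i + a_src . h_j)
        e = (h @ self.att_dst).unsqueeze(1) + (h @ self.att_src).unsqueeze(0)
        e = F.leaky_relu(e, negative_slope=0.2)
        e = e.masked_fill(~mask, float('-inf'))           # ignore non-neighbours
        alpha = torch.softmax(e, dim=1)                   # attention per node
        return alpha @ h, alpha

torch.manual_seed(0)
x = torch.randn(3, 4)                  # 3 atoms, 4 features each
adj = torch.tensor([[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]])        # chain 0-1-2
layer = DenseGATLayer(4, 8)
h, alpha = layer(x, adj)
graph_embedding = h.mean(dim=0)        # mean-pool readout, shape [8]
```

A GCN would use fixed, degree-based weights in place of `alpha`; here the weights are learned per neighbour pair. The pooled `graph_embedding` could then be concatenated with DRFP features and a mixture encoding before the regression head.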

In [None]:
# Priority experiments
print('='*60)
print('PRIORITY EXPERIMENTS')
print('='*60)
print('''
1. HIGHEST PRIORITY: Implement proper GAT with mixture-aware encoding
   - Use PyTorch Geometric GATConv
   - Add learnable mixture embedding layer
   - Combine with DRFP features

2. Alternative: Test-time adaptation
   - Detect extrapolation using nearest neighbor distance
   - Blend predictions toward mean when extrapolating
   - This could reduce the intercept

3. Ensemble with uncertainty weighting
   - Use GP variance to weight predictions
   - Conservative predictions for high-uncertainty cases

DO NOT:
- Try more MLP/LGBM/XGB variants (all on same CV-LB line)
- Try more feature combinations (already optimized)
- Give up (benchmark achieved 0.0039!)
''')
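
Priority 2 above (detect extrapolation by nearest-neighbour distance, then blend toward the mean) can be sketched in NumPy as follows. This is one possible reading of the idea, not a tested recipe. The function names, the median-based distance scale, and the `scale / distance` weighting rule are all assumptions.

```python
import numpy as np

def knn_mean_distance(X_query, X_ref, k, exclude_self=False):
    """Mean Euclidean distance from each query row to its k nearest reference rows."""
    d = np.linalg.norm(X_query[:, None, :] - X_ref[None, :, :], axis=-1)
    if exclude_self:
        np.fill_diagonal(d, np.inf)  # only valid when X_query is X_ref
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

def shrink_toward_mean(X_train, y_train, X_test, preds, k=1):
    """Blend predictions toward the training mean for far-away test points."""
    # Typical in-distribution distance: leave-one-out kNN distance in training
    scale = np.median(knn_mean_distance(X_train, X_train, k, exclude_self=True))
    dist = knn_mean_distance(X_test, X_train, k)
    # Weight 1 inside the typical distance, decaying as scale / dist beyond it
    w = np.clip(scale / np.maximum(dist, 1e-12), 0.0, 1.0)
    return w * preds + (1.0 - w) * y_train.mean(), w

# Toy example: the second test point lies far outside the training range
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0])
X_test = np.array([[1.0], [13.0]])
raw_preds = np.array([1.0, 13.0])
adjusted, weights = shrink_toward_mean(X_train, y_train, X_test, raw_preds)
```

In the toy example the in-distribution prediction is left unchanged, while the far-away one is pulled most of the way toward the training mean of 1.5.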

In [None]:
# Record finding
finding = '''Loop 76 Analysis: Benchmark paper (arXiv:2512.19530) achieved MSE 0.0039 using a hybrid GNN with GAT + DRFP + learned mixture-aware encodings.
Our target (0.0347) is about 9x higher than the benchmark MSE, so it is VERY achievable!
Key missing piece: LEARNED MIXTURE-AWARE SOLVENT ENCODINGS.
Our GNN attempts failed because they used a simple GCN without attention and had no mixture-aware learning.
Priority: Implement a proper GAT architecture with learnable mixture embeddings.'''
print(finding)
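
For priority 3 in the experiment list above (ensemble with uncertainty weighting), one standard option is inverse-variance weighting. The sketch below assumes every ensemble member supplies a per-sample predictive variance, for example a GP posterior variance. `inverse_variance_blend` and the toy numbers are hypothetical.

```python
import numpy as np

def inverse_variance_blend(means, variances, eps=1e-12):
    """Combine member predictions with weights proportional to 1 / variance.

    means, variances: arrays of shape [n_models, n_samples].
    Returns the blended mean and the blended variance per sample.
    """
    means = np.asarray(means, dtype=float)
    precision = 1.0 / (np.asarray(variances, dtype=float) + eps)
    weights = precision / precision.sum(axis=0, keepdims=True)
    blended = (weights * means).sum(axis=0)
    blended_var = 1.0 / precision.sum(axis=0)
    return blended, blended_var

# Two models, one sample: the more certain model dominates the blend
member_means = [[1.0], [3.0]]
member_vars = [[1.0], [3.0]]
blend, blend_var = inverse_variance_blend(member_means, member_vars)
```

Samples whose blended variance stays high are natural candidates for the more conservative predictions mentioned in that list.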