# Loop 47 Analysis: Critical Assessment After 46 Experiments

## Current Situation:
- Best CV: 0.008298 (exp_030)
- Best LB: 0.0877 (exp_030)
- Target: 0.0347
- Gap: 2.53x (best LB is 153% above target; LB must fall by about 60%)
- Submissions remaining: 5

## Key Problem:
- CV-LB relationship: LB = 4.31*CV + 0.0525 (R²=0.95)
- Intercept (0.0525) > Target (0.0347)
- If this linear fit holds, even CV=0 would give LB≈0.0525, which is above the target

## What's Been Tried (and Failed):
- GNN (AttentiveFP): 8.4x worse
- ChemBERTa: 25.5% worse
- Stronger regularization: 22% worse
- Mean reversion: 6.5% worse
- Adaptive weighting: 3.6% worse
- Similarity weighting: 220% worse
- Minimal features: 19.9% worse
- Pure GP: 4.8x worse

## What's Working:
- GP + MLP + LGBM ensemble (exp_030): Best LB 0.0877
- Spange descriptors + Arrhenius kinetics
- Simple MLP [32, 16] architecture

In [None]:
# Load submission history and analyze the situation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
    {'exp': 'exp_035', 'cv': 0.0098, 'lb': 0.0970},
]

df = pd.DataFrame(submissions)
print('Submission history:')
print(df)
print()

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df['cv'], df['lb'])
print(f'CV-LB Relationship: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'Intercept = {intercept:.4f}')
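target = 0.0347
print(f'Target = {target:.4f}')
print()
# Only raise the alarm when the fitted intercept really exceeds the target
if intercept > target:
    print(f'CRITICAL: Intercept ({intercept:.4f}) > Target ({target:.4f})')
else:
    print(f'Intercept ({intercept:.4f}) <= Target ({target:.4f})')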
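
The intercept is estimated from only 12 submissions, so it helps to check how uncertain it is before concluding that the target is out of reach. The following is a minimal sketch, assuming the linear model also holds outside the observed CV range and that SciPy is recent enough to expose `intercept_stderr` on the `linregress` result object. It builds a 95% confidence interval for the intercept from the same submission scores.

```python
import numpy as np
from scipy import stats

# CV and LB scores copied from the submission history above
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])
target = 0.0347

# Keep the result object: intercept_stderr is only available as an attribute
fit = stats.linregress(cv, lb)

# Two-sided 95% interval from the t-distribution with n - 2 degrees of freedom
t_crit = stats.t.ppf(0.975, df=len(cv) - 2)
ci_low = fit.intercept - t_crit * fit.intercept_stderr
ci_high = fit.intercept + t_crit * fit.intercept_stderr

print(f'Intercept: {fit.intercept:.4f}, 95% CI: [{ci_low:.4f}, {ci_high:.4f}]')
print(f'Target below the interval: {target < ci_low}')
```

If the target falls below the lower bound, the "intercept > target" conclusion does not hinge on sampling noise in the fit, although it still depends on the linearity assumption.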

In [None]:
# What would it take to reach the target?
target = 0.0347
best_lb = 0.0877
best_cv = 0.0083

print('=== PATH TO TARGET ===')
print()
print(f'Current best LB: {best_lb:.4f}')
print(f'Target: {target:.4f}')
print(f'Gap: best LB is {(best_lb - target) / target * 100:.1f}% above target')
print()
print('Option 1: Improve CV (current approach)')
required_cv = (target - intercept) / slope
print(f'  Required CV: {required_cv:.6f}')
if required_cv < 0:
    print('  IMPOSSIBLE: Required CV is negative!')
print()
print('Option 2: Change the CV-LB relationship')
print(f'  Need to reduce the intercept from {intercept:.4f} to < {target:.4f}')
print('  OR change the slope to make CV improvements more impactful')
print()
print('Option 3: Find a fundamentally different approach')
print('  The GNN benchmark reports MSE 0.0039')
print('  That is about 22x lower than our best LB (0.0877)')
print('  If the two metrics are comparable, much better performance is achievable')
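
Option 2 can be made more concrete by asking how far the intercept would have to move. The sketch below reuses the fitted values reported above (slope 4.31, intercept 0.0525); the helper `required_cv` and the list of hypothetical intercepts are illustrative assumptions, not results from any experiment.

```python
def required_cv(target, slope, intercept):
    """CV score at which LB = slope * CV + intercept reaches the target."""
    return (target - intercept) / slope

target = 0.0347
slope, intercept = 4.31, 0.0525   # fitted values reported above
best_cv = 0.0083

# Option 1: keep the relationship and improve CV only
print(f'Required CV with current fit: {required_cv(target, slope, intercept):.5f}')

# Option 2: keep the slope and the best CV, solve for the intercept instead
needed_intercept = target - slope * best_cv
print(f'Intercept needed at CV={best_cv}: {needed_intercept:.5f}')

# Hypothetical lower intercepts with the same slope
for hypothetical_intercept in (0.03, 0.02, 0.01):
    cv_needed = required_cv(target, slope, hypothetical_intercept)
    print(f'  intercept={hypothetical_intercept:.2f} -> required CV={cv_needed:.5f}')
```

With the fitted slope, the intercept needed at the current best CV comes out slightly negative, so lowering the intercept alone would not be enough either: the intercept and the CV (or the slope) would have to improve together.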

In [None]:
# Analyze what's different about the GNN benchmark
print('=== GNN BENCHMARK ANALYSIS ===')
print()
print('GNN benchmark (arxiv:2512.19530):')
print('  - MSE: 0.0039')
print('  - Architecture: Graph Attention Network')
print('  - Features: DRFP + learned mixture encodings from graph structure')
print('  - Key: Message-passing neural networks on molecular graphs')
print()
print('Our GNN attempt (exp_040):')
print('  - MSE: 0.068767 (8.4x worse than baseline)')
print('  - Architecture: AttentiveFP (single fold test)')
print('  - Problem: Quick test, minimal tuning, single fold')
print()
print('Why the gap?')
print('  1. The benchmark may have used a different CV scheme')
print('  2. The benchmark may have used more training data')
print('  3. The benchmark may have used different hyperparameters')
print('  4. The benchmark may have used pre-training')
print()
print('Key insight: The GNN benchmark suggests much better performance is POSSIBLE')

In [None]:
# What approaches haven't been fully explored?
print('=== UNEXPLORED APPROACHES ===')
print()
print('1. PROPER GNN IMPLEMENTATION')
print('   - exp_040 was a quick test on single fold')
print('   - Need proper hyperparameter tuning')
print('   - Need full CV evaluation')
print('   - Need more epochs and better architecture')
print()
print('2. TRANSFER LEARNING / PRE-TRAINING')
print('   - Pre-train on mixture data, fine-tune on single solvents')
print('   - Use auxiliary tasks to improve representations')
print('   - Competition rules allow different hyperparameters for different tasks')
print()
print('3. SOLVENT CLUSTERING + SPECIALIZED MODELS')
print('   - Cluster solvents by chemical properties')
print('   - Train specialized models for each cluster')
print('   - Use cluster-specific features')
print()
print('4. ENSEMBLE DIVERSITY')
print('   - Current ensemble: GP + MLP + LGBM (all use same features)')
print('   - Try: Different feature sets for different models')
print('   - Try: Models trained on different subsets of data')
print()
print('5. PHYSICS-INFORMED FEATURES')
print('   - Arrhenius kinetics (already using)')
print('   - Solvent-solute interaction energies')
print('   - Transition state theory features')
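
As a starting point for approach 3, solvents could be grouped by hierarchical clustering on standardized descriptors. This is a rough sketch using SciPy's `linkage` and `fcluster`; the solvent names and descriptor values are made up for illustration and are not real Spange descriptors, and the choice of Ward linkage and two clusters is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical descriptor table: rows are solvents, columns are descriptors.
# These numbers are invented for illustration only.
names = ['polar_a', 'polar_b', 'polar_c', 'nonpolar_a', 'nonpolar_b']
X = np.array([
    [0.90, 0.80, 0.10],
    [0.85, 0.75, 0.15],
    [0.95, 0.85, 0.05],
    [0.05, 0.10, 0.90],
    [0.10, 0.05, 0.95],
])

# Standardize each column so no single descriptor dominates the distance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Ward linkage, then cut the tree into at most two clusters
tree = linkage(Z, method='ward')
labels = fcluster(tree, t=2, criterion='maxclust')

for name, label in zip(names, labels):
    print(f'{name}: cluster {label}')
```

A cluster that ends up containing a single solvent cannot support its own specialized model under leave-one-solvent-out evaluation, so outliers would still need a fallback strategy.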

In [None]:
# Load data and analyze the hardest solvents
import sys
sys.path.insert(0, '/home/code')

DATA_PATH = '/home/data'
df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
X_single = df_single[['Residence Time', 'Temperature', 'SOLVENT NAME']]
Y_single = df_single[['SM', 'Product 2', 'Product 3']]

print(f'Single solvent data: {len(df_single)} samples')
print(f'Number of unique solvents: {X_single["SOLVENT NAME"].nunique()}')
print()

# Load Spange descriptors
spange_df = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv', index_col=0)
SPANGE_COLS = [c for c in spange_df.columns if c != 'solvent smiles']

print(f'Spange descriptors: {len(SPANGE_COLS)} features')
print('Features:', SPANGE_COLS)

In [None]:
# Analyze the hardest solvents
print('=== HARDEST SOLVENTS ANALYSIS ===')
print()

# From previous LOO CV analysis: (solvent, MSE, % of total error)
hardest_solvents = [
    ('Cyclohexane', 0.198108, 35.2),
    ('1,1,1,3,3,3-Hexafluoropropan-2-ol', 0.096369, 18.6),
    ('2,2,2-Trifluoroethanol', 0.041910, 8.1),
    ('DMA [N,N-Dimethylacetamide]', 0.037331, 8.0),
]

print('Top 4 hardest solvents (from LOO CV):')
for solvent, mse, pct in hardest_solvents:
    print(f'  {solvent}: MSE={mse:.6f}, {pct}% of total error')

print()
total_pct = sum(pct for _, _, pct in hardest_solvents)
print(f'Total contribution: {total_pct:.1f}% of error from just {len(hardest_solvents)} solvents!')
print()

# Analyze their Spange descriptors
print('Spange descriptors for hardest solvents:')
for solvent, mse, pct in hardest_solvents:
    if solvent in spange_df.index:
        desc = spange_df.loc[solvent, SPANGE_COLS]
        print(f'\n{solvent}:')
        for col in SPANGE_COLS:
            print(f'  {col}: {desc[col]:.3f}')

In [None]:
# Compare hardest vs easiest solvents
print('=== HARDEST vs EASIEST SOLVENTS ===')
print()

easiest_solvents = [
    ('Ethyl Acetate', 0.001),
    ('Methyl Propionate', 0.002),
    ('tert-Butanol', 0.003),
    ('THF', 0.004),
]

print('Comparing Spange descriptors:')
print()

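# Get descriptors for hardest and easiest
hardest_names = [s[0] for s in hardest_solvents]
easiest_names = [s[0] for s in easiest_solvents]

# Names missing from the lookup are silently skipped below, so report them
missing = [s for s in hardest_names + easiest_names if s not in spange_df.index]
if missing:
    print(f'Warning: not found in Spange lookup, skipped: {missing}')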

for col in SPANGE_COLS:
    hardest_vals = [spange_df.loc[s, col] for s in hardest_names if s in spange_df.index]
    easiest_vals = [spange_df.loc[s, col] for s in easiest_names if s in spange_df.index]
    
    if hardest_vals and easiest_vals:
        hardest_mean = np.mean(hardest_vals)
        easiest_mean = np.mean(easiest_vals)
        diff = hardest_mean - easiest_mean
        print(f'{col}: Hardest={hardest_mean:.3f}, Easiest={easiest_mean:.3f}, Diff={diff:.3f}')

In [None]:
# Key insight: What makes Cyclohexane so hard?
print('=== WHY IS CYCLOHEXANE SO HARD? ===')
print()

if 'Cyclohexane' in spange_df.index:
    cyclohexane = spange_df.loc['Cyclohexane', SPANGE_COLS]
    print('Cyclohexane Spange descriptors:')
    for col in SPANGE_COLS:
        val = cyclohexane[col]
        # Compare to mean of all solvents
        mean_val = spange_df[col].mean()
        std_val = spange_df[col].std()
        z_score = (val - mean_val) / std_val if std_val > 0 else 0
        print(f'  {col}: {val:.3f} (z-score: {z_score:.2f})')

print()
print('Key observations:')
print('  - Cyclohexane is NON-POLAR (low polarity descriptors)')
print('  - Most other solvents are POLAR')
print('  - This makes Cyclohexane an OUTLIER in feature space')
print('  - The model has no similar solvents to learn from')
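
Per-descriptor z-scores show how unusual each value is, but the claim that the model "has no similar solvents to learn from" is about distance to the nearest neighbour across all descriptors at once. The sketch below measures that directly. The table `desc` is a hypothetical stand-in with invented values; in the notebook it would be replaced by `spange_df[SPANGE_COLS]`.

```python
import numpy as np
import pandas as pd

# Hypothetical descriptor table (made-up values, not real Spange descriptors)
desc = pd.DataFrame(
    {'d1': [0.90, 0.85, 0.95, 0.80, 0.05],
     'd2': [0.80, 0.75, 0.85, 0.70, 0.10]},
    index=['solvent_a', 'solvent_b', 'solvent_c', 'solvent_d', 'outlier'],
)

# Standardize columns, then compute all pairwise Euclidean distances
z = ((desc - desc.mean()) / desc.std()).to_numpy()
dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)

# Ignore self-distances, then take each solvent's nearest neighbour
np.fill_diagonal(dist, np.inf)
nn_dist = pd.Series(dist.min(axis=1), index=desc.index)
nn_name = pd.Series(desc.index[dist.argmin(axis=1)], index=desc.index)

summary = pd.DataFrame({'nearest': nn_name, 'distance': nn_dist})
print(summary.sort_values('distance', ascending=False))
```

Solvents at the top of this ranking are the ones a leave-one-solvent-out model has to extrapolate to, which gives a data-driven check on the outlier claim.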

In [None]:
# What about HFIP?
print('=== WHY IS HFIP SO HARD? ===')
print()

hfip_name = '1,1,1,3,3,3-Hexafluoropropan-2-ol'
if hfip_name in spange_df.index:
    hfip = spange_df.loc[hfip_name, SPANGE_COLS]
    print('HFIP Spange descriptors:')
    for col in SPANGE_COLS:
        val = hfip[col]
        mean_val = spange_df[col].mean()
        std_val = spange_df[col].std()
        z_score = (val - mean_val) / std_val if std_val > 0 else 0
        print(f'  {col}: {val:.3f} (z-score: {z_score:.2f})')

print()
print('Key observations:')
print('  - HFIP is HIGHLY FLUORINATED')
print('  - Unique electronic properties')
print('  - Strong hydrogen bond donor')
print('  - No similar solvents in training set')

In [None]:
# Summary and recommendations
print('=== SUMMARY AND RECOMMENDATIONS ===')
print()
print('KEY FINDINGS:')
print('1. The CV-LB intercept (0.0525) > Target (0.0347)')
print('   - Current approach CANNOT reach target')
print('   - Need to change the relationship, not just improve CV')
print()
print('2. Top 4 hardest solvents account for 70% of error')
print('   - Cyclohexane (35.2%): Non-polar outlier')
print('   - HFIP (18.6%): Fluorinated outlier')
print('   - TFE (8.1%): Fluorinated')
print('   - DMA (8.0%): Amide')
print()
print('3. Adaptive weighting FAILED (exp_046)')
print('   - Up-weighting hard solvents does NOT help')
print('   - The hard solvents are hard because they are OOD')
print('   - Not because of training imbalance')
print()
print('RECOMMENDED APPROACHES:')
print()
print('1. SOLVENT CLUSTERING + SPECIALIZED MODELS (HIGHEST PRIORITY)')
print('   - Cluster solvents by Spange descriptors')
print('   - Train specialized models for each cluster')
print('   - For outliers (Cyclohexane, HFIP), use nearest neighbor approach')
print()
print('2. NEAREST NEIGHBOR BLENDING')
print('   - For each test solvent, find most similar training solvents')
print('   - Blend predictions from models trained on similar solvents')
print('   - This could help with OOD solvents')
print()
print('3. CONSERVATIVE PREDICTIONS FOR OOD SOLVENTS')
print('   - Detect when test solvent is OOD')
print('   - Use more conservative predictions (e.g., cluster mean)')
print('   - This could reduce extreme errors on Cyclohexane/HFIP')
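
Recommendation 3 could be prototyped as distance-dependent shrinkage: predictions for solvents far from the training set are pulled toward the training mean, while in-distribution predictions are left almost unchanged. The sketch below is one possible form under stated assumptions. The function `shrink_toward_mean`, the exponential weight, the `scale` parameter, and all numeric values are hypothetical.

```python
import numpy as np

def shrink_toward_mean(pred, train_mean, nn_distance, scale=1.0):
    """Blend a model prediction with the training mean.

    The weight on the training mean grows with the distance to the nearest
    training solvent: 0 at distance 0, approaching 1 for distant solvents.
    `scale` is a hypothetical tuning parameter.
    """
    pred = np.asarray(pred, dtype=float)
    train_mean = np.asarray(train_mean, dtype=float)
    w = 1.0 - np.exp(-nn_distance / scale)
    return (1.0 - w) * pred + w * train_mean

train_mean = np.array([0.40, 0.30, 0.30])   # made-up mean of [SM, Product 2, Product 3]
model_pred = np.array([0.05, 0.60, 0.35])   # made-up extreme prediction

near = shrink_toward_mean(model_pred, train_mean, nn_distance=0.1)
far = shrink_toward_mean(model_pred, train_mean, nn_distance=3.0)
print('in-distribution:    ', near.round(3))
print('out-of-distribution:', far.round(3))
```

Because uniform mean reversion was already 6.5% worse, a scheme like this would need to be validated with leave-one-solvent-out CV, with `scale` tuned there, before spending one of the remaining submissions on it.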