# Loop 49 Analysis: Critical Strategic Assessment

## Current Situation
- **Best CV**: 0.008298 (exp_030)
- **Best LB**: 0.0877 (exp_030)
- **Target**: 0.0347
- **Gap**: best LB is 152.7% above target (need to reduce LB by about 60%)
- **Remaining submissions**: 5

## Key Discovery from Evaluator
The evaluator correctly identified that:
1. Cosine similarity on Spange descriptors is NOT discriminative (all >0.99)
2. The hybrid feature ensemble (exp_048) didn't work because adaptive weighting never triggered
3. We need a different approach to identify OOD solvents
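
One possible reason for the saturated cosine similarities (an assumption, not something the evaluator output confirms) is that raw descriptor vectors share a large common offset, so every pair points in nearly the same direction. The sketch below uses a small synthetic matrix, not real Spange values, to show how standardizing each column first spreads the similarities out. The helper name `min_offdiag_cosine` is hypothetical.

```python
import numpy as np

def min_offdiag_cosine(X):
    """Smallest pairwise cosine similarity between distinct rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = unit @ unit.T
    off_diag = ~np.eye(len(X), dtype=bool)
    return sim[off_diag].min()

# Synthetic stand-in for descriptor rows (NOT real Spange values)
X_raw = np.array([
    [100.0, 50.0, 10.0],
    [101.0, 49.0, 10.5],
    [99.0, 51.0, 9.5],
    [100.5, 50.5, 10.2],
])

# Column-wise standardization removes the shared offset
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

raw_min = min_offdiag_cosine(X_raw)
std_min = min_offdiag_cosine(X_std)
print(f'Min cosine (raw): {raw_min:.4f}')
print(f'Min cosine (standardized): {std_min:.4f}')
```

If the real descriptors behave like this toy case, a distance computed on standardized features may separate solvents better than cosine similarity on raw values.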

## The Core Problem
The linear fit of LB against CV has an intercept (0.0525) above the target (0.0347). If that fit holds, this means:
- Even with CV = 0, we'd get LB ≈ 0.0525
- The target is UNREACHABLE by improving CV alone
- We need to CHANGE the relationship, not just improve CV
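
The arithmetic behind this argument can be sketched by inverting the fitted line. The intercept and target are the values quoted above; the slope of 4.31 is an assumption, approximated from fitting the rounded submission history in this notebook, and `required_cv` is a hypothetical helper.

```python
def required_cv(target_lb, slope, intercept):
    """CV score needed to hit target_lb if LB = slope * CV + intercept holds."""
    return (target_lb - intercept) / slope

INTERCEPT = 0.0525  # quoted above
TARGET = 0.0347     # quoted above
SLOPE = 4.31        # ASSUMPTION: approximate slope of the CV-LB fit

cv_needed = required_cv(TARGET, SLOPE, INTERCEPT)
print(f'CV needed to reach target: {cv_needed:.4f}')
```

A negative result means no achievable CV score maps to the target under this line, which is why the relationship itself has to change.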

In [None]:
import numpy as np
import pandas as pd
from scipy import stats

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
    {'exp': 'exp_035', 'cv': 0.0098, 'lb': 0.0970},
]

df = pd.DataFrame(submissions)
print('Submission history:')
print(df)
print()

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df['cv'], df['lb'])
print(f'CV-LB Relationship: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'Intercept = {intercept:.4f}')
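TARGET = 0.0347  # target LB score
print(f'Target = {TARGET}')
print()
# Only raise the warning when the fit actually supports it
if intercept > TARGET:
    print(f'CRITICAL: Intercept ({intercept:.4f}) > Target ({TARGET})')
    print('Current approach CANNOT reach target!')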
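
Because the intercept is an extrapolation to CV = 0, well outside the observed CV range, it may help to check how uncertain it is. The sketch below refits the same twelve points in plain Python and computes a 95% confidence interval for the intercept using the standard ordinary-least-squares formula. The names `fit_slope`, `fit_intercept`, and `T_CRIT` are illustrative, and the interval assumes independent, roughly normal residuals.

```python
import math

# (CV, LB) pairs copied from the submission history above
cv = [0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
      0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098]
lb = [0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
      0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970]

n = len(cv)
x_mean = sum(cv) / n
y_mean = sum(lb) / n
sxx = sum((x - x_mean) ** 2 for x in cv)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(cv, lb))

fit_slope = sxy / sxx
fit_intercept = y_mean - fit_slope * x_mean

# Residual standard error with n - 2 degrees of freedom
residuals = [y - (fit_slope * x + fit_intercept) for x, y in zip(cv, lb)]
resid_se = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Standard error of the intercept for simple linear regression
intercept_se = resid_se * math.sqrt(1 / n + x_mean ** 2 / sxx)

T_CRIT = 2.228  # two-sided 95% t quantile for 10 degrees of freedom
ci_low = fit_intercept - T_CRIT * intercept_se
ci_high = fit_intercept + T_CRIT * intercept_se
print(f'Intercept = {fit_intercept:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]')
```

If the lower bound stays above the target, the conclusion does not hinge on the point estimate alone, though it still depends on the linear trend holding outside the sampled range.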

In [None]:
# Analyze what the mixall kernel does differently
print('=== MIXALL KERNEL ANALYSIS ===')
print()
print('Key differences from our approach:')
print('1. Uses GroupKFold (5 splits) instead of Leave-One-Out (24 folds)')
print('2. Uses EnsembleModel: MLP + XGBoost + RandomForest + LightGBM')
print('3. Uses Spange descriptors only (no DRFP)')
print('4. Has Optuna hyperparameter tuning')
print()
print('IMPORTANT: The mixall kernel OVERWRITES the utility functions!')
print('This means their local CV is NOT comparable to ours.')
print('BUT: The LB evaluation uses the OFFICIAL scheme.')
print()
print('Hypothesis: The mixall kernel\'s success on LB is due to:')
print('1. The ensemble model (MLP + XGBoost + RF + LightGBM)')
print('2. NOT the validation scheme change')
print()
print('We should try their ensemble approach with our validation scheme!')
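
To make the validation-scheme difference concrete, the sketch below counts folds for both splitters on a placeholder layout. It assumes our "Leave-One-Out" scheme holds out one solvent at a time, which corresponds to scikit-learn's `LeaveOneGroupOut`; the 10 rows per solvent are a made-up size, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Hypothetical layout: 24 solvents with 10 rows each (placeholder sizes)
n_solvents, rows_per_solvent = 24, 10
groups = np.repeat(np.arange(n_solvents), rows_per_solvent)
X = np.zeros((len(groups), 1))  # features are irrelevant for counting folds

logo = LeaveOneGroupOut()
gkf = GroupKFold(n_splits=5)

n_logo = logo.get_n_splits(X, groups=groups)
n_gkf = gkf.get_n_splits(X, groups=groups)

# Largest number of distinct solvents held out in a single GroupKFold fold
max_held_out = max(
    len(np.unique(groups[test_idx]))
    for _, test_idx in gkf.split(X, groups=groups)
)

print(f'LeaveOneGroupOut folds: {n_logo}')
print(f'GroupKFold folds: {n_gkf} (up to {max_held_out} solvents held out per fold)')
```

Holding out several solvents per fold leaves less training data per fold than leave-one-solvent-out, which is one reason the two local CV numbers should not be compared directly.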

In [None]:
# What approaches haven't been tried?
print('=== UNEXPLORED APPROACHES ===')
print()
print('1. MIXALL-STYLE ENSEMBLE (MLP + XGBoost + RF + LightGBM)')
print('   - We have GP + MLP + LGBM, but not XGBoost or RF')
print('   - The mixall kernel uses 4 models with learned weights')
print('   - This is a fundamentally different ensemble structure')
print()
print('2. MANUAL OOD SOLVENT HANDLING')
print('   - Instead of automatic OOD detection, manually identify high-error solvents')
print('   - HFIP, Cyclohexane, TFE are known to be problematic')
print('   - Use simpler features for these specific solvents')
print()
print('3. ENSEMBLE DISAGREEMENT FOR OOD DETECTION')
print('   - Use variance of ensemble predictions as OOD indicator')
print('   - High variance = OOD = use simpler model')
print()
print('4. SOLVENT-SPECIFIC MODELS')
print('   - Train separate models for different solvent types')
print('   - Polar vs non-polar, fluorinated vs non-fluorinated')
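
Approach 3 can be sketched as follows: measure the per-sample spread across ensemble members and fall back to a simpler model where the spread is large. This is a minimal illustration on toy numbers; `route_by_disagreement`, the threshold of 0.1, and all prediction values are hypothetical.

```python
import numpy as np

def route_by_disagreement(member_preds, fallback_preds, threshold):
    """Use the ensemble mean unless the member spread exceeds threshold."""
    member_preds = np.asarray(member_preds, dtype=float)
    spread = member_preds.std(axis=0)  # disagreement per sample
    is_ood = spread > threshold
    final = np.where(is_ood, fallback_preds, member_preds.mean(axis=0))
    return final, is_ood

# Toy predictions from three hypothetical models for four samples
member_preds = [
    [0.50, 0.20, 0.80, 0.10],
    [0.52, 0.22, 0.30, 0.12],
    [0.48, 0.18, 0.55, 0.08],
]
# Toy predictions from a simpler fallback model
fallback_preds = np.array([0.45, 0.25, 0.50, 0.15])

final, is_ood = route_by_disagreement(member_preds, fallback_preds, threshold=0.1)
print('Flagged as OOD:', is_ood)
print('Final predictions:', final)
```

Only the third sample, where the members disagree strongly, is routed to the fallback. In practice the threshold would need tuning inside the cross-validation loop to avoid leaking test information.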

In [None]:
# Load data to analyze per-solvent errors
import sys
sys.path.append('/home/data')

DATA_PATH = '/home/data'
df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
spange_df = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv', index_col=0)

print('=== SOLVENT ANALYSIS ===')
print()
print(f'Number of solvents: {df_single["SOLVENT NAME"].nunique()}')
print()
print('Solvents:')
for s in sorted(df_single['SOLVENT NAME'].unique()):
    print(f'  - {s}')

In [None]:
# Analyze Spange descriptors to understand solvent clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

SPANGE_COLS = [c for c in spange_df.columns if c != 'solvent smiles']
print(f'Spange features: {SPANGE_COLS}')
print()

# Get solvents that are in our dataset
solvents = df_single['SOLVENT NAME'].unique()
spange_data = spange_df.loc[solvents, SPANGE_COLS].values

# Standardize
scaler = StandardScaler()
spange_scaled = scaler.fit_transform(spange_data)

# Cluster into 3 groups
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(spange_scaled)

print('=== SOLVENT CLUSTERS ===')
for i in range(3):
    print(f'\nCluster {i}:')
    for j, s in enumerate(solvents):
        if clusters[j] == i:
            print(f'  - {s}')

In [None]:
# Identify outlier solvents based on distance from cluster centers
from sklearn.metrics import pairwise_distances

# Compute distance to nearest cluster center
distances = pairwise_distances(spange_scaled, kmeans.cluster_centers_)
min_distances = distances.min(axis=1)

# Sort by distance (higher = more outlier)
outlier_scores = list(zip(solvents, min_distances))
outlier_scores.sort(key=lambda x: x[1], reverse=True)

print('=== OUTLIER SOLVENTS (by distance from cluster center) ===')
print()
for s, d in outlier_scores[:10]:
    print(f'{s}: {d:.4f}')

print()
print('Key insight: These are the solvents that are most different from the rest.')
print('They may benefit from simpler features or specialized handling.')

In [None]:
# Analyze the per-solvent errors from exp_048
print('=== PER-SOLVENT ERRORS FROM EXP_048 ===')
print()

# From the experiment output
per_solvent_errors = {
    '1,1,1,3,3,3-Hexafluoropropan-2-ol': 0.038187,  # HFIP
    '2,2,2-Trifluoroethanol': 0.015347,  # TFE
    '2-Methyltetrahydrofuran [2-MeTHF]': 0.002187,
    'Acetonitrile': 0.008555,
    'Acetonitrile.Acetic Acid': 0.021528,
    'Butanone [MEK]': 0.004194,
    'Cyclohexane': 0.004116,
    'DMA [N,N-Dimethylacetamide]': 0.007208,
    'Decanol': 0.012753,
    'Diethyl Ether [Ether]': 0.012611,
    'Dihydrolevoglucosenone (Cyrene)': 0.007900,
    'Dimethyl Carbonate': 0.012755,
    'Ethanol': 0.002654,
    'Ethyl Acetate': 0.001168,
    'Ethyl Lactate': 0.002163,
    'Ethylene Glycol [1,2-Ethanediol]': 0.014847,
    'IPA [Propan-2-ol]': 0.011289,
    'MTBE [tert-Butylmethylether]': 0.007583,
    'Methanol': 0.004234,
    'Methyl Propionate': 0.001243,
    'THF [Tetrahydrofuran]': 0.001263,
    'Water.2,2,2-Trifluoroethanol': 0.004976,
    'Water.Acetonitrile': 0.011855,
    'Water.Ethanol': 0.028236,
}

# Sort by error
sorted_errors = sorted(per_solvent_errors.items(), key=lambda x: x[1], reverse=True)

print('Top 10 highest error solvents:')
for s, e in sorted_errors[:10]:
    print(f'  {s}: {e:.6f}')

print()
print('Top 10 lowest error solvents:')
# Reverse the tail so the lowest error is listed first
for s, e in reversed(sorted_errors[-10:]):
    print(f'  {s}: {e:.6f}')
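
A simple rule can turn the per-solvent table into a candidate list for manual handling. The sketch below flags solvents whose error exceeds a multiple of the median; the factor of 2.5 is an arbitrary assumption, and the error values are the exp_048 figures listed above.

```python
import statistics

# Per-solvent errors from exp_048, copied from the cell above
errors = {
    '1,1,1,3,3,3-Hexafluoropropan-2-ol': 0.038187,
    '2,2,2-Trifluoroethanol': 0.015347,
    '2-Methyltetrahydrofuran [2-MeTHF]': 0.002187,
    'Acetonitrile': 0.008555,
    'Acetonitrile.Acetic Acid': 0.021528,
    'Butanone [MEK]': 0.004194,
    'Cyclohexane': 0.004116,
    'DMA [N,N-Dimethylacetamide]': 0.007208,
    'Decanol': 0.012753,
    'Diethyl Ether [Ether]': 0.012611,
    'Dihydrolevoglucosenone (Cyrene)': 0.007900,
    'Dimethyl Carbonate': 0.012755,
    'Ethanol': 0.002654,
    'Ethyl Acetate': 0.001168,
    'Ethyl Lactate': 0.002163,
    'Ethylene Glycol [1,2-Ethanediol]': 0.014847,
    'IPA [Propan-2-ol]': 0.011289,
    'MTBE [tert-Butylmethylether]': 0.007583,
    'Methanol': 0.004234,
    'Methyl Propionate': 0.001243,
    'THF [Tetrahydrofuran]': 0.001263,
    'Water.2,2,2-Trifluoroethanol': 0.004976,
    'Water.Acetonitrile': 0.011855,
    'Water.Ethanol': 0.028236,
}

FACTOR = 2.5  # ASSUMPTION: arbitrary multiple of the median error
cutoff = FACTOR * statistics.median(errors.values())

# Solvents above the cutoff, highest error first
flagged = sorted(
    (s for s, e in errors.items() if e > cutoff),
    key=errors.get,
    reverse=True,
)
# Share of the summed per-solvent error carried by the flagged solvents
top_share = sum(errors[s] for s in flagged) / sum(errors.values())

print(f'Cutoff: {cutoff:.6f}')
print('Flagged solvents:', flagged)
print(f'Share of summed error: {top_share:.1%}')
```

The share is computed on the unweighted sum of per-solvent errors, so it ignores how many rows each solvent contributes.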

In [None]:
# Strategic recommendation
print('=== STRATEGIC RECOMMENDATION ===')
print()
print('PRIORITY 1: Manual OOD Solvent Handling')
print('  - Identify high-error solvents: HFIP, Water.Ethanol, Acetonitrile.Acetic Acid')
print('  - For these solvents, use simpler features (Spange only, no DRFP)')
print('  - This is a direct implementation of the evaluator\'s suggestion')
print()
print('PRIORITY 2: Mixall-Style Ensemble')
print('  - Try MLP + XGBoost + RF + LightGBM ensemble')
print('  - This is fundamentally different from our GP + MLP + LGBM')
print('  - The mixall kernel achieves good LB with this approach')
print()
print('PRIORITY 3: Ensemble Disagreement for OOD Detection')
print('  - Use variance of ensemble predictions as OOD indicator')
print('  - High variance = OOD = use simpler model')
print()
print('SUBMISSION STRATEGY (5 remaining):')
print('  1. Test manual OOD handling (exp_049)')
print('  2. If promising, submit to verify CV-LB relationship change')
print('  3. If not, try mixall-style ensemble (exp_050)')
print('  4. Save 2-3 submissions for final refinements')