# Loop 49 Analysis: Critical Strategic Assessment

## Current Situation
- **Best CV**: 0.008298 (exp_030)
- **Best LB**: 0.0877 (exp_030)
- **Target**: 0.0347
- **Gap**: best LB is 152.7% above target (LB must drop by about 60%)
- **Remaining submissions**: 5

## Key Discovery from Evaluator
The evaluator correctly identified that:
1. Cosine similarity on Spange descriptors is NOT discriminative (all >0.99)
2. The hybrid feature ensemble (exp_048) didn't work because adaptive weighting never triggered
3. We need a different approach to identify OOD solvents
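
One plausible reason for the near-constant cosine similarity, not confirmed by the evaluator's notes, is that a few large-magnitude descriptors dominate the raw vectors. The sketch below uses made-up three-feature vectors (`X_toy` is hypothetical, not the real Spange table) to show how standardizing each column first can spread the similarities out.

```python
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

# Hypothetical descriptors: one large-magnitude column, two small ones.
X_toy = np.array([
    [40.0, 0.0, 1.0],
    [42.0, 1.0, 0.0],
    [45.0, 0.5, 0.5],
    [50.0, 0.2, 0.9],
])
off_diag = ~np.eye(len(X_toy), dtype=bool)  # ignore self-similarity

raw_sim = cosine_matrix(X_toy)

# Standardize each column (zero mean, unit variance) before comparing.
Z_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
scaled_sim = cosine_matrix(Z_toy)

print(f"raw:    min={raw_sim[off_diag].min():.4f}, max={raw_sim[off_diag].max():.4f}")
print(f"scaled: min={scaled_sim[off_diag].min():.4f}, max={scaled_sim[off_diag].max():.4f}")
```

If the real descriptors behave the same way, a distance computed on standardized features may be a more useful OOD signal than raw cosine similarity.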

## The Core Problem
A linear fit of LB against CV over the 12 submissions has an intercept (0.0525) above the target (0.0347). If that fit holds, then:
- Extrapolating to CV = 0 still predicts LB = 0.0525
- The target is unreachable by improving CV alone
- We need to change the CV-LB relationship itself, not just lower CV
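
The same argument as arithmetic: solving the fitted line for the CV that would hit the target gives a negative number, which no model can achieve. This is a small sketch that hard-codes the slope, intercept, target, and best CV reported in this analysis; the variable names are illustrative.

```python
# Values reported in this analysis (fit over the 12 submissions)
slope, intercept = 4.31, 0.0525
target_lb = 0.0347
best_cv = 0.008298

# CV that the fitted line would require to reach the target LB
cv_needed = (target_lb - intercept) / slope

# LB that the fitted line predicts at the current best CV
lb_at_best_cv = slope * best_cv + intercept

print(f"CV needed to reach target: {cv_needed:.5f}")
print(f"Predicted LB at best CV:   {lb_at_best_cv:.4f}")
```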

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
    {'exp': 'exp_035', 'cv': 0.0098, 'lb': 0.0970},
]

df = pd.DataFrame(submissions)
print('Submission history:')
print(df)
print()

# Linear regression of LB on CV
TARGET = 0.0347
slope, intercept, r_value, p_value, std_err = stats.linregress(df['cv'], df['lb'])
print(f'CV-LB Relationship: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'Intercept = {intercept:.4f}')
print(f'Target = {TARGET}')
print()
# Only raise the warning when the fitted intercept actually exceeds the target
if intercept > TARGET:
    print(f'CRITICAL: Intercept ({intercept:.4f}) > Target ({TARGET})')
    print('Current approach CANNOT reach target!')

Submission history:
        exp      cv      lb
0   exp_000  0.0111  0.0982
1   exp_001  0.0123  0.1065
2   exp_003  0.0105  0.0972
3   exp_005  0.0104  0.0969
4   exp_006  0.0097  0.0946
5   exp_007  0.0093  0.0932
6   exp_009  0.0092  0.0936
7   exp_012  0.0090  0.0913
8   exp_024  0.0087  0.0893
9   exp_026  0.0085  0.0887
10  exp_030  0.0083  0.0877
11  exp_035  0.0098  0.0970

CV-LB Relationship: LB = 4.31 * CV + 0.0525
R² = 0.9505
Intercept = 0.0525
Target = 0.0347

CRITICAL: Intercept (0.0525) > Target (0.0347)
Current approach CANNOT reach target!
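
With only 12 points, the intercept is an extrapolation well outside the observed CV range, so it is worth checking how stable it is. The sketch below is one possible check, not part of the original analysis: it resamples the submission history with replacement and reports a percentile interval for the intercept. The choice of 2000 resamples and the seed are arbitrary assumptions.

```python
import numpy as np

# Submission history from the table above
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])
TARGET = 0.0347

# Point estimate; np.polyfit returns [slope, intercept] for deg=1
slope_hat, intercept_hat = np.polyfit(cv, lb, deg=1)

rng = np.random.default_rng(0)
boot_intercepts = []
for _ in range(2000):
    idx = rng.integers(0, len(cv), size=len(cv))
    if np.ptp(cv[idx]) == 0:
        continue  # skip degenerate resamples with a single distinct CV value
    _, b = np.polyfit(cv[idx], lb[idx], deg=1)
    boot_intercepts.append(b)

lo, hi = np.percentile(boot_intercepts, [2.5, 97.5])
print(f"intercept = {intercept_hat:.4f}, 95% bootstrap interval = [{lo:.4f}, {hi:.4f}]")
print(f"interval lies above target: {lo > TARGET}")
```

If the lower end of the interval still sits above the target, the conclusion that CV improvements alone are insufficient does not hinge on one or two submissions.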


In [2]:
# Analyze what the mixall kernel does differently
print('=== MIXALL KERNEL ANALYSIS ===')
print()
print('Key differences from our approach:')
print('1. Uses GroupKFold (5 splits) instead of Leave-One-Out (24 folds)')
print('2. Uses EnsembleModel: MLP + XGBoost + RandomForest + LightGBM')
print('3. Uses Spange descriptors only (no DRFP)')
print('4. Has Optuna hyperparameter tuning')
print()
print('IMPORTANT: The mixall kernel OVERWRITES the utility functions!')
print('This means their local CV is NOT comparable to ours.')
print('BUT: The LB evaluation uses the OFFICIAL scheme.')
print()
print('Hypothesis: The mixall kernel\'s success on LB is due to:')
print('1. The ensemble model (MLP + XGBoost + RF + LightGBM)')
print('2. NOT the validation scheme change')
print()
print('We should try their ensemble approach with our validation scheme!')

=== MIXALL KERNEL ANALYSIS ===

Key differences from our approach:
1. Uses GroupKFold (5 splits) instead of Leave-One-Out (24 folds)
2. Uses EnsembleModel: MLP + XGBoost + RandomForest + LightGBM
3. Uses Spange descriptors only (no DRFP)
4. Has Optuna hyperparameter tuning

IMPORTANT: The mixall kernel OVERWRITES the utility functions!
This means their local CV is NOT comparable to ours.
BUT: The LB evaluation uses the OFFICIAL scheme.

Hypothesis: The mixall kernel's success on LB is due to:
1. The ensemble model (MLP + XGBoost + RF + LightGBM)
2. NOT the validation scheme change

We should try their ensemble approach with our validation scheme!
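
A minimal sketch of that idea, under several assumptions: the data is synthetic, `GradientBoostingRegressor` stands in for XGBoost and LightGBM so the example needs only scikit-learn, and the equal weights are placeholders rather than the mixall kernel's learned weights. It shows a multi-model average evaluated with leave-one-group-out splits, where each group plays the role of one solvent.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 6 groups ("solvents") with 20 rows each
rng = np.random.default_rng(0)
n_groups, per_group, n_features = 6, 20, 5
groups = np.repeat(np.arange(n_groups), per_group)
X = rng.normal(size=(n_groups * per_group, n_features))
y = 0.5 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=len(X))

def make_members():
    """Fresh, unfitted ensemble members for each fold."""
    return [
        make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)),
        RandomForestRegressor(n_estimators=100, random_state=0),
        GradientBoostingRegressor(random_state=0),  # stand-in for XGBoost / LightGBM
    ]

weights = np.array([1 / 3, 1 / 3, 1 / 3])  # hypothetical fixed weights

# Out-of-fold predictions: each group is held out exactly once
oof = np.zeros_like(y)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    member_preds = np.column_stack([
        m.fit(X[train_idx], y[train_idx]).predict(X[test_idx])
        for m in make_members()
    ])
    oof[test_idx] = member_preds @ weights

mse = np.mean((oof - y) ** 2)
print(f"leave-one-group-out MSE: {mse:.4f}")
```

Keeping the split scheme fixed while swapping the ensemble is what makes the resulting CV comparable to earlier experiments.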


In [3]:
# What approaches haven't been tried?
print('=== UNEXPLORED APPROACHES ===')
print()
print('1. MIXALL-STYLE ENSEMBLE (MLP + XGBoost + RF + LightGBM)')
print('   - We have GP + MLP + LGBM, but not XGBoost or RF')
print('   - The mixall kernel uses 4 models with learned weights')
print('   - This is a fundamentally different ensemble structure')
print()
print('2. MANUAL OOD SOLVENT HANDLING')
print('   - Instead of automatic OOD detection, manually identify high-error solvents')
print('   - HFIP, Cyclohexane, TFE are known to be problematic')
print('   - Use simpler features for these specific solvents')
print()
print('3. ENSEMBLE DISAGREEMENT FOR OOD DETECTION')
print('   - Use variance of ensemble predictions as OOD indicator')
print('   - High variance = OOD = use simpler model')
print()
print('4. SOLVENT-SPECIFIC MODELS')
print('   - Train separate models for different solvent types')
print('   - Polar vs non-polar, fluorinated vs non-fluorinated')

=== UNEXPLORED APPROACHES ===

1. MIXALL-STYLE ENSEMBLE (MLP + XGBoost + RF + LightGBM)
   - We have GP + MLP + LGBM, but not XGBoost or RF
   - The mixall kernel uses 4 models with learned weights
   - This is a fundamentally different ensemble structure

2. MANUAL OOD SOLVENT HANDLING
   - Instead of automatic OOD detection, manually identify high-error solvents
   - HFIP, Cyclohexane, TFE are known to be problematic
   - Use simpler features for these specific solvents

3. ENSEMBLE DISAGREEMENT FOR OOD DETECTION
   - Use variance of ensemble predictions as OOD indicator
   - High variance = OOD = use simpler model

4. SOLVENT-SPECIFIC MODELS
   - Train separate models for different solvent types
   - Polar vs non-polar, fluorinated vs non-fluorinated
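
Approach 3 can be sketched in a few lines. The function below is illustrative only: `blend_by_disagreement`, its arguments, and the threshold are hypothetical names and values, not code from any experiment. It falls back to a simpler model's prediction wherever the ensemble members disagree by more than the threshold.

```python
import numpy as np

def blend_by_disagreement(member_preds, simple_pred, threshold):
    """Use the simple model where ensemble members disagree strongly.

    member_preds: array of shape (n_members, n_samples)
    simple_pred:  array of shape (n_samples,)
    threshold:    spread (std across members) above which a sample is treated as OOD
    """
    member_preds = np.asarray(member_preds, dtype=float)
    simple_pred = np.asarray(simple_pred, dtype=float)
    spread = member_preds.std(axis=0)          # per-sample disagreement
    is_ood = spread > threshold
    blended = np.where(is_ood, simple_pred, member_preds.mean(axis=0))
    return blended, is_ood, spread

# Toy example: members agree on sample 0 and disagree on sample 1
members = [[0.10, 0.50],
           [0.12, 0.10],
           [0.11, 0.90]]
simple = [0.20, 0.30]
blended, is_ood, spread = blend_by_disagreement(members, simple, threshold=0.05)

print("flagged as OOD:", is_ood)
print("fraction flagged:", is_ood.mean())
print("blended predictions:", blended)
```

Printing the flagged fraction on real folds is a cheap way to confirm that the fallback actually triggers, which is the failure mode reported for exp_048.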


In [4]:
# Load the single-solvent yields and Spange descriptors, then list the solvents
import sys
sys.path.append('/home/data')

DATA_PATH = '/home/data'
df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
spange_df = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv', index_col=0)

print('=== SOLVENT ANALYSIS ===')
print()
print(f'Number of solvents: {df_single["SOLVENT NAME"].nunique()}')
print()
print('Solvents:')
for s in sorted(df_single['SOLVENT NAME'].unique()):
    print(f'  - {s}')

=== SOLVENT ANALYSIS ===

Number of solvents: 24

Solvents:
  - 1,1,1,3,3,3-Hexafluoropropan-2-ol
  - 2,2,2-Trifluoroethanol
  - 2-Methyltetrahydrofuran [2-MeTHF]
  - Acetonitrile
  - Acetonitrile.Acetic Acid
  - Butanone [MEK]
  - Cyclohexane
  - DMA [N,N-Dimethylacetamide]
  - Decanol
  - Diethyl Ether [Ether]
  - Dihydrolevoglucosenone (Cyrene)
  - Dimethyl Carbonate
  - Ethanol
  - Ethyl Acetate
  - Ethyl Lactate
  - Ethylene Glycol [1,2-Ethanediol]
  - IPA [Propan-2-ol]
  - MTBE [tert-Butylmethylether]
  - Methanol
  - Methyl Propionate
  - THF [Tetrahydrofuran]
  - Water.2,2,2-Trifluoroethanol
  - Water.Acetonitrile
  - tert-Butanol [2-Methylpropan-2-ol]


In [5]:
# Analyze Spange descriptors to understand solvent clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

SPANGE_COLS = [c for c in spange_df.columns if c != 'solvent smiles']
print(f'Spange features: {SPANGE_COLS}')
print()

# Get solvents that are in our dataset
solvents = df_single['SOLVENT NAME'].unique()
spange_data = spange_df.loc[solvents, SPANGE_COLS].values

# Standardize
scaler = StandardScaler()
spange_scaled = scaler.fit_transform(spange_data)

# Cluster into 3 groups
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(spange_scaled)

print('=== SOLVENT CLUSTERS ===')
for i in range(3):
    print(f'\nCluster {i}:')
    for j, s in enumerate(solvents):
        if clusters[j] == i:
            print(f'  - {s}')

Spange features: ['dielectric constant', 'ET(30)', 'alpha', 'beta', 'pi*', 'SA', 'SB', 'SP', 'SdP', 'N', 'n', 'f(n)', 'delta']



=== SOLVENT CLUSTERS ===

Cluster 0:
  - Methanol
  - Ethylene Glycol [1,2-Ethanediol]
  - Water.Acetonitrile
  - Acetonitrile
  - Acetonitrile.Acetic Acid
  - Water.2,2,2-Trifluoroethanol
  - Ethanol

Cluster 1:
  - 2-Methyltetrahydrofuran [2-MeTHF]
  - Cyclohexane
  - IPA [Propan-2-ol]
  - Diethyl Ether [Ether]
  - DMA [N,N-Dimethylacetamide]
  - Decanol
  - THF [Tetrahydrofuran]
  - Dihydrolevoglucosenone (Cyrene)
  - Ethyl Acetate
  - MTBE [tert-Butylmethylether]
  - Butanone [MEK]
  - tert-Butanol [2-Methylpropan-2-ol]
  - Dimethyl Carbonate
  - Methyl Propionate
  - Ethyl Lactate

Cluster 2:
  - 1,1,1,3,3,3-Hexafluoropropan-2-ol
  - 2,2,2-Trifluoroethanol


In [6]:
# Identify outlier solvents based on distance from cluster centers
from sklearn.metrics import pairwise_distances

# Compute distance to nearest cluster center
distances = pairwise_distances(spange_scaled, kmeans.cluster_centers_)
min_distances = distances.min(axis=1)

# Sort by distance to the nearest center (larger = farther from its own cluster)
outlier_scores = list(zip(solvents, min_distances))
outlier_scores.sort(key=lambda x: x[1], reverse=True)

print('=== OUTLIER SOLVENTS (by distance from cluster center) ===')
print()
for s, d in outlier_scores[:10]:
    print(f'{s}: {d:.4f}')

print()
print('Key insight: These are the solvents that are most different from the rest.')
print('They may benefit from simpler features or specialized handling.')

=== OUTLIER SOLVENTS (by distance from cluster center) ===

Cyclohexane: 4.6286
Water.2,2,2-Trifluoroethanol: 3.5995
DMA [N,N-Dimethylacetamide]: 3.5184
Ethylene Glycol [1,2-Ethanediol]: 3.3913
Water.Acetonitrile: 3.1368
Decanol: 2.8265
Dihydrolevoglucosenone (Cyrene): 2.7578
IPA [Propan-2-ol]: 2.7159
Ethanol: 2.4800
tert-Butanol [2-Methylpropan-2-ol]: 2.4728

Key insight: These are the solvents that are most different from the rest.
They may benefit from simpler features or specialized handling.
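
One caveat about this ranking: it measures distance to the nearest k-means center, so solvents that form their own small cluster (here HFIP and TFE in Cluster 2) sit close to their center and drop out of the list. A complementary score, sketched below with toy data, is the distance from the global centroid in standardized space. `global_outlier_scores` and `X_toy` are hypothetical names introduced for illustration.

```python
import numpy as np

def global_outlier_scores(X):
    """Euclidean distance of each row from the global centroid after standardizing columns."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.linalg.norm(Z, axis=1)

# Toy data: four similar points plus a tight pair far away from them
X_toy = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.1],
    [0.1, 0.1],
    [5.0, 5.0],
    [5.1, 5.1],
], dtype=float)

scores = global_outlier_scores(X_toy)
ranking = np.argsort(-scores)  # most outlying first
print("most outlying rows:", ranking[:2])
```

In the notebook, the equivalent call would be `np.linalg.norm(spange_scaled, axis=1)`, because `StandardScaler` has already centered the descriptors at the origin.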


In [7]:
# Analyze the per-solvent errors from exp_048
print('=== PER-SOLVENT ERRORS FROM EXP_048 ===')
print()

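# Values copied from the exp_048 output.
# Note: this dict lists 'Water.Ethanol' but not 'tert-Butanol [2-Methylpropan-2-ol]',
# unlike the 24 single-solvent names loaded above.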
per_solvent_errors = {
    '1,1,1,3,3,3-Hexafluoropropan-2-ol': 0.038187,  # HFIP
    '2,2,2-Trifluoroethanol': 0.015347,  # TFE
    '2-Methyltetrahydrofuran [2-MeTHF]': 0.002187,
    'Acetonitrile': 0.008555,
    'Acetonitrile.Acetic Acid': 0.021528,
    'Butanone [MEK]': 0.004194,
    'Cyclohexane': 0.004116,
    'DMA [N,N-Dimethylacetamide]': 0.007208,
    'Decanol': 0.012753,
    'Diethyl Ether [Ether]': 0.012611,
    'Dihydrolevoglucosenone (Cyrene)': 0.007900,
    'Dimethyl Carbonate': 0.012755,
    'Ethanol': 0.002654,
    'Ethyl Acetate': 0.001168,
    'Ethyl Lactate': 0.002163,
    'Ethylene Glycol [1,2-Ethanediol]': 0.014847,
    'IPA [Propan-2-ol]': 0.011289,
    'MTBE [tert-Butylmethylether]': 0.007583,
    'Methanol': 0.004234,
    'Methyl Propionate': 0.001243,
    'THF [Tetrahydrofuran]': 0.001263,
    'Water.2,2,2-Trifluoroethanol': 0.004976,
    'Water.Acetonitrile': 0.011855,
    'Water.Ethanol': 0.028236,
}

# Sort by error
sorted_errors = sorted(per_solvent_errors.items(), key=lambda x: x[1], reverse=True)

print('Top 10 highest error solvents:')
for s, e in sorted_errors[:10]:
    print(f'  {s}: {e:.6f}')

print()
print('Top 10 lowest error solvents:')
for s, e in sorted_errors[-10:]:
    print(f'  {s}: {e:.6f}')

=== PER-SOLVENT ERRORS FROM EXP_048 ===

Top 10 highest error solvents:
  1,1,1,3,3,3-Hexafluoropropan-2-ol: 0.038187
  Water.Ethanol: 0.028236
  Acetonitrile.Acetic Acid: 0.021528
  2,2,2-Trifluoroethanol: 0.015347
  Ethylene Glycol [1,2-Ethanediol]: 0.014847
  Dimethyl Carbonate: 0.012755
  Decanol: 0.012753
  Diethyl Ether [Ether]: 0.012611
  Water.Acetonitrile: 0.011855
  IPA [Propan-2-ol]: 0.011289

Top 10 lowest error solvents:
  Water.2,2,2-Trifluoroethanol: 0.004976
  Methanol: 0.004234
  Butanone [MEK]: 0.004194
  Cyclohexane: 0.004116
  Ethanol: 0.002654
  2-Methyltetrahydrofuran [2-MeTHF]: 0.002187
  Ethyl Lactate: 0.002163
  THF [Tetrahydrofuran]: 0.001263
  Methyl Propionate: 0.001243
  Ethyl Acetate: 0.001168
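
To judge how much manual handling of a few solvents could help, it is useful to know what share of the summed error they carry. The helper below is a sketch with a hypothetical name (`error_share`) and toy values; it assumes every solvent counts equally, which may not match the official metric if that metric weights by sample count.

```python
def error_share(per_solvent_errors, top_k=3):
    """Return the top_k highest-error solvents and their share of the summed error."""
    ranked = sorted(per_solvent_errors.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(per_solvent_errors.values())
    top = ranked[:top_k]
    share = sum(err for _, err in top) / total
    return top, share

# Toy values for illustration only
toy_errors = {"A": 0.04, "B": 0.03, "C": 0.02, "D": 0.01}
top, share = error_share(toy_errors, top_k=2)

print("top solvents:", [name for name, _ in top])
print(f"share of summed error: {share:.1%}")
```

Calling `error_share(per_solvent_errors, top_k=3)` on the dictionary above would give the corresponding figure for HFIP, Water.Ethanol, and Acetonitrile.Acetic Acid.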


In [8]:
# Strategic recommendation
print('=== STRATEGIC RECOMMENDATION ===')
print()
print('PRIORITY 1: Manual OOD Solvent Handling')
print('  - Identify high-error solvents: HFIP, Water.Ethanol, Acetonitrile.Acetic Acid')
print('  - For these solvents, use simpler features (Spange only, no DRFP)')
print('  - This is a direct implementation of the evaluator\'s suggestion')
print()
print('PRIORITY 2: Mixall-Style Ensemble')
print('  - Try MLP + XGBoost + RF + LightGBM ensemble')
print('  - This is fundamentally different from our GP + MLP + LGBM')
print('  - The mixall kernel achieves good LB with this approach')
print()
print('PRIORITY 3: Ensemble Disagreement for OOD Detection')
print('  - Use variance of ensemble predictions as OOD indicator')
print('  - High variance = OOD = use simpler model')
print()
print('SUBMISSION STRATEGY (5 remaining):')
print('  1. Test manual OOD handling (exp_049)')
print('  2. If promising, submit to verify CV-LB relationship change')
print('  3. If not, try mixall-style ensemble (exp_050)')
print('  4. Save 2-3 submissions for final refinements')

=== STRATEGIC RECOMMENDATION ===

PRIORITY 1: Manual OOD Solvent Handling
  - Identify high-error solvents: HFIP, Water.Ethanol, Acetonitrile.Acetic Acid
  - For these solvents, use simpler features (Spange only, no DRFP)
  - This is a direct implementation of the evaluator's suggestion

PRIORITY 2: Mixall-Style Ensemble
  - Try MLP + XGBoost + RF + LightGBM ensemble
  - This is fundamentally different from our GP + MLP + LGBM
  - The mixall kernel achieves good LB with this approach

PRIORITY 3: Ensemble Disagreement for OOD Detection
  - Use variance of ensemble predictions as OOD indicator
  - High variance = OOD = use simpler model

SUBMISSION STRATEGY (5 remaining):
  1. Test manual OOD handling (exp_049)
  2. If promising, submit to verify CV-LB relationship change
  3. If not, try mixall-style ensemble (exp_050)
  4. Save 2-3 submissions for final refinements
