# Loop 18 Strategic Analysis

## Current Situation
- Best CV: 0.0623 (exp_004/016)
- Best LB: 0.0956 (exp_004/016)
- Target: 0.01727
- Gap to target: 5.5x
- CV-LB gap: 53% (0.0623 → 0.0956)
- Remaining submissions: 2

## Key Findings from exp_017
- DRFP features HURT performance (0.0681 vs 0.0623)
- DRFP is 97.43% sparse - doesn't work well with tree-based models
- The paper's success came from GNN architecture, not just DRFP features
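
The sparsity figure above is the fraction of fingerprint entries that are exactly zero. As a minimal sketch of that calculation, the block below uses a made-up toy matrix in place of the real DRFP table; `sparsity` and `toy_fingerprints` are hypothetical names introduced only for illustration.

```python
def sparsity(matrix):
    """Return the fraction of entries in a 2-D list that are exactly zero."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        raise ValueError("matrix has no entries")
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


# Toy binary fingerprints (made-up values, not the real DRFP table).
toy_fingerprints = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]
toy_sparsity = sparsity(toy_fingerprints)
print(f"Sparsity: {toy_sparsity:.2%}")
```

For a pandas frame of fingerprints, `(frame.values == 0).mean()` should give the same quantity in one line.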

In [1]:
import pandas as pd
import numpy as np

# Load data to understand the problem better
DATA_PATH = '/home/data'

single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print(f"Single solvent: {single.shape}")
print(f"Full data: {full.shape}")
print(f"\nSingle solvents: {single['SOLVENT NAME'].nunique()}")
print(f"Full solvent pairs: {full[['SOLVENT A NAME', 'SOLVENT B NAME']].drop_duplicates().shape[0]}")

Single solvent: (656, 13)
Full data: (1227, 19)

Single solvents: 24
Full solvent pairs: 13


In [2]:
# Analyze the CV-LB gap pattern
submissions = [
    {'exp': 'exp_004', 'cv': 0.0623, 'lb': 0.0956, 'gap': (0.0956-0.0623)/0.0623},
    {'exp': 'exp_006', 'cv': 0.0688, 'lb': 0.0991, 'gap': (0.0991-0.0688)/0.0688},
    {'exp': 'exp_011', 'cv': 0.0844, 'lb': None, 'gap': None},
    {'exp': 'exp_016', 'cv': 0.0623, 'lb': 0.0956, 'gap': (0.0956-0.0623)/0.0623},
]

print("=== CV-LB Gap Analysis ===")
for s in submissions:
    if s['lb'] is not None:
        print(f"{s['exp']}: CV={s['cv']:.4f}, LB={s['lb']:.4f}, Gap={s['gap']*100:.1f}%")
    else:
        print(f"{s['exp']}: CV={s['cv']:.4f}, LB=pending")

print("\n=== Key Insight ===")
print("The CV-LB gap is large for every scored submission (44-54%)")
print("exp_016 replicates exp_004, so their identical 53.5% gap is one data point, not two")
print("A gap this size suggests the test set contains solvents unlike those in training")
print("Our models are likely NOT generalizing to unseen solvents")

=== CV-LB Gap Analysis ===
exp_004: CV=0.0623, LB=0.0956, Gap=53.5%
exp_006: CV=0.0688, LB=0.0991, Gap=44.0%
exp_011: CV=0.0844, LB=pending
exp_016: CV=0.0623, LB=0.0956, Gap=53.5%

=== Key Insight ===
The CV-LB gap is large for every scored submission (44-54%)
exp_016 replicates exp_004, so their identical 53.5% gap is one data point, not two
A gap this size suggests the test set contains solvents unlike those in training
Our models are likely NOT generalizing to unseen solvents


In [3]:
# What approaches have been tried?
experiments = [
    ('exp_000', 0.0814, 'Baseline ensemble (MLP+XGB+LGB+RF)'),
    ('exp_001', 0.0810, 'Template-compliant ensemble'),
    ('exp_002', 0.0805, 'Simple RF regularized'),
    ('exp_003', 0.0813, 'Per-target HGB+ETR'),
    ('exp_004', 0.0623, 'Per-target NO TTA - BEST'),
    ('exp_005', 0.0896, 'Ridge regression'),
    ('exp_006', 0.0688, 'Intermediate regularization'),
    ('exp_007', 0.0721, 'Gaussian Process'),
    ('exp_008', 0.0673, 'Diverse ensemble'),
    ('exp_009', 0.0669, 'MLP + GBDT ensemble'),
    ('exp_010', 0.0841, 'GroupKFold validation'),
    ('exp_011', 0.0844, 'Template-compliant GroupKFold'),
    ('exp_012', 0.0827, 'LOO ensemble'),
    ('exp_013', 0.0834, 'Optuna per-target'),
    ('exp_014', 0.0834, 'MLP per-target combined'),
    ('exp_015', 0.0928, 'Hybrid task-specific'),
    ('exp_016', 0.0623, 'Replicate exp_004'),
    ('exp_017', 0.0681, 'DRFP ensemble'),
]

print("=== Experiment Summary ===")
for exp, cv, desc in sorted(experiments, key=lambda x: x[1]):
    print(f"{exp}: CV={cv:.4f} - {desc}")

=== Experiment Summary ===
exp_004: CV=0.0623 - Per-target NO TTA - BEST
exp_016: CV=0.0623 - Replicate exp_004
exp_009: CV=0.0669 - MLP + GBDT ensemble
exp_008: CV=0.0673 - Diverse ensemble
exp_017: CV=0.0681 - DRFP ensemble
exp_006: CV=0.0688 - Intermediate regularization
exp_007: CV=0.0721 - Gaussian Process
exp_002: CV=0.0805 - Simple RF regularized
exp_001: CV=0.0810 - Template-compliant ensemble
exp_003: CV=0.0813 - Per-target HGB+ETR
exp_000: CV=0.0814 - Baseline ensemble (MLP+XGB+LGB+RF)
exp_012: CV=0.0827 - LOO ensemble
exp_013: CV=0.0834 - Optuna per-target
exp_014: CV=0.0834 - MLP per-target combined
exp_010: CV=0.0841 - GroupKFold validation
exp_011: CV=0.0844 - Template-compliant GroupKFold
exp_005: CV=0.0896 - Ridge regression
exp_015: CV=0.0928 - Hybrid task-specific


In [4]:
# What HASN'T been tried?
print("=== APPROACHES NOT YET TRIED ===")
print()
print("1. NEURAL NETWORK APPROACHES:")
print("   - GNN (Graph Neural Network) - paper shows this achieves target")
print("   - Transformer-based models (BERT on reaction SMILES)")
print("   - Pre-trained molecular models")
print()
print("2. FEATURE ENGINEERING:")
print("   - Physics-based features (quantum descriptors)")
print("   - Solvent similarity features (Tanimoto to training solvents)")
print("   - Chemical class encoding (alcohols, ethers, etc.)")
print()
print("3. VALIDATION STRATEGIES:")
print("   - Adversarial validation to identify drifting features")
print("   - Solvent-similarity-weighted predictions")
print()
print("4. MODEL ARCHITECTURES:")
print("   - Bayesian neural networks for uncertainty")
print("   - Multi-task learning across targets")
print("   - Attention mechanisms for solvent-reaction interactions")

=== APPROACHES NOT YET TRIED ===

1. NEURAL NETWORK APPROACHES:
   - GNN (Graph Neural Network) - paper shows this achieves target
   - Transformer-based models (BERT on reaction SMILES)
   - Pre-trained molecular models

2. FEATURE ENGINEERING:
   - Physics-based features (quantum descriptors)
   - Solvent similarity features (Tanimoto to training solvents)
   - Chemical class encoding (alcohols, ethers, etc.)

3. VALIDATION STRATEGIES:
   - Adversarial validation to identify drifting features
   - Solvent-similarity-weighted predictions

4. MODEL ARCHITECTURES:
   - Bayesian neural networks for uncertainty
   - Multi-task learning across targets
   - Attention mechanisms for solvent-reaction interactions
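
One of the untried ideas listed above is a "Tanimoto to training solvents" feature. A minimal sketch of how such a feature might be computed is shown below, assuming each solvent is represented by the set of on-bit indices of some binary fingerprint. The fingerprints, `tanimoto`, and `max_similarity_to_training` are hypothetical and only illustrate the idea.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def max_similarity_to_training(query_bits, training_fps):
    """Return (name, similarity) of the most similar training solvent."""
    return max(
        ((name, tanimoto(query_bits, bits)) for name, bits in training_fps.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical on-bit sets; real ones would come from a fingerprint of each solvent.
training_fps = {
    "Ethanol": {1, 4, 9, 12},
    "Methanol": {1, 4, 12},
    "Cyclohexane": {3, 7, 20},
}
query = {1, 4, 9, 15}  # stands in for an unseen solvent
nearest_name, nearest_sim = max_similarity_to_training(query, training_fps)
print(nearest_name, round(nearest_sim, 3))
```

The resulting similarity could be appended as an extra column, or used to flag rows whose solvent has no close training neighbour.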


In [5]:
# Analyze what the research suggests
print("=== RESEARCH INSIGHTS ===")
print()
print("From arxiv:2512.19530 (Catechol benchmark paper):")
print("- GNN + DRFP achieved MSE 0.0039 (our target is 0.01727)")
print("- Key: GNN architecture, not just DRFP features")
print("- Multimodal approach: graph + physicochemical descriptors")
print()
print("From web search on OOD molecular prediction:")
print("- Pre-trained GNNs with transfer learning (8x improvement)")
print("- Task-similarity-driven source selection")
print("- Chemical-aware regularization")
print()
print("From top kernel (lishellliang):")
print("- Uses GroupKFold (5-fold) instead of LOO")
print("- MLP + XGBoost + RF + LightGBM ensemble")
print("- Optuna for hyperparameter optimization")
print("- BUT: Their approach is similar to what we've tried")

=== RESEARCH INSIGHTS ===

From arxiv:2512.19530 (Catechol benchmark paper):
- GNN + DRFP achieved MSE 0.0039 (our target is 0.01727)
- Key: GNN architecture, not just DRFP features
- Multimodal approach: graph + physicochemical descriptors

From web search on OOD molecular prediction:
- Pre-trained GNNs with transfer learning (8x improvement)
- Task-similarity-driven source selection
- Chemical-aware regularization

From top kernel (lishellliang):
- Uses GroupKFold (5-fold) instead of LOO
- MLP + XGBoost + RF + LightGBM ensemble
- Optuna for hyperparameter optimization
- BUT: Their approach is similar to what we've tried
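
The grouped validation mentioned above keeps every row of a given solvent in the same fold, so each validation fold contains only solvents the model never saw in training. scikit-learn's `GroupKFold` provides this. The dependency-free sketch below only illustrates the idea: `grouped_folds` and the toy solvent list are hypothetical, and the greedy balancing is an assumption rather than a description of the kernel's code.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Yield (train_idx, valid_idx) so each group lands in exactly one validation fold."""
    rows_by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        rows_by_group[group].append(idx)
    if n_folds > len(rows_by_group):
        raise ValueError("n_folds cannot exceed the number of distinct groups")
    # Assign the largest groups first, each to the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for group in sorted(rows_by_group, key=lambda g: (-len(rows_by_group[g]), str(g))):
        smallest = min(folds, key=len)
        smallest.extend(rows_by_group[group])
    for valid_idx in folds:
        valid = set(valid_idx)
        train_idx = [i for i in range(len(groups)) if i not in valid]
        yield train_idx, sorted(valid_idx)


# Toy per-row solvent labels (not the real dataset).
solvent_per_row = ["Ethanol", "Ethanol", "Methanol", "THF", "THF", "THF", "Acetonitrile"]
splits = list(grouped_folds(solvent_per_row, n_folds=3))
for train_idx, valid_idx in splits:
    held_out = {solvent_per_row[i] for i in valid_idx}
    print(f"validate on {sorted(held_out)}: {len(train_idx)} train rows")
```

A CV score computed on splits like these is expected to be worse than a random K-fold score, but closer to what an unseen-solvent test set would show.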


In [6]:
# Key strategic question: What's the path to 0.01727?
print("=== PATH TO TARGET ===")
print()
print("Current best: LB 0.0956")
print("Target: 0.01727")
print("Gap: 5.5x")
print()
print("OPTION A: Optimize current approach (LOW POTENTIAL)")
print("- Best CV is 0.0623, best LB is 0.0956")
print("- Even if CV improves to 0.05, LB might only improve to ~0.08")
print("- Tree-based models have hit their ceiling")
print()
print("OPTION B: Neural Network approach (HIGH POTENTIAL)")
print("- Paper shows GNN achieves MSE 0.0039")
print("- Need to implement GNN or use pre-trained models")
print("- BUT: Complex to implement in limited time")
print()
print("OPTION C: Focus on generalization (MEDIUM POTENTIAL)")
print("- The 53% CV-LB gap is the real problem")
print("- If we can reduce the gap, LB will improve")
print("- Try: solvent similarity features, uncertainty weighting")
print()
print("RECOMMENDATION: Try Option C first (faster), then Option B if time permits")

=== PATH TO TARGET ===

Current best: LB 0.0956
Target: 0.01727
Gap: 5.5x

OPTION A: Optimize current approach (LOW POTENTIAL)
- Best CV is 0.0623, best LB is 0.0956
- Even if CV improves to 0.05, LB might only improve to ~0.08
- Tree-based models have hit their ceiling

OPTION B: Neural Network approach (HIGH POTENTIAL)
- Paper shows GNN achieves MSE 0.0039
- Need to implement GNN or use pre-trained models
- BUT: Complex to implement in limited time

OPTION C: Focus on generalization (MEDIUM POTENTIAL)
- The 53% CV-LB gap is the real problem
- If we can reduce the gap, LB will improve
- Try: solvent similarity features, uncertainty weighting

RECOMMENDATION: Try Option C first (faster), then Option B if time permits
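
The "~0.08" estimate in Option A follows from assuming the LB/CV ratio observed for exp_004 stays fixed. The sketch below spells out that arithmetic. The constant-ratio assumption is ours and may not hold for other model families.

```python
best_cv, best_lb = 0.0623, 0.0956   # from exp_004
target = 0.01727

lb_over_cv = best_lb / best_cv       # about 1.53, i.e. the 53% gap
projected_lb = 0.05 * lb_over_cv     # LB if CV fell to 0.05 and the ratio held
required_cv = target / lb_over_cv    # CV needed to reach the target at that ratio

print(f"LB/CV ratio: {lb_over_cv:.3f}")
print(f"Projected LB at CV=0.05: {projected_lb:.4f}")
print(f"CV needed for LB={target}: {required_cv:.4f}")
```

Under this assumption a CV of 0.05 projects to an LB of roughly 0.077, and reaching the target would need a CV near 0.011, which is why Option A alone looks insufficient.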


In [7]:
# Analyze solvent similarity - could this help generalization?
import pandas as pd
import numpy as np

# Load solvent features
spange = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv', index_col=0)
acs_pca = pd.read_csv(f'{DATA_PATH}/acs_pca_descriptors_lookup.csv', index_col=0)
drfp = pd.read_csv(f'{DATA_PATH}/drfps_catechol_lookup.csv', index_col=0)

print(f"Spange solvents: {spange.shape[0]}")
print(f"ACS_PCA solvents: {acs_pca.shape[0]}")
print(f"DRFP solvents: {drfp.shape[0]}")

# Check which solvents are in training data
train_solvents = set(single['SOLVENT NAME'].unique())
print(f"\nTraining solvents: {len(train_solvents)}")
print(train_solvents)

Spange solvents: 26
ACS_PCA solvents: 24
DRFP solvents: 24

Training solvents: 24
{'Water.2,2,2-Trifluoroethanol', 'IPA [Propan-2-ol]', 'Ethyl Lactate', 'MTBE [tert-Butylmethylether]', '2,2,2-Trifluoroethanol', 'tert-Butanol [2-Methylpropan-2-ol]', 'Ethanol', 'Methanol', 'Water.Acetonitrile', 'DMA [N,N-Dimethylacetamide]', 'THF [Tetrahydrofuran]', 'Cyclohexane', 'Diethyl Ether [Ether]', 'Dimethyl Carbonate', 'Ethylene Glycol [1,2-Ethanediol]', '1,1,1,3,3,3-Hexafluoropropan-2-ol', 'Ethyl Acetate', '2-Methyltetrahydrofuran [2-MeTHF]', 'Decanol', 'Dihydrolevoglucosenone (Cyrene)', 'Acetonitrile', 'Methyl Propionate', 'Butanone [MEK]', 'Acetonitrile.Acetic Acid'}


In [8]:
# Calculate solvent similarity matrix using Spange features
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
spange_scaled = scaler.fit_transform(spange)

# Calculate similarity matrix
sim_matrix = cosine_similarity(spange_scaled)
sim_df = pd.DataFrame(sim_matrix, index=spange.index, columns=spange.index)

print("=== Solvent Similarity Analysis ===")
print(f"Similarity matrix shape: {sim_df.shape}")
# Off-diagonal upper triangle: each solvent pair counted once, self-similarity excluded
pairwise = sim_matrix[np.triu_indices_from(sim_matrix, k=1)]
print(f"\nMean similarity: {pairwise.mean():.3f}")
print(f"Min similarity: {pairwise.min():.3f}")
print(f"Max similarity: {pairwise.max():.3f}")

=== Solvent Similarity Analysis ===
Similarity matrix shape: (26, 26)

Mean similarity: -0.016
Min similarity: -0.901
Max similarity: 0.988
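
The mean, min and max above summarize all pairs at once. They do not show how isolated each individual solvent is. A per-solvent nearest-neighbour similarity may be more informative for judging generalization. The sketch below uses made-up descriptor vectors rather than the real Spange values; `cosine` and `nearest_neighbour_similarity` are hypothetical helpers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbour_similarity(descriptors):
    """Map each solvent to (closest other solvent, cosine similarity)."""
    result = {}
    for name, vec in descriptors.items():
        others = [(other, cosine(vec, other_vec))
                  for other, other_vec in descriptors.items() if other != name]
        result[name] = max(others, key=lambda pair: pair[1])
    return result


# Hypothetical standardized descriptors (not the real Spange values).
toy_descriptors = {
    "Ethanol": [1.0, 0.8, -0.2],
    "Methanol": [1.1, 0.9, -0.1],
    "Cyclohexane": [-1.2, -0.9, 0.4],
    "Water": [0.9, 1.5, -1.0],
}
nearest = nearest_neighbour_similarity(toy_descriptors)
# Print the most isolated solvents first.
for name, (other, sim) in sorted(nearest.items(), key=lambda kv: kv[1][1]):
    print(f"{name:12s} closest to {other:12s} (cos={sim:.3f})")
```

Solvents whose best match is still low or negative are the ones a leave-one-solvent-out fold would predict worst.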


In [9]:
# Key insight: If test solvents are very different from training solvents,
# we need features that capture GENERALIZABLE chemical properties

print("=== GENERALIZATION STRATEGY ===")
print()
print("The problem: Test set has UNSEEN solvents")
print("Our models memorize training solvents, don't generalize")
print()
print("Solution approaches:")
print()
print("1. SOLVENT-AGNOSTIC FEATURES:")
print("   - Focus on reaction kinetics (Arrhenius) - already doing this")
print("   - Add more physics-based features (activation energy, etc.)")
print()
print("2. CHEMICAL CLASS ENCODING:")
print("   - Group solvents by chemical class (alcohols, ethers, etc.)")
print("   - Train models that generalize within classes")
print()
print("3. UNCERTAINTY-WEIGHTED PREDICTIONS:")
print("   - Use GP or ensemble variance to identify low-confidence predictions")
print("   - Down-weight predictions for solvents far from training distribution")
print()
print("4. SIMPLER MODELS:")
print("   - Ridge regression had worse CV but might have smaller CV-LB gap")
print("   - Linear models generalize better to OOD data")

=== GENERALIZATION STRATEGY ===

The problem: Test set has UNSEEN solvents
Our models memorize training solvents, don't generalize

Solution approaches:

1. SOLVENT-AGNOSTIC FEATURES:
   - Focus on reaction kinetics (Arrhenius) - already doing this
   - Add more physics-based features (activation energy, etc.)

2. CHEMICAL CLASS ENCODING:
   - Group solvents by chemical class (alcohols, ethers, etc.)
   - Train models that generalize within classes

3. UNCERTAINTY-WEIGHTED PREDICTIONS:
   - Use GP or ensemble variance to identify low-confidence predictions
   - Down-weight predictions for solvents far from training distribution

4. SIMPLER MODELS:
   - Ridge regression had worse CV but might have smaller CV-LB gap
   - Linear models generalize better to OOD data
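
Approach 3 above (down-weighting low-confidence predictions) could be realised by pulling a prediction toward a safe fallback, such as the training mean, whenever ensemble members disagree. The sketch below is one possible form. The weighting formula, `scale`, and all numbers are assumptions chosen for illustration, not a tested recipe.

```python
import statistics


def shrink_by_disagreement(member_preds, fallback, scale):
    """Blend the ensemble mean with a fallback value, trusting the ensemble
    less as its members disagree more.

    weight = 1 / (1 + (std / scale)**2), so std == 0 keeps the ensemble mean
    and std == scale gives an even blend.
    """
    mean = statistics.fmean(member_preds)
    std = statistics.pstdev(member_preds)
    weight = 1.0 / (1.0 + (std / scale) ** 2)
    return weight * mean + (1.0 - weight) * fallback


train_mean_yield = 0.30   # hypothetical mean target over training rows
scale = 0.05              # hypothetical disagreement level giving a 50/50 blend

confident = shrink_by_disagreement([0.60, 0.61, 0.59], train_mean_yield, scale)
uncertain = shrink_by_disagreement([0.20, 0.60, 1.00], train_mean_yield, scale)
print(f"confident: {confident:.3f}, uncertain: {uncertain:.3f}")
```

Both toy ensembles have the same mean of 0.60, but the second is pulled almost all the way back to the fallback because its members disagree strongly.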


In [10]:
# Final recommendation
print("=== FINAL RECOMMENDATION ===")
print()
print("Given:")
print("- 2 submissions remaining")
print("- Best LB is 0.0956, target is 0.01727 (5.5x gap)")
print("- Tree-based models have hit their ceiling")
print("- The 53% CV-LB gap suggests OOD generalization problem")
print()
print("NEXT EXPERIMENT: Try a fundamentally different approach")
print()
print("Option 1: Simple MLP with strong regularization")
print("- MLP can learn non-linear patterns that trees miss")
print("- Strong dropout/weight decay for generalization")
print("- Use ALL feature sets (Spange + ACS_PCA + DRFP)")
print()
print("Option 2: Ensemble with uncertainty weighting")
print("- Train multiple diverse models")
print("- Weight predictions by model confidence")
print("- Down-weight predictions for OOD solvents")
print()
print("Option 3: Focus on reducing CV-LB gap")
print("- Use simpler models (Ridge, linear)")
print("- Add regularization to existing best model")
print("- Test if simpler models have smaller gap")
print()
print("PRIORITY: Option 1 (MLP) - different model family, may capture different patterns")

=== FINAL RECOMMENDATION ===

Given:
- 2 submissions remaining
- Best LB is 0.0956, target is 0.01727 (5.5x gap)
- Tree-based models have hit their ceiling
- The 53% CV-LB gap suggests OOD generalization problem

NEXT EXPERIMENT: Try a fundamentally different approach

Option 1: Simple MLP with strong regularization
- MLP can learn non-linear patterns that trees miss
- Strong dropout/weight decay for generalization
- Use ALL feature sets (Spange + ACS_PCA + DRFP)

Option 2: Ensemble with uncertainty weighting
- Train multiple diverse models
- Weight predictions by model confidence
- Down-weight predictions for OOD solvents

Option 3: Focus on reducing CV-LB gap
- Use simpler models (Ridge, linear)
- Add regularization to existing best model
- Test if simpler models have smaller gap

PRIORITY: Option 1 (MLP) - different model family, may capture different patterns
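
For Option 2, "weight predictions by model confidence" could be implemented as inverse-variance weighting. This is a standard way to combine estimates, though nothing above confirms it is the intended scheme. The sketch assumes each model supplies a variance estimate alongside its prediction; the values shown are hypothetical.

```python
def inverse_variance_blend(predictions, variances, eps=1e-12):
    """Combine per-model predictions with weights proportional to 1 / variance."""
    if len(predictions) != len(variances):
        raise ValueError("predictions and variances must have the same length")
    weights = [1.0 / (var + eps) for var in variances]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


# Hypothetical per-model outputs for one test row.
preds = [0.40, 0.50, 0.80]
variances = [0.01, 0.01, 0.04]   # third model is the least confident
blended = inverse_variance_blend(preds, variances)
print(f"blended prediction: {blended:.4f}")
```

With equal variances this reduces to a plain average, so it can be swapped into an existing ensemble without changing behaviour where the models are equally confident.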
