# Loop 17 LB Feedback Analysis

**exp_017 submitted:** CV 0.0623 → LB 0.0956 (53% gap)

This confirms exp_017 is identical to exp_004 (same CV, same LB).

## Key Questions:
1. Why is there a 53% CV-LB gap?
2. What approaches could reduce this gap?
3. What haven't we tried yet?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data
DATA_PATH = '/home/data'

# Load all datasets
full_data = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')
single_data = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
spange = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv', index_col=0)
acs_pca = pd.read_csv(f'{DATA_PATH}/acs_pca_descriptors_lookup.csv', index_col=0)

print('=== Dataset Sizes ===')
print(f'Full data: {len(full_data)} rows')
print(f'Single solvent: {len(single_data)} rows')
print(f'Spange descriptors: {spange.shape}')
print(f'ACS PCA descriptors: {acs_pca.shape}')

print('\n=== Solvents ===')
print(f'Single solvent unique: {single_data["SOLVENT NAME"].nunique()}')
print(f'Full data unique A: {full_data["SOLVENT A NAME"].nunique()}')
print(f'Full data unique B: {full_data["SOLVENT B NAME"].nunique()}')

print('\n=== Solvent Names ===')
print('Single:', sorted(single_data['SOLVENT NAME'].unique()))
print('Full A:', sorted(full_data['SOLVENT A NAME'].unique()))
print('Full B:', sorted(full_data['SOLVENT B NAME'].unique()))

=== Dataset Sizes ===
Full data: 1227 rows
Single solvent: 656 rows
Spange descriptors: (26, 13)
ACS PCA descriptors: (24, 5)

=== Solvents ===
Single solvent unique: 24
Full data unique A: 13
Full data unique B: 13

=== Solvent Names ===
Single: ['1,1,1,3,3,3-Hexafluoropropan-2-ol', '2,2,2-Trifluoroethanol', '2-Methyltetrahydrofuran [2-MeTHF]', 'Acetonitrile', 'Acetonitrile.Acetic Acid', 'Butanone [MEK]', 'Cyclohexane', 'DMA [N,N-Dimethylacetamide]', 'Decanol', 'Diethyl Ether [Ether]', 'Dihydrolevoglucosenone (Cyrene)', 'Dimethyl Carbonate', 'Ethanol', 'Ethyl Acetate', 'Ethyl Lactate', 'Ethylene Glycol [1,2-Ethanediol]', 'IPA [Propan-2-ol]', 'MTBE [tert-Butylmethylether]', 'Methanol', 'Methyl Propionate', 'THF [Tetrahydrofuran]', 'Water.2,2,2-Trifluoroethanol', 'Water.Acetonitrile', 'tert-Butanol [2-Methylpropan-2-ol]']
Full A: ['1,1,1,3,3,3-Hexafluoropropan-2-ol', '2,2,2-Trifluoroethanol', '2-Methyltetrahydrofuran [2-MeTHF]', 'Acetonitrile', 'Cyclohexane', 'DMA [N,N-Dimethylacetamide

In [2]:
# Analyze CV-LB gap pattern
submissions = [
    {'exp': 'exp_004', 'cv': 0.0623, 'lb': 0.0956, 'model': 'HGB+ETR Per-Target'},
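    # NOTE: the experiment log in cell [3] lists exp_006 (Ridge Baseline) at CV 0.0896
    # and exp_007 at CV 0.0688; confirm which CV value belongs to this submission.
    {'exp': 'exp_006', 'cv': 0.0688, 'lb': 0.0991, 'model': 'Ridge (alpha=10)'},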
    {'exp': 'exp_017', 'cv': 0.0623, 'lb': 0.0956, 'model': 'HGB+ETR Per-Target (replicate)'},
]

df_sub = pd.DataFrame(submissions)
df_sub['gap'] = (df_sub['lb'] - df_sub['cv']) / df_sub['cv'] * 100
df_sub['gap_abs'] = df_sub['lb'] - df_sub['cv']

print('=== Submission Analysis ===')
print(df_sub.to_string(index=False))

print('\n=== Key Observations ===')
print(f'Average CV-LB gap: {df_sub["gap"].mean():.1f}%')
print(f'Average absolute gap: {df_sub["gap_abs"].mean():.4f}')
print(f'Best LB: {df_sub["lb"].min():.4f}')
TARGET_LB = 0.01727  # target leaderboard score
print(f'Target: {TARGET_LB}')
print(f'Gap to target: {(df_sub["lb"].min() - TARGET_LB) / TARGET_LB * 100:.1f}%')

=== Submission Analysis ===
    exp     cv     lb                          model       gap  gap_abs
exp_004 0.0623 0.0956             HGB+ETR Per-Target 53.451043   0.0333
exp_006 0.0688 0.0991               Ridge (alpha=10) 44.040698   0.0303
exp_017 0.0623 0.0956 HGB+ETR Per-Target (replicate) 53.451043   0.0333

=== Key Observations ===
Average CV-LB gap: 50.3%
Average absolute gap: 0.0323
Best LB: 0.0956
Target: 0.01727
Gap to target: 453.6%
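
The two distinct (CV, LB) pairs in the table above allow a rough check of how LB moves with CV. The sketch below fits a straight line through them. With only two points this is an illustration rather than a statistical estimate, and the names `points`, `slope`, `intercept`, and `target` are introduced here for the example. The exp_006 CV value is used exactly as recorded in the submission table.

```python
# Distinct (CV, LB) pairs from the submission table; exp_017 duplicates exp_004.
points = {
    "exp_004": (0.0623, 0.0956),
    "exp_006": (0.0688, 0.0991),
}

(cv_a, lb_a), (cv_b, lb_b) = points["exp_004"], points["exp_006"]

# Line through the two points: lb = slope * cv + intercept
slope = (lb_b - lb_a) / (cv_b - cv_a)
intercept = lb_a - slope * cv_a

target = 0.01727  # target LB score quoted in this notebook

print(f"slope: {slope:.3f}")
print(f"extrapolated LB at CV=0: {intercept:.4f}")
print(f"extrapolated LB at CV=0 is above target: {intercept > target}")
```

Under this two-point extrapolation the intercept sits well above the target, which would be consistent with the later conclusion that lowering CV alone is unlikely to reach it. A third submission with a clearly different CV would be needed before trusting the trend.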


In [3]:
# What approaches have been tried?
experiments = [
    ('001', 'Ensemble (MLP+XGB+LGB+RF)', 0.0814, 'TTA'),
    ('002', 'Ensemble + Poly Features', 0.0810, 'TTA'),
    ('003', 'RandomForest Regularized', 0.0805, 'No TTA'),
    ('004', 'HGB+ETR Per-Target', 0.0623, 'No TTA, Prediction Combination'),
    ('005', 'HGB+ETR Per-Target NO TTA', 0.0623, 'BEST - No TTA'),
    ('006', 'Ridge Baseline', 0.0896, 'Simple linear'),
    ('007', 'HGB+ETR depth=5/7', 0.0688, 'Intermediate regularization'),
    ('008', 'Gaussian Process', 0.0721, 'Matern kernel'),
    ('009', 'Diverse Ensemble', 0.0673, 'PerTarget+RF+XGB+LGB'),
    ('010', 'MLP+GBDT Ensemble', 0.0669, 'Like top kernel'),
    ('011', 'GroupKFold Ensemble', 0.0841, 'FAILED - wrong fold structure'),
    ('012', 'Template-Compliant GroupKFold', 0.0844, 'FAILED - wrong fold structure'),
    ('013', 'LOO Ensemble', 0.0827, 'MLP+GBDT with LOO'),
    ('014', 'Optuna Per-Target', 0.0834, 'Found shallow models - underfit'),
    ('015', 'MLP Per-Target Combined', 0.0891, 'Deep models overfit'),
    ('016', 'Hybrid Task-Specific', 0.0830, 'WRONG - Feature combination'),
    ('017', 'Replicate exp_004', 0.0623, 'Correct - Prediction combination'),
]

print('=== All Experiments ===')
for exp, model, cv, notes in experiments:
    print(f'{exp}: CV={cv:.4f} | {model} | {notes}')

print('\n=== Best Approaches ===')
print('1. exp_004/005/017: HGB+ETR Per-Target with Prediction Combination (CV 0.0623)')
print('2. exp_010: MLP+GBDT Ensemble (CV 0.0669)')
print('3. exp_009: Diverse Ensemble (CV 0.0673)')

=== All Experiments ===
001: CV=0.0814 | Ensemble (MLP+XGB+LGB+RF) | TTA
002: CV=0.0810 | Ensemble + Poly Features | TTA
003: CV=0.0805 | RandomForest Regularized | No TTA
004: CV=0.0623 | HGB+ETR Per-Target | No TTA, Prediction Combination
005: CV=0.0623 | HGB+ETR Per-Target NO TTA | BEST - No TTA
006: CV=0.0896 | Ridge Baseline | Simple linear
007: CV=0.0688 | HGB+ETR depth=5/7 | Intermediate regularization
008: CV=0.0721 | Gaussian Process | Matern kernel
009: CV=0.0673 | Diverse Ensemble | PerTarget+RF+XGB+LGB
010: CV=0.0669 | MLP+GBDT Ensemble | Like top kernel
011: CV=0.0841 | GroupKFold Ensemble | FAILED - wrong fold structure
012: CV=0.0844 | Template-Compliant GroupKFold | FAILED - wrong fold structure
013: CV=0.0827 | LOO Ensemble | MLP+GBDT with LOO
014: CV=0.0834 | Optuna Per-Target | Found shallow models - underfit
015: CV=0.0891 | MLP Per-Target Combined | Deep models overfit
016: CV=0.0830 | Hybrid Task-Specific | WRONG - Feature combination
017: CV=0.0623 | Replicate ex

In [4]:
# What HASN'T been tried?
print('=== APPROACHES NOT YET TRIED ===')
print()
print('1. DIFFERENT FEATURE ENGINEERING:')
print('   - Reaction SMILES encoding (not just solvent descriptors)')
print('   - Morgan fingerprints / ECFP for solvents')
print('   - Quantum-chemical features (COSMO-RS style)')
print('   - Spectroscopy-guided features')
print()
print('2. DIFFERENT MODEL ARCHITECTURES:')
print('   - Graph Neural Networks (GNN) for molecular structure')
print('   - Transformer-based models for reaction SMILES')
print('   - Deep kernel learning (GP + neural network)')
print()
print('3. TRAINING STRATEGIES:')
print('   - Transfer learning from larger chemical datasets')
print('   - Meta-learning for OOD generalization')
print('   - Uncertainty-aware predictions')
print()
print('4. ENSEMBLE STRATEGIES:')
print('   - Stacking with meta-learner')
print('   - Blending with learned weights')
print('   - Ensemble of diverse feature sets')
print()
print('5. REGULARIZATION APPROACHES:')
print('   - Domain adversarial training')
print('   - Mixup / data augmentation')
print('   - Dropout ensemble')

=== APPROACHES NOT YET TRIED ===

1. DIFFERENT FEATURE ENGINEERING:
   - Reaction SMILES encoding (not just solvent descriptors)
   - Morgan fingerprints / ECFP for solvents
   - Quantum-chemical features (COSMO-RS style)
   - Spectroscopy-guided features

2. DIFFERENT MODEL ARCHITECTURES:
   - Graph Neural Networks (GNN) for molecular structure
   - Transformer-based models for reaction SMILES
   - Deep kernel learning (GP + neural network)

3. TRAINING STRATEGIES:
   - Transfer learning from larger chemical datasets
   - Meta-learning for OOD generalization
   - Uncertainty-aware predictions

4. ENSEMBLE STRATEGIES:
   - Stacking with meta-learner
   - Blending with learned weights
   - Ensemble of diverse feature sets

5. REGULARIZATION APPROACHES:
   - Domain adversarial training
   - Mixup / data augmentation
   - Dropout ensemble
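
Of the items listed, "blending with learned weights" is the smallest step from the fixed-weight combination already used in exp_004. The sketch below shows one way to learn a two-model blend weight by least squares. It assumes out-of-fold predictions are available for both models; the function name `fit_blend_weight` and the synthetic arrays are hypothetical and only illustrate the arithmetic.

```python
import numpy as np


def fit_blend_weight(y_true, pred_a, pred_b):
    """Least-squares weight w for w*pred_a + (1-w)*pred_b, clipped to [0, 1]."""
    diff = pred_a - pred_b
    denom = float(np.dot(diff, diff))
    if denom == 0.0:
        return 0.5  # identical predictions: every weight gives the same blend
    w = float(np.dot(y_true - pred_b, diff)) / denom
    return float(np.clip(w, 0.0, 1.0))


def mse(y_true, pred):
    return float(np.mean((y_true - pred) ** 2))


# Synthetic stand-ins for out-of-fold predictions (NOT competition data).
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=200)
pred_a = y + rng.normal(0.0, 0.05, size=200)  # the less noisy model
pred_b = y + rng.normal(0.0, 0.10, size=200)  # the noisier model

w = fit_blend_weight(y, pred_a, pred_b)
blend = w * pred_a + (1.0 - w) * pred_b

print(f"learned weight on model A: {w:.3f}")
print(f"MSE A={mse(y, pred_a):.5f}  B={mse(y, pred_b):.5f}  blend={mse(y, blend):.5f}")
```

The weight should be fitted on held-out (out-of-fold) predictions only. Fitting it on in-sample predictions would favor whichever model overfits most.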


In [5]:
# Analyze what's special about the test set
# The 44-53% CV-LB gap (see the submission table above) suggests the test set contains solvents unlike those seen in training

print('=== HYPOTHESIS: Test Set Has Unseen Solvents ===')
print()
print('Evidence:')
print('1. CV-LB gap is 44-53% for every submitted model')
print('2. More regularization made LB WORSE (exp_006: 0.0991 vs exp_004: 0.0956)')
print('3. This suggests the problem is NOT traditional overfitting')
print('4. The test set likely contains solvents that are chemically different')
print()
print('Implications:')
print('1. Need models that generalize to unseen solvents')
print('2. Focus on chemical/physics features, not solvent-specific patterns')
print('3. Consider transfer learning or meta-learning approaches')
print()
print('=== WHAT COULD HELP ===')
print('1. Physics-informed features (polarity, H-bonding, dielectric constant)')
print('2. Solvent similarity weighting (down-weight distant solvents)')
print('3. Uncertainty-aware predictions (GP or ensemble variance)')
print('4. Transfer learning from larger chemical datasets')

=== HYPOTHESIS: Test Set Has Unseen Solvents ===

Evidence:
1. CV-LB gap is 44-53% for every submitted model
2. More regularization made LB WORSE (exp_006: 0.0991 vs exp_004: 0.0956)
3. This suggests the problem is NOT traditional overfitting
4. The test set likely contains solvents that are chemically different

Implications:
1. Need models that generalize to unseen solvents
2. Focus on chemical/physics features, not solvent-specific patterns
3. Consider transfer learning or meta-learning approaches

=== WHAT COULD HELP ===
1. Physics-informed features (polarity, H-bonding, dielectric constant)
2. Solvent similarity weighting (down-weight distant solvents)
3. Uncertainty-aware predictions (GP or ensemble variance)
4. Transfer learning from larger chemical datasets
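
The "solvent similarity weighting" idea can be sketched as follows: compute each training solvent's distance to the solvent being predicted in standardized descriptor space, then convert distances to weights with an RBF kernel. This is a minimal sketch under stated assumptions. The function name `similarity_weights`, the `bandwidth` parameter, and the two-column descriptor values are hypothetical; real inputs would be rows of the Spange or ACS PCA lookup tables.

```python
import numpy as np


def similarity_weights(train_desc, query_desc, bandwidth=1.0):
    """RBF weights from standardized Euclidean distance to the query solvent."""
    mu = train_desc.mean(axis=0)
    sd = train_desc.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # avoid division by zero on constant columns
    z_train = (train_desc - mu) / sd
    z_query = (query_desc - mu) / sd
    sq_dist = np.sum((z_train - z_query) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


# Made-up 2-D descriptors for three training solvents (illustrative only).
train_desc = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [4.0, 4.0],
])
query_desc = np.array([0.9, 1.1])  # descriptor vector of the solvent to predict

weights = similarity_weights(train_desc, query_desc)
print(weights)  # the second training solvent is closest, so it gets the largest weight
```

Per-row weights like these can be expanded to one weight per training sample and passed as `sample_weight` to estimators whose `fit` method supports it. The bandwidth would need tuning, since a small value effectively discards most of the training data.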


In [6]:
# Check available features in the data
print('=== Available Feature Sets ===')
print()
print('Spange descriptors:', list(spange.columns))
print()
print('ACS PCA descriptors:', list(acs_pca.columns))
print()
print('Numeric features in data:')
print('- Residence Time')
print('- Temperature')
print('- SolventB% (for mixed)')
print()
print('=== Feature Engineering Ideas ===')
print('1. Arrhenius kinetics: inv_temp = 1000/(T+273.15), log_time = ln(t)')
print('2. Interaction: inv_temp * log_time')
print('3. Polynomial: temp^2, time^2, temp*time')
print('4. Solvent mixture: weighted average of descriptors')
print()
print('=== What exp_004 uses ===')
print('- Arrhenius kinetics features (inv_temp, log_time, interaction)')
print('- Spange + ACS_PCA descriptors')
print('- Prediction combination: 0.8*acs_pred + 0.2*spange_pred')

=== Available Feature Sets ===

Spange descriptors: ['dielectric constant', 'ET(30)', 'alpha', 'beta', 'pi*', 'SA', 'SB', 'SP', 'SdP', 'N', 'n', 'f(n)', 'delta']

ACS PCA descriptors: ['PC1', 'PC2', 'PC3', 'PC4', 'PC5']

Numeric features in data:
- Residence Time
- Temperature
- SolventB% (for mixed)

=== Feature Engineering Ideas ===
1. Arrhenius kinetics: inv_temp = 1000/(T+273.15), log_time = ln(t)
2. Interaction: inv_temp * log_time
3. Polynomial: temp^2, time^2, temp*time
4. Solvent mixture: weighted average of descriptors

=== What exp_004 uses ===
- Arrhenius kinetics features (inv_temp, log_time, interaction)
- Spange + ACS_PCA descriptors
- Prediction combination: 0.8*acs_pred + 0.2*spange_pred
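
The feature recipes printed above can be written out as small functions. The sketch below follows the formulas as stated (`inv_temp = 1000/(T+273.15)`, `log_time = ln(t)`, their product, a weighted average of descriptors, and the 0.8/0.2 prediction combination). It is not the exp_004 code: the function names are hypothetical, temperature is assumed to be in degrees Celsius, residence time is assumed positive, and the solvent B share is assumed to be a fraction in [0, 1] (divide by 100 first if the column holds percentages).

```python
import numpy as np


def arrhenius_features(temp_c, res_time):
    """Return columns [inv_temp, log_time, inv_temp * log_time]."""
    inv_temp = 1000.0 / (np.asarray(temp_c, dtype=float) + 273.15)
    log_time = np.log(np.asarray(res_time, dtype=float))
    return np.column_stack([inv_temp, log_time, inv_temp * log_time])


def mixture_descriptors(desc_a, desc_b, frac_b):
    """Weighted average of solvent A and B descriptors, row by row."""
    frac_b = np.asarray(frac_b, dtype=float)[:, None]
    return (1.0 - frac_b) * desc_a + frac_b * desc_b


def combine_predictions(acs_pred, spange_pred, w_acs=0.8):
    """Fixed-weight prediction combination: w_acs*acs + (1 - w_acs)*spange."""
    return w_acs * acs_pred + (1.0 - w_acs) * spange_pred


# Illustrative values only (not rows from the competition data).
kinetics = arrhenius_features(temp_c=[26.85, 100.0], res_time=[1.0, 10.0])
mixed = mixture_descriptors(
    desc_a=np.array([[0.0, 2.0]]),
    desc_b=np.array([[2.0, 4.0]]),
    frac_b=[0.5],
)
combined = combine_predictions(np.array([1.0]), np.array([0.0]))

print(kinetics)
print(mixed)
print(combined)
```

Keeping the combination at the prediction level, rather than concatenating both descriptor sets into one feature matrix, matches the distinction the experiment log draws between exp_017 ("Prediction combination") and exp_016 ("Feature combination").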


In [7]:
# Final analysis: What's the path forward?
print('=== PATH FORWARD ===')
print()
print('CURRENT STATUS:')
print('- Best CV: 0.0623 (exp_004/017)')
print('- Best LB: 0.0956 (exp_004/017)')
print('- Target: 0.01727')
print('- Gap to target: 5.5x')
print('- Submissions remaining: 2')
print()
print('KEY INSIGHT:')
print('The CV-LB gap is large (44-53%) for every submitted model.')
print('This suggests the test set has fundamentally different solvents.')
print('Incremental improvements to CV will NOT close the 5.5x gap to target.')
print()
print('STRATEGIC OPTIONS:')
print()
print('Option A: ENSEMBLE DIVERSITY (Low risk, small improvement)')
print('- Combine exp_004 predictions with other diverse models')
print('- Expected improvement: 5-10%')
print('- Risk: May not improve LB if all models fail on same solvents')
print()
print('Option B: FEATURE ENGINEERING (Medium risk, medium improvement)')
print('- Add physics-informed features (polarity, H-bonding)')
print('- Use Morgan fingerprints for solvents')
print('- Expected improvement: 10-20%')
print('- Risk: May not capture the right chemical properties')
print()
print('Option C: FUNDAMENTALLY DIFFERENT APPROACH (High risk, high reward)')
print('- Graph Neural Networks for molecular structure')
print('- Transformer-based models for reaction SMILES')
print('- Transfer learning from larger chemical datasets')
print('- Expected improvement: 50%+ (if successful)')
print('- Risk: May not work, requires significant implementation effort')
print()
print('RECOMMENDATION:')
print('With 2 submissions remaining, we should:')
print('1. Try Option B (Feature Engineering) - medium risk, medium reward')
print('2. If that fails, try Option C (GNN/Transformer) - high risk, high reward')
print('3. DO NOT submit exp_017 again - it\'s identical to exp_004')

=== PATH FORWARD ===

CURRENT STATUS:
- Best CV: 0.0623 (exp_004/017)
- Best LB: 0.0956 (exp_004/017)
- Target: 0.01727
- Gap to target: 5.5x
- Submissions remaining: 2

KEY INSIGHT:
The CV-LB gap is large (44-53%) for every submitted model.
This suggests the test set has fundamentally different solvents.
Incremental improvements to CV will NOT close the 5.5x gap to target.

STRATEGIC OPTIONS:

Option A: ENSEMBLE DIVERSITY (Low risk, small improvement)
- Combine exp_004 predictions with other diverse models
- Expected improvement: 5-10%
- Risk: May not improve LB if all models fail on same solvents

Option B: FEATURE ENGINEERING (Medium risk, medium improvement)
- Add physics-informed features (polarity, H-bonding)
- Use Morgan fingerprints for solvents
- Expected improvement: 10-20%
- Risk: May not capture the right chemical properties

Option C: FUNDAMENTALLY DIFFERENT APPROACH (High risk, high reward)
- Graph Neural Networks for molecular structure
- Transformer-based models for reaction SMILES