# Loop 7 Strategy Analysis

## Key Observations

### Current State
- **Best CV**: 0.0623 (exp_004, Per-target HGB+ETR NO TTA)
- **Best LB**: 0.0956 (exp_004)
- **CV-LB Gap**: 53% worse on LB
- **Target**: 0.01727 (5.5x better than our best LB)
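
The two headline ratios in this list follow directly from the reported scores. A minimal check, using only the numbers quoted above (the variable names are illustrative):

```python
# Scores copied from the "Current State" list.
best_cv = 0.0623
best_lb = 0.0956
target = 0.01727

# Relative gap: how much worse the LB score is than the CV score.
cv_lb_gap = (best_lb - best_cv) / best_cv

# Improvement factor needed to reach the target from the best LB score.
factor_to_target = best_lb / target

print(f"CV-LB gap: {cv_lb_gap:.1%}")
print(f"Factor to target: {factor_to_target:.2f}x")
```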

### Critical Insights from Kernels

1. **lishellliang kernel** uses GroupKFold (5-fold) instead of Leave-One-Out
2. **Arrhenius kernel** (0.09831 LB) uses TTA for mixed solvents - but our analysis showed TTA HURTS!
3. **System Malfunction V1** (29 votes) uses simple MLP with Spange features
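
To make the first point concrete, the sketch below contrasts the two splitting schemes using scikit-learn's `GroupKFold` and `LeaveOneGroupOut`. The group layout (24 solvents with 5 rows each) is a made-up stand-in, not the real data, and this is not the kernel's actual code.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Hypothetical stand-in: 24 solvent groups with 5 rows each.
n_solvents, rows_per_solvent = 24, 5
groups = np.repeat(np.arange(n_solvents), rows_per_solvent)
X = np.zeros((len(groups), 1))  # feature values do not affect the splits

logo_splits = list(LeaveOneGroupOut().split(X, groups=groups))
gkf_splits = list(GroupKFold(n_splits=5).split(X, groups=groups))


def held_out_counts(splits):
    """Number of distinct groups in the test part of each split."""
    return [len(np.unique(groups[test_idx])) for _, test_idx in splits]


logo_counts = held_out_counts(logo_splits)
gkf_counts = held_out_counts(gkf_splits)

print(f"Leave-one-group-out: {len(logo_splits)} folds, groups held out per fold: {logo_counts[0]}")
print(f"GroupKFold(5): {len(gkf_splits)} folds, groups held out per fold: {gkf_counts}")
```

With GroupKFold each model is trained on fewer solvents and evaluated on several unseen ones at once, so its CV score may behave differently from a leave-one-solvent-out score.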

### The Fundamental Problem
The 5.5x gap to target suggests we're missing something fundamental. Possible causes:
1. Our models don't generalize to chemically unique solvents
2. The test set contains solvents that differ substantially from the training solvents
3. The current model family is a poor fit, and a fundamentally different approach (GP, transfer learning) is needed

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

DATA_PATH = '/home/data'

# Load data
df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
df_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print(f"Single solvent: {len(df_single)} samples, {df_single['SOLVENT NAME'].nunique()} solvents")
print(f"Full data: {len(df_full)} samples")
print(f"\nSolvents in single: {sorted(df_single['SOLVENT NAME'].unique())}")

Single solvent: 656 samples, 24 solvents
Full data: 1227 samples

Solvents in single: ['1,1,1,3,3,3-Hexafluoropropan-2-ol', '2,2,2-Trifluoroethanol', '2-Methyltetrahydrofuran [2-MeTHF]', 'Acetonitrile', 'Acetonitrile.Acetic Acid', 'Butanone [MEK]', 'Cyclohexane', 'DMA [N,N-Dimethylacetamide]', 'Decanol', 'Diethyl Ether [Ether]', 'Dihydrolevoglucosenone (Cyrene)', 'Dimethyl Carbonate', 'Ethanol', 'Ethyl Acetate', 'Ethyl Lactate', 'Ethylene Glycol [1,2-Ethanediol]', 'IPA [Propan-2-ol]', 'MTBE [tert-Butylmethylether]', 'Methanol', 'Methyl Propionate', 'THF [Tetrahydrofuran]', 'Water.2,2,2-Trifluoroethanol', 'Water.Acetonitrile', 'tert-Butanol [2-Methylpropan-2-ol]']


In [2]:
# Analyze the CV-LB gap
# Our CV is 0.0623, LB is 0.0956
# This suggests the LB solvents are harder than our CV folds indicate

# Load Spange descriptors to analyze solvent similarity
spange = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv', index_col=0)
print(f"Spange descriptors: {spange.shape}")
print(f"\nSolvents in Spange: {sorted(spange.index.tolist())}")

Spange descriptors: (26, 13)

Solvents in Spange: ['1,1,1,3,3,3-Hexafluoropropan-2-ol', '2,2,2-Trifluoroethanol', '2-Methyltetrahydrofuran [2-MeTHF]', 'Acetic Acid', 'Acetonitrile', 'Acetonitrile.Acetic Acid', 'Butanone [MEK]', 'Cyclohexane', 'DMA [N,N-Dimethylacetamide]', 'Decanol', 'Diethyl Ether [Ether]', 'Dihydrolevoglucosenone (Cyrene)', 'Dimethyl Carbonate', 'Ethanol', 'Ethyl Acetate', 'Ethyl Lactate', 'Ethylene Glycol [1,2-Ethanediol]', 'IPA [Propan-2-ol]', 'MTBE [tert-Butylmethylether]', 'Methanol', 'Methyl Propionate', 'THF [Tetrahydrofuran]', 'Water', 'Water.2,2,2-Trifluoroethanol', 'Water.Acetonitrile', 'tert-Butanol [2-Methylpropan-2-ol]']


In [3]:
# Calculate pairwise distances between solvents
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
spange_scaled = scaler.fit_transform(spange.values)
spange_scaled_df = pd.DataFrame(spange_scaled, index=spange.index)

# Calculate distances
dist_matrix = euclidean_distances(spange_scaled)
dist_df = pd.DataFrame(dist_matrix, index=spange.index, columns=spange.index)

# Find most unique solvents (highest average distance to others)
avg_distances = dist_df.mean(axis=1).sort_values(ascending=False)
print("Most chemically unique solvents (highest avg distance):")
print(avg_distances.head(10))

Most chemically unique solvents (highest avg distance):
SOLVENT NAME
Water                                7.512371
1,1,1,3,3,3-Hexafluoropropan-2-ol    7.291879
Cyclohexane                          6.211928
Water.2,2,2-Trifluoroethanol         5.985620
2,2,2-Trifluoroethanol               5.823795
Water.Acetonitrile                   5.522566
Ethylene Glycol [1,2-Ethanediol]     4.893534
DMA [N,N-Dimethylacetamide]          4.744892
Decanol                              4.708360
Dihydrolevoglucosenone (Cyrene)      4.697719
dtype: float64
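
Average distance is only one way to measure uniqueness. Note that `dist_df.mean(axis=1)` includes each solvent's zero distance to itself, which scales every mean by the same factor and so does not change the ranking. An alternative that may be more relevant for extrapolation is the distance to the nearest other solvent. A minimal NumPy sketch on made-up standardized descriptors (not the Spange table):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical standardized descriptors: a tight cluster of 5 points plus 1 outlier.
cluster = rng.normal(scale=0.3, size=(5, 3))
outlier = np.array([[4.0, 4.0, 4.0]])
Z = np.vstack([cluster, outlier])

# Pairwise Euclidean distances.
diff = Z[:, None, :] - Z[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Ignore self-distances when looking for the nearest neighbour.
np.fill_diagonal(dist, np.inf)
nearest = dist.min(axis=1)

most_isolated = int(np.argmax(nearest))
print(f"Most isolated row: {most_isolated}, nearest-neighbour distance: {nearest[most_isolated]:.2f}")
```

A solvent with a large nearest-neighbour distance has no close analogue in the training set when it is held out, which is the situation a leave-one-solvent-out fold creates.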


In [4]:
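# Compare with our hardest solvents from CV
# From previous analysis: HFIP (0.145 MAE), Ethylene Glycol (0.122), Water.Acetonitrile (0.112)

# Use the full names from the Spange index; the short forms 'HFIP' and
# 'Ethylene Glycol' are not index entries and were skipped silently.
hardest_cv = [
    '1,1,1,3,3,3-Hexafluoropropan-2-ol',
    'Ethylene Glycol [1,2-Ethanediol]',
    'Water.Acetonitrile',
]
print("\nHardest solvents from CV vs their uniqueness rank:")
for s in hardest_cv:
    if s in avg_distances.index:
        rank = list(avg_distances.index).index(s) + 1
        print(f"  {s}: rank {rank}/{len(avg_distances)}, avg_dist = {avg_distances[s]:.3f}")
    else:
        print(f"  {s}: not found in Spange index")


Hardest solvents from CV vs their uniqueness rank:
  1,1,1,3,3,3-Hexafluoropropan-2-ol: rank 2/26, avg_dist = 7.292
  Ethylene Glycol [1,2-Ethanediol]: rank 7/26, avg_dist = 4.894
  Water.Acetonitrile: rank 6/26, avg_dist = 5.523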


In [5]:
# Hypothesis: the test set may contain solvents that are even more unique
# than our hardest CV solvents, which could explain the 53% CV-LB gap.

# Strategy: We need models that can extrapolate to unseen chemical space
# Options:
# 1. Gaussian Process with Tanimoto kernel (designed for molecular similarity)
# 2. Simpler models with stronger regularization (less overfitting)
# 3. Ensemble of diverse models (reduce variance)

print("\n=== STRATEGY ANALYSIS ===")
print("\n1. Current best approach: Per-target (HGB+ETR) NO TTA")
print("   - CV: 0.0623, LB: 0.0956 (53% gap)")
print("   - Problem: Overfits to training solvents")

print("\n2. Intermediate regularization (exp_007):")
print("   - CV: 0.0689 (worse than exp_004)")
print("   - Not submitted yet - unknown if LB is better")

print("\n3. Ridge baseline (exp_006):")
print("   - CV: 0.0896 (much worse)")
print("   - Too simple - underfits")

print("\n4. UNEXPLORED: Gaussian Process models")
print("   - Specifically designed for small datasets")
print("   - Can use Tanimoto kernel for molecular similarity")
print("   - Better uncertainty quantification")
print("   - May extrapolate better to unseen solvents")


=== STRATEGY ANALYSIS ===

1. Current best approach: Per-target (HGB+ETR) NO TTA
   - CV: 0.0623, LB: 0.0956 (53% gap)
   - Problem: Overfits to training solvents

2. Intermediate regularization (exp_007):
   - CV: 0.0689 (worse than exp_004)
   - Not submitted yet - unknown if LB is better

3. Ridge baseline (exp_006):
   - CV: 0.0896 (much worse)
   - Too simple - underfits

4. UNEXPLORED: Gaussian Process models
   - Specifically designed for small datasets
   - Can use Tanimoto kernel for molecular similarity
   - Better uncertainty quantification
   - May extrapolate better to unseen solvents
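
The Tanimoto kernel mentioned in the strategy list is defined on binary fingerprints rather than on continuous descriptors such as the Spange features, so using it here would require a separate fingerprint representation for each solvent. A minimal NumPy sketch, with made-up fingerprints standing in for real ones (`tanimoto_kernel` is an illustrative helper, not a library function):

```python
import numpy as np


def tanimoto_kernel(A, B):
    """Tanimoto similarity between the rows of binary matrices A and B.

    k(a, b) = <a, b> / (|a|^2 + |b|^2 - <a, b>)
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    inner = A @ B.T
    norm_a = (A * A).sum(axis=1)[:, None]
    norm_b = (B * B).sum(axis=1)[None, :]
    denom = norm_a + norm_b - inner
    # Two all-zero fingerprints are treated as identical (similarity 1).
    safe_denom = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, inner / safe_denom, 1.0)


# Hypothetical 4-bit fingerprints for three molecules.
fps = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
])
K = tanimoto_kernel(fps, fps)
print(np.round(K, 3))
```

The resulting matrix could be passed to a GP implementation that accepts precomputed kernels, or wrapped in a custom kernel class. Either route is an assumption about future work, not something this notebook does.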


In [6]:
# Let's test if GP is feasible with our data size
# GP has O(n^3) complexity, so we need to check if it's tractable

n_single = len(df_single)
n_full = len(df_full)

print(f"Single solvent samples: {n_single}")
print(f"Full data samples: {n_full}")
print(f"\nGP complexity estimate:")
print(f"  Single: {n_single}^3 = {n_single**3:,} operations per fold")
print(f"  Full: {n_full}^3 = {n_full**3:,} operations per fold")
print(f"\nWith 24 folds for single and 13 for full, this is tractable!")

Single solvent samples: 656
Full data samples: 1227

GP complexity estimate:
  Single: 656^3 = 282,300,416 operations per fold
  Full: 1227^3 = 1,847,284,083 operations per fold

With 24 folds for single and 13 for full, this is tractable!


In [7]:
# Test a simple GP with RBF kernel on Spange features
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel, ConstantKernel

# Prepare single solvent data
TARGET_LABELS = ["Product 2", "Product 3", "SM"]
X_single = df_single[["Residence Time", "Temperature", "SOLVENT NAME"]]
Y_single = df_single[TARGET_LABELS]

# Build features
def build_features_single(X, spange_df):
    rt = X['Residence Time'].values.reshape(-1, 1)
    temp = X['Temperature'].values.reshape(-1, 1)
    temp_k = temp + 273.15
    inv_temp = 1000.0 / temp_k
    log_time = np.log(rt + 1e-6)
    interaction = inv_temp * log_time
    
    spange_feats = spange_df.loc[X['SOLVENT NAME']].values
    return np.hstack([rt, temp, inv_temp, log_time, interaction, spange_feats])

X_feat = build_features_single(X_single, spange)
print(f"Feature matrix shape: {X_feat.shape}")

Feature matrix shape: (656, 18)


In [8]:
# Quick GP test on first 3 folds
from sklearn.preprocessing import StandardScaler

def generate_leave_one_out_splits(X, Y):
    for solvent in sorted(X["SOLVENT NAME"].unique()):
        mask = X["SOLVENT NAME"] != solvent
        yield (X[mask], Y[mask]), (X[~mask], Y[~mask])

# Test GP with Matern kernel
# Note: WhiteKernel and the alpha=0.1 passed to the regressor below both add diagonal noise
kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=2.5) + WhiteKernel(noise_level=0.1)

errors = []
for i, ((train_X, train_Y), (test_X, test_Y)) in enumerate(generate_leave_one_out_splits(X_single, Y_single)):
    if i >= 3:  # Quick test
        break
    
    # Build features
    X_train_feat = build_features_single(train_X, spange)
    X_test_feat = build_features_single(test_X, spange)
    
    # Scale
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_feat)
    X_test_scaled = scaler.transform(X_test_feat)
    
    # Train GP for each target
    preds = []
    for j, target in enumerate(TARGET_LABELS):
        gp = GaussianProcessRegressor(kernel=kernel, alpha=0.1, normalize_y=True, n_restarts_optimizer=2)
        gp.fit(X_train_scaled, train_Y[target].values)
        pred = gp.predict(X_test_scaled)
        preds.append(pred)
    
    preds = np.column_stack(preds)
    mae = np.mean(np.abs(preds - test_Y.values))
    errors.append(mae)
    solvent = test_X['SOLVENT NAME'].iloc[0]
    print(f"Fold {i} ({solvent}): MAE = {mae:.4f}")

print(f"\nGP Quick Test MAE: {np.mean(errors):.4f}")

Fold 0 (1,1,1,3,3,3-Hexafluoropropan-2-ol): MAE = 0.1738


Fold 1 (2,2,2-Trifluoroethanol): MAE = 0.0718


Fold 2 (2-Methyltetrahydrofuran [2-MeTHF]): MAE = 0.0508

GP Quick Test MAE: 0.0988


In [9]:
# Compare with our best model (ETR)
from sklearn.ensemble import ExtraTreesRegressor

errors_etr = []
for i, ((train_X, train_Y), (test_X, test_Y)) in enumerate(generate_leave_one_out_splits(X_single, Y_single)):
    if i >= 3:
        break
    
    X_train_feat = build_features_single(train_X, spange)
    X_test_feat = build_features_single(test_X, spange)
    
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_feat)
    X_test_scaled = scaler.transform(X_test_feat)
    
    preds = []
    for j, target in enumerate(TARGET_LABELS):
        etr = ExtraTreesRegressor(n_estimators=200, max_depth=10, min_samples_leaf=2, random_state=42, n_jobs=-1)
        etr.fit(X_train_scaled, train_Y[target].values)
        pred = etr.predict(X_test_scaled)
        preds.append(pred)
    
    preds = np.column_stack(preds)
    mae = np.mean(np.abs(preds - test_Y.values))
    errors_etr.append(mae)
    solvent = test_X['SOLVENT NAME'].iloc[0]
    print(f"Fold {i} ({solvent}): MAE = {mae:.4f}")

print(f"\nETR Quick Test MAE: {np.mean(errors_etr):.4f}")
print(f"GP Quick Test MAE: {np.mean(errors):.4f}")

Fold 0 (1,1,1,3,3,3-Hexafluoropropan-2-ol): MAE = 0.1729


Fold 1 (2,2,2-Trifluoroethanol): MAE = 0.1131


Fold 2 (2-Methyltetrahydrofuran [2-MeTHF]): MAE = 0.0299

ETR Quick Test MAE: 0.1053
GP Quick Test MAE: 0.0988
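
The averages hide how the two models compare fold by fold. A short sketch that recomputes the comparison from the per-fold MAEs printed in the two quick tests (no model is refit here):

```python
# Per-fold MAEs copied from the GP and ETR quick-test outputs.
folds = [
    "1,1,1,3,3,3-Hexafluoropropan-2-ol",
    "2,2,2-Trifluoroethanol",
    "2-Methyltetrahydrofuran [2-MeTHF]",
]
gp_mae = [0.1738, 0.0718, 0.0508]
etr_mae = [0.1729, 0.1131, 0.0299]

# Negative difference means the GP had the lower error on that fold.
diffs = [g - e for g, e in zip(gp_mae, etr_mae)]
gp_wins = sum(d < 0 for d in diffs)
mean_diff = sum(diffs) / len(diffs)

for name, d in zip(folds, diffs):
    print(f"{name}: GP - ETR = {d:+.4f}")
print(f"GP better on {gp_wins}/{len(folds)} folds, mean difference {mean_diff:+.4f}")
```

The GP's lower average comes from a single fold (2,2,2-Trifluoroethanol). With only three folds, this is too little evidence to rank the two models.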


## Key Findings

### What We've Learned
1. **CV-LB Gap is 53%** - consistent with our models overfitting to training solvents
2. **Hardest solvents are chemically unique**: HFIP ranks 2/26, Water.Acetonitrile 6/26 and Ethylene Glycol 7/26 by average Spange distance
3. **TTA hurts** mixed solvent performance
4. **Intermediate regularization** (exp_007) has worse CV but might have better LB

### Unexplored Approaches
1. **Gaussian Process** with molecular kernels
2. **Neural Network** with stronger regularization (dropout, weight decay)
3. **Ensemble of diverse models** (GP + ETR + MLP)
4. **Feature selection** to reduce overfitting

### Recommended Next Steps
1. **Submit exp_007** to test if intermediate regularization helps LB
2. **Run the GP model** (Matern kernel on Spange features) on all 24 folds; the 3-fold quick test gave MAE 0.0988 vs 0.1053 for ETR
3. **Try MLP** with strong regularization (like top kernels)
4. **Ensemble** diverse models for final submission
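
As a rough illustration of the ensemble step, the sketch below averages a GP and an ExtraTrees model with equal weights. The data is synthetic, the 0.5/0.5 weights are an arbitrary starting point, and the hyperparameters are placeholders rather than tuned values from any experiment.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# Synthetic regression data standing in for the real features and targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)
X_train, X_test = X[:120], X[120:]
y_train, y_test = y[:120], y[120:]

kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=2.5) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
etr = ExtraTreesRegressor(n_estimators=100, min_samples_leaf=2, random_state=0)

gp.fit(X_train, y_train)
etr.fit(X_train, y_train)

pred_gp = gp.predict(X_test)
pred_etr = etr.predict(X_test)

# Equal-weight average; the weights are an assumption and could be tuned on CV.
pred_ens = 0.5 * pred_gp + 0.5 * pred_etr


def mae(pred, truth):
    return float(np.mean(np.abs(pred - truth)))


mae_gp, mae_etr, mae_ens = mae(pred_gp, y_test), mae(pred_etr, y_test), mae(pred_ens, y_test)
print(f"GP: {mae_gp:.4f}  ETR: {mae_etr:.4f}  Ensemble: {mae_ens:.4f}")
```

By the triangle inequality, the MAE of an equal-weight average cannot exceed the mean of the two members' MAEs. It is not guaranteed to beat the better member, so the ensemble is worth checking against each model on the full CV.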

In [10]:
# Summary of approaches to try
print("=" * 60)
print("PRIORITY APPROACHES FOR NEXT EXPERIMENTS")
print("=" * 60)

print("\n1. SUBMIT exp_007 (intermediate regularization)")
print("   - CV: 0.0689")
print("   - Test if regularization helps CV-LB gap")
print("   - Uses 1 submission")

print("\n2. Gaussian Process with Matern kernel")
print("   - Better extrapolation to unseen solvents")
print("   - Uncertainty quantification")
print("   - May have smaller CV-LB gap")

print("\n3. MLP with strong regularization")
print("   - Like top kernels (System Malfunction, lishellliang)")
print("   - Dropout 0.2-0.3, weight decay 1e-4")
print("   - BatchNorm for stability")

print("\n4. Ensemble of diverse models")
print("   - GP + ETR + MLP")
print("   - Weighted average or stacking")
print("   - Reduce variance")

print("\n5. Per-target GP models")
print("   - Different kernels for SM vs Products")
print("   - SM might need different treatment (starting material)")

print("\n" + "=" * 60)
print("TARGET: 0.01727 (5.5x better than current best LB 0.0956)")
print("=" * 60)

PRIORITY APPROACHES FOR NEXT EXPERIMENTS

1. SUBMIT exp_007 (intermediate regularization)
   - CV: 0.0689
   - Test if regularization helps CV-LB gap
   - Uses 1 submission

2. Gaussian Process with Matern kernel
   - Better extrapolation to unseen solvents
   - Uncertainty quantification
   - May have smaller CV-LB gap

3. MLP with strong regularization
   - Like top kernels (System Malfunction, lishellliang)
   - Dropout 0.2-0.3, weight decay 1e-4
   - BatchNorm for stability

4. Ensemble of diverse models
   - GP + ETR + MLP
   - Weighted average or stacking
   - Reduce variance

5. Per-target GP models
   - Different kernels for SM vs Products
   - SM might need different treatment (starting material)

TARGET: 0.01727 (5.5x better than current best LB 0.0956)
