# Loop 5 LB Feedback Analysis

## Critical Finding: MASSIVE CV-LB Gap!

- **CV Score**: 0.0623
- **LB Score**: 0.0956
- **Gap**: +0.0333 (LB - CV; the LB error is 53% higher than the CV error)

This is a strong signal that our validation is NOT representative of the test set: either the model overfits the training solvents, or the test distribution differs from them.
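The percentage follows directly from the two scores. A quick check of the arithmetic, with illustrative variable names:

```python
cv_score = 0.0623
lb_score = 0.0956

# Absolute gap (LB minus CV) and gap relative to the CV score.
abs_gap = lb_score - cv_score
rel_gap = abs_gap / cv_score

print(f"Absolute gap: {abs_gap:.4f}")
print(f"Relative gap: {rel_gap:.1%}")
```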

## Hypotheses for the Gap:

1. **Test set has completely different solvents** - not just "unseen" but chemically very different
2. **Our leave-one-solvent-out is too optimistic** - each fold still trains on 23 of the 24 solvents, so a held-out solvent may have close neighbors in the training set
3. **The test set may have different experimental conditions** - different temperature/time ranges
4. **Model is overfitting to training distribution** - need stronger regularization

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

DATA_PATH = '/home/data'

# Load data
df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
df_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print("Single Solvent Data:")
print(f"  Samples: {len(df_single)}")
print(f"  Solvents: {df_single['SOLVENT NAME'].nunique()}")
print(f"  Temperature range: {df_single['Temperature'].min()}-{df_single['Temperature'].max()}")
print(f"  Residence Time range: {df_single['Residence Time'].min()}-{df_single['Residence Time'].max()}")

print("\nFull Data:")
print(f"  Samples: {len(df_full)}")
print(f"  Unique ramps: {df_full[['SOLVENT A NAME', 'SOLVENT B NAME']].drop_duplicates().shape[0]}")

Single Solvent Data:
  Samples: 656
  Solvents: 24
  Temperature range: 175.0-225.0
  Residence Time range: 2.001019108286073-15.017208412882612

Full Data:
  Samples: 1227
  Unique ramps: 13


In [2]:
# Analyze per-solvent errors from our best model
# Load submission and calculate per-fold errors

submission = pd.read_csv('/home/code/experiments/005_no_tta_per_target/submission.csv')
submission_single = submission[submission['task'] == 0]

TARGET_LABELS = ["Product 2", "Product 3", "SM"]

def generate_leave_one_out_splits(X, Y):
    for solvent in sorted(X["SOLVENT NAME"].unique()):
        mask = X["SOLVENT NAME"] != solvent
        yield (X[mask], Y[mask]), (X[~mask], Y[~mask])

X_single = df_single[["Residence Time", "Temperature", "SOLVENT NAME"]]
Y_single = df_single[TARGET_LABELS]

# Calculate per-solvent errors
solvent_errors = []
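for fold_idx, ((_, _), (test_X, test_Y)) in enumerate(generate_leave_one_out_splits(X_single, Y_single)):
    fold_preds = submission_single[submission_single['fold'] == fold_idx]
    # Assumes submission rows are in the same order as the held-out rows,
    # and that target_1..target_3 correspond to TARGET_LABELS in order.
    assert len(fold_preds) == len(test_Y), f"fold {fold_idx}: row count mismatch"
    pred_vals = fold_preds[['target_1', 'target_2', 'target_3']].values
    mae = np.mean(np.abs(pred_vals - test_Y.values))
    solvent = test_X['SOLVENT NAME'].iloc[0]
    solvent_errors.append({'solvent': solvent, 'mae': mae, 'n_samples': len(test_Y)})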

solvent_df = pd.DataFrame(solvent_errors).sort_values('mae', ascending=False)
print("Per-Solvent Errors (sorted by MAE):")
print(solvent_df.to_string())
print(f"\nMean MAE: {solvent_df['mae'].mean():.4f}")
print(f"Std MAE: {solvent_df['mae'].std():.4f}")

Per-Solvent Errors (sorted by MAE):
                               solvent       mae  n_samples
0    1,1,1,3,3,3-Hexafluoropropan-2-ol  0.144621         37
15    Ethylene Glycol [1,2-Ethanediol]  0.122106         22
22                  Water.Acetonitrile  0.111587         37
4             Acetonitrile.Acetic Acid  0.107740         22
1               2,2,2-Trifluoroethanol  0.096203         37
9                Diethyl Ether [Ether]  0.093065         22
3                         Acetonitrile  0.081324         59
8                              Decanol  0.078832         20
21        Water.2,2,2-Trifluoroethanol  0.068718         22
16                   IPA [Propan-2-ol]  0.067360          5
6                          Cyclohexane  0.066064         34
14                       Ethyl Lactate  0.060396         17
10     Dihydrolevoglucosenone (Cyrene)  0.058991         18
11                  Dimethyl Carbonate  0.056497         18
13                       Ethyl Acetate  0.051767         18
18  
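The `Mean MAE` printed by this cell is an unweighted mean over solvents, while fold sizes in the table range from 5 to 59 samples. Whether the leaderboard metric averages per fold or per sample is not stated here, so the sketch below only illustrates how the two averages can differ. The three rows are hypothetical values, not results from the run above.

```python
import numpy as np
import pandas as pd

# Hypothetical per-fold results; illustrative only, not taken from the run above.
toy = pd.DataFrame({
    "solvent": ["A", "B", "C"],
    "mae": [0.14, 0.07, 0.05],
    "n_samples": [37, 5, 59],
})

# Unweighted mean over folds (what `solvent_df['mae'].mean()` computes).
macro_mae = toy["mae"].mean()

# Mean weighted by the number of samples in each fold.
weighted_mae = np.average(toy["mae"], weights=toy["n_samples"])

print(f"Macro (per-solvent) MAE: {macro_mae:.4f}")
print(f"Sample-weighted MAE:     {weighted_mae:.4f}")
```

If the two numbers diverge noticeably on the real `solvent_df`, part of the CV-LB gap could come from the averaging convention rather than from the model.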

In [3]:
# Analyze the hardest solvents - what makes them different?
spange = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv', index_col=0)

# Get the top 5 hardest solvents
hardest = solvent_df.head(5)['solvent'].tolist()
easiest = solvent_df.tail(5)['solvent'].tolist()

print("Hardest solvents to predict:")
for s in hardest:
    if s in spange.index:
        print(f"  {s}: {spange.loc[s].values[:5]}...")
    else:
        print(f"  {s}: NOT IN SPANGE")

print("\nEasiest solvents to predict:")
for s in easiest:
    if s in spange.index:
        print(f"  {s}: {spange.loc[s].values[:5]}...")
    else:
        print(f"  {s}: NOT IN SPANGE")

Hardest solvents to predict:
  1,1,1,3,3,3-Hexafluoropropan-2-ol: [16.7  62.1   1.96  0.    0.65]...
  Ethylene Glycol [1,2-Ethanediol]: [37.   56.3   0.9   0.52  0.92]...
  Water.Acetonitrile: [63.06  56.1    0.778  0.442  0.954]...
  Acetonitrile.Acetic Acid: [21.825 48.65   0.655  0.425  0.695]...
  2,2,2-Trifluoroethanol: [ 8.55 59.8   1.51  0.    0.73]...

Easiest solvents to predict:
  Butanone [MEK]: [17.1  41.3   0.06  0.48  0.67]...
  DMA [N,N-Dimethylacetamide]: [37.8  42.9   0.    0.76  0.88]...
  Methyl Propionate: [ 6.23 39.2   0.    0.51  0.55]...
  MTBE [tert-Butylmethylether]: [ 2.6  34.7   0.    0.45  0.31]...
  THF [Tetrahydrofuran]: [ 7.58 37.4   0.    0.55  0.58]...
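To make the contrast between the two groups easier to read, the printed values can be compared column by column. The sketch below copies the first five descriptor values from the output above; the column names are not shown in that output, so columns are referred to by position only.

```python
import numpy as np

# First five descriptor values, copied from the printed output above.
hardest = np.array([
    [16.7, 62.1, 1.96, 0.0, 0.65],
    [37.0, 56.3, 0.9, 0.52, 0.92],
    [63.06, 56.1, 0.778, 0.442, 0.954],
    [21.825, 48.65, 0.655, 0.425, 0.695],
    [8.55, 59.8, 1.51, 0.0, 0.73],
])
easiest = np.array([
    [17.1, 41.3, 0.06, 0.48, 0.67],
    [37.8, 42.9, 0.0, 0.76, 0.88],
    [6.23, 39.2, 0.0, 0.51, 0.55],
    [2.6, 34.7, 0.0, 0.45, 0.31],
    [7.58, 37.4, 0.0, 0.55, 0.58],
])

hard_mean = hardest.mean(axis=0)
easy_mean = easiest.mean(axis=0)
for col, (h, e) in enumerate(zip(hard_mean, easy_mean)):
    print(f"column {col}: hardest mean {h:.3f}, easiest mean {e:.3f}")

# Columns where the two groups do not overlap at all.
separated = [
    col for col in range(hardest.shape[1])
    if hardest[:, col].min() > easiest[:, col].max()
    or hardest[:, col].max() < easiest[:, col].min()
]
print("Fully separated columns:", separated)
```

With only five solvents per group this is suggestive rather than conclusive, but it points at which descriptor ranges the training set covers poorly.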


In [4]:
# Calculate distance from each solvent to the centroid of all other solvents.
# This measures how "unique" each solvent is. Note: distances use the raw
# descriptor values, so columns with large numeric ranges dominate.

solvent_names = spange.index.tolist()
features = spange.values

# For each solvent, calculate distance to centroid of others
uniqueness = []
for i, solvent in enumerate(solvent_names):
    others = np.delete(features, i, axis=0)
    centroid = others.mean(axis=0)
    dist = np.linalg.norm(features[i] - centroid)
    uniqueness.append({'solvent': solvent, 'uniqueness': dist})

uniqueness_df = pd.DataFrame(uniqueness).sort_values('uniqueness', ascending=False)

# Merge with errors
merged = solvent_df.merge(uniqueness_df, on='solvent')
print("Correlation between uniqueness and MAE:")
print(f"  Pearson: {merged['mae'].corr(merged['uniqueness']):.3f}")

print("\nMost unique solvents:")
print(uniqueness_df.head(10).to_string())

Correlation between uniqueness and MAE:
  Pearson: 0.200

Most unique solvents:
                             solvent  uniqueness
10                             Water   69.161804
23                Water.Acetonitrile   47.938868
25      Water.2,2,2-Trifluoroethanol   39.088714
0                        Cyclohexane   26.165102
12      MTBE [tert-Butylmethylether]   23.942677
11             Diethyl Ether [Ether]   22.605887
8   Ethylene Glycol [1,2-Ethanediol]   21.025698
13                Dimethyl Carbonate   20.538193
17   Dihydrolevoglucosenone (Cyrene)   18.685958
16            2,2,2-Trifluoroethanol   18.662401
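The uniqueness score above is computed on raw descriptor values, so a column with values in the tens can swamp columns that live between 0 and 2. A minimal sketch of a scale-aware variant follows: it z-scores each column before measuring the distance to the centroid of the other rows. The function name `scaled_uniqueness` and the toy table are hypothetical, and whether this changes the correlation with MAE on the real data has not been checked here.

```python
import numpy as np
import pandas as pd

def scaled_uniqueness(descriptors: pd.DataFrame) -> pd.Series:
    """Distance of each row to the centroid of the other rows, after z-scoring columns."""
    values = descriptors.to_numpy(dtype=float)
    std = values.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    z = (values - values.mean(axis=0)) / std
    n = len(z)
    # Centroid of all rows except row i, for every i at once.
    centroids = (z.sum(axis=0) - z) / (n - 1)
    dist = np.linalg.norm(z - centroids, axis=1)
    return pd.Series(dist, index=descriptors.index, name="uniqueness")

# Hypothetical table: one large-scale column and one small-scale column.
toy = pd.DataFrame(
    {"big": [80.0, 10.0, 12.0, 11.0], "small": [0.5, 0.5, 0.5, 2.0]},
    index=["w", "x", "y", "z"],
)
result = scaled_uniqueness(toy)
print(result.sort_values(ascending=False))
```

In the toy table, row `z` is an outlier only in the small-scale column. After scaling it ranks as clearly more unique than `x`, which raw Euclidean distance would not show.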


In [5]:
# Hypothesis: the test set contains solvents that are less like the training
# solvents than any held-out solvent is in our CV.
# Our CV may be optimistic because each fold still trains on 23 of 24 solvents.

# What if we used a more aggressive CV strategy,
# e.g., leave-5-solvents-out instead of leave-1-out?

print("\n=== ANALYSIS SUMMARY ===")
print("\n1. CV-LB Gap Analysis:")
print("   CV: 0.0623, LB: 0.0956, Gap: 53% worse")
print("   This suggests our validation is too optimistic")

print("\n2. Hardest solvents are not simply the most unique ones:")
print("   HFIP, Ethylene Glycol, Water.Acetonitrile have the highest MAE")
print("   Uniqueness correlates only weakly with MAE (Pearson 0.200)")

print("\n3. Potential solutions:")
print("   a) Use simpler models with stronger regularization")
print("   b) Focus on features that generalize to unseen solvents")
print("   c) Try higher-dimensional features (DRFP, fragprints)")
print("   d) Use ensemble of diverse models")
print("   e) Consider Gaussian Process with chemistry kernels")


=== ANALYSIS SUMMARY ===

1. CV-LB Gap Analysis:
   CV: 0.0623, LB: 0.0956, Gap: 53% worse
   This suggests our validation is too optimistic

2. Hardest solvents are not simply the most unique ones:
   HFIP, Ethylene Glycol, Water.Acetonitrile have the highest MAE
   Uniqueness correlates only weakly with MAE (Pearson 0.200)

3. Potential solutions:
   a) Use simpler models with stronger regularization
   b) Focus on features that generalize to unseen solvents
   c) Try higher-dimensional features (DRFP, fragprints)
   d) Use ensemble of diverse models
   e) Consider Gaussian Process with chemistry kernels
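The leave-5-solvents-out idea raised in this cell could be sketched as below. This is one possible design under stated assumptions, not a tested replacement for the current splitter: the function name, the random grouping, and the `k`, `n_splits` and `seed` parameters are all hypothetical. It reuses the `SOLVENT NAME` column from the cells above.

```python
import numpy as np
import pandas as pd

def leave_k_solvents_out(X, Y, k=5, n_splits=10, seed=0):
    """Yield (train, test) pairs where k randomly chosen solvents are held out together."""
    rng = np.random.default_rng(seed)
    solvents = np.array(sorted(X["SOLVENT NAME"].unique()))
    for _ in range(n_splits):
        held_out = rng.choice(solvents, size=k, replace=False)
        test_mask = X["SOLVENT NAME"].isin(held_out)
        yield (X[~test_mask], Y[~test_mask]), (X[test_mask], Y[test_mask])

# Toy data with six hypothetical solvents, two rows each.
toy_X = pd.DataFrame({
    "SOLVENT NAME": list("AABBCCDDEEFF"),
    "Temperature": np.linspace(175, 225, 12),
})
toy_Y = pd.DataFrame({"SM": np.linspace(0, 1, 12)})

splits = list(leave_k_solvents_out(toy_X, toy_Y, k=2, n_splits=3))
for (train_X, _), (test_X, _) in splits:
    print(sorted(test_X["SOLVENT NAME"].unique()), len(train_X), len(test_X))
```

Random groups give a noisier but harsher estimate than leave-one-out. Holding out chemically related solvents together would be harsher still, and `sklearn.model_selection.GroupKFold` is an alternative when every solvent should be tested exactly once.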


In [6]:
# Look at what public kernels achieve
# The best public kernel (Arrhenius + TTA) gets LB 0.09831
# Our LB is 0.0956 - slightly better!

# But the target is 0.01727, about 5.5x lower than our LB.
# This suggests a fundamentally different approach is needed.

print("\n=== LEADERBOARD CONTEXT ===")
print(f"Our LB: 0.0956")
print(f"Best public kernel: 0.09831")
print(f"Target: 0.01727")
print(f"Gap to target: {0.0956 / 0.01727:.1f}x")

print("\n=== WHAT COULD ACHIEVE 0.017? ===")
print("1. Domain-specific knowledge we're missing")
print("2. Pre-trained molecular representations")
print("3. Physics-based constraints on predictions")
print("4. Ensemble of many diverse approaches")
print("5. Better feature engineering for unseen solvents")


=== LEADERBOARD CONTEXT ===
Our LB: 0.0956
Best public kernel: 0.09831
Target: 0.01727
Gap to target: 5.5x

=== WHAT COULD ACHIEVE 0.017? ===
1. Domain-specific knowledge we're missing
2. Pre-trained molecular representations
3. Physics-based constraints on predictions
4. Ensemble of many diverse approaches
5. Better feature engineering for unseen solvents
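One concrete form the physics-based idea could take is Arrhenius-style features. The rate law k = A exp(-Ea / (R T)) makes ln k linear in 1/T, so inverse absolute temperature and log residence time are natural inputs. The sketch below is a generic illustration, not the implementation used by the public kernel mentioned above. The function name and new column names are hypothetical, and it assumes `Temperature` is in degrees Celsius, which the 175-225 range suggests but the data does not confirm here.

```python
import numpy as np
import pandas as pd

def add_arrhenius_features(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X with inverse-temperature and log-time columns added."""
    out = X.copy()
    # Assumption: Temperature is in degrees Celsius; convert to Kelvin.
    temp_kelvin = out["Temperature"] + 273.15
    out["inv_temperature"] = 1.0 / temp_kelvin
    # Residence times are strictly positive in the data summary above.
    out["log_residence_time"] = np.log(out["Residence Time"])
    return out

# Toy rows spanning the ranges printed in the first cell.
toy = pd.DataFrame({
    "Residence Time": [2.0, 15.0],
    "Temperature": [175.0, 225.0],
})
features = add_arrhenius_features(toy)
print(features)
```

These features do not depend on the solvent, so they should transfer to unseen solvents; any solvent effect would still have to come from the descriptor features.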
