# Loop 34 Analysis: Understanding the CV-LB Gap

**Current State:**
- Best CV: 0.008465 (exp_026)
- Best LB: 0.0887 (exp_026)
- Target: 0.0347
- CV-LB Gap: ~10.5x

**Key Question:** Why is there such a large gap? What can we do to reduce it?

**Hypotheses to Test:**
1. Distribution shift between train/test solvents
2. Overfitting to specific solvents in CV
3. The test set has fundamentally different characteristics
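
Hypothesis 2 can be probed with grouped folds, where every row of one solvent is held out together so that no solvent appears on both sides of a split. The sketch below is a minimal illustration under that assumption; the helper name `leave_one_group_out` and the toy labels are hypothetical and are not the notebook's actual CV code.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx), holding out one group per fold."""
    index_by_group = defaultdict(list)
    for i, g in enumerate(groups):
        index_by_group[g].append(i)
    for g, test_idx in index_by_group.items():
        # Every row that does not belong to the held-out group is used for training.
        train_idx = [i for i in range(len(groups)) if groups[i] != g]
        yield g, train_idx, test_idx


# Toy solvent labels (illustrative only, not real dataset rows).
solvents = ["Methanol", "Methanol", "Ethanol", "Acetonitrile", "Ethanol"]
folds = list(leave_one_group_out(solvents))
for name, train_idx, test_idx in folds:
    print(name, train_idx, test_idx)
```

If a score computed on such folds is still far below the LB, leakage between rows of the same solvent is a less likely explanation for the gap.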

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

DATA_PATH = '/home/data'

# Load all data
df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
df_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print('Single Solvent Data:')
print(f'  Shape: {df_single.shape}')
print(f'  Solvents: {df_single["SOLVENT NAME"].nunique()}')
print(f'  Samples per solvent: {df_single.groupby("SOLVENT NAME").size().describe()}')

print('\nFull Data (Mixtures):')
print(f'  Shape: {df_full.shape}')
print(f'  Unique ramps: {df_full.groupby(["SOLVENT A NAME", "SOLVENT B NAME"]).ngroups}')

Single Solvent Data:
  Shape: (656, 13)
  Solvents: 24
  Samples per solvent: count    24.000000
mean     27.333333
std      13.528285
min       5.000000
25%      18.000000
50%      22.000000
75%      37.000000
max      59.000000
dtype: float64

Full Data (Mixtures):
  Shape: (1227, 19)
  Unique ramps: 13


In [2]:
# Analyze target distributions
print('=== Target Distribution Analysis ===')
for col in ['Product 2', 'Product 3', 'SM']:
    print(f'\n{col}:')
    print(f'  Single: mean={df_single[col].mean():.4f}, std={df_single[col].std():.4f}, min={df_single[col].min():.4f}, max={df_single[col].max():.4f}')
    print(f'  Full:   mean={df_full[col].mean():.4f}, std={df_full[col].std():.4f}, min={df_full[col].min():.4f}, max={df_full[col].max():.4f}')

=== Target Distribution Analysis ===

Product 2:
  Single: mean=0.1499, std=0.1431, min=0.0000, max=0.4636
  Full:   mean=0.1646, std=0.1535, min=0.0000, max=0.4636

Product 3:
  Single: mean=0.1234, std=0.1315, min=0.0000, max=0.5338
  Full:   mean=0.1437, std=0.1458, min=0.0000, max=0.5338

SM:
  Single: mean=0.5222, std=0.3602, min=0.0000, max=1.0000
  Full:   mean=0.4952, std=0.3794, min=0.0000, max=1.0833
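
To put the differences between the two datasets on a common scale, the means above can be turned into standardized mean differences. This is a rough sketch that uses only the printed summary statistics and row counts; it assumes the two tables can be treated as separate samples, which may not hold if the full data contains the single-solvent rows.

```python
import math

# (mean, std) per dataset, copied from the output above.
summary = {
    "Product 2": {"single": (0.1499, 0.1431), "full": (0.1646, 0.1535)},
    "Product 3": {"single": (0.1234, 0.1315), "full": (0.1437, 0.1458)},
    "SM": {"single": (0.5222, 0.3602), "full": (0.4952, 0.3794)},
}
n_single, n_full = 656, 1227  # row counts from the shapes printed earlier


def standardized_mean_difference(m1, s1, n1, m2, s2, n2):
    """Difference of means divided by the pooled sample standard deviation."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m2 - m1) / math.sqrt(pooled_var)


effect_sizes = {}
for col, stats_by_set in summary.items():
    m1, s1 = stats_by_set["single"]
    m2, s2 = stats_by_set["full"]
    effect_sizes[col] = standardized_mean_difference(m1, s1, n_single, m2, s2, n_full)
    print(f"{col}: standardized difference (full - single) = {effect_sizes[col]:+.3f}")
```

Values well below 0.2 in magnitude are conventionally read as small shifts, which would suggest the marginal target distributions alone do not explain the CV-LB gap.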


In [3]:
# Analyze CV-LB relationship from submissions
submissions = [
    ('exp_000', 0.0111, 0.0982),
    ('exp_001', 0.0123, 0.1065),
    ('exp_003', 0.0105, 0.0972),
    ('exp_005', 0.0104, 0.0969),
    ('exp_006', 0.0097, 0.0946),
    ('exp_007', 0.0093, 0.0932),
    ('exp_009', 0.0092, 0.0936),
    ('exp_012', 0.0090, 0.0913),
    ('exp_024', 0.0087, 0.0893),
    ('exp_026', 0.0085, 0.0887),
]

cv_scores = [s[1] for s in submissions]
lb_scores = [s[2] for s in submissions]

# Linear fit
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(cv_scores, lb_scores)

print('=== CV-LB Relationship ===')
print(f'Linear fit: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'\nTo hit target LB = 0.0347:')
print(f'  Required CV = (0.0347 - {intercept:.4f}) / {slope:.2f} = {(0.0347 - intercept) / slope:.6f}')

# But this gives negative CV, which is impossible
if (0.0347 - intercept) / slope < 0:
    print('\n⚠️ WARNING: Linear extrapolation suggests target is unreachable!')
    print('This means we need a FUNDAMENTALLY DIFFERENT approach, not incremental CV improvement.')

=== CV-LB Relationship ===
Linear fit: LB = 4.25 * CV + 0.0530
R² = 0.9622

To hit target LB = 0.0347:
  Required CV = (0.0347 - 0.0530) / 4.25 = -0.004308

This means we need a FUNDAMENTALLY DIFFERENT approach, not incremental CV improvement.
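
The conclusion above rests on the fitted intercept, so it is worth checking how uncertain that intercept is. The sketch below recomputes the least-squares fit in plain Python and attaches a 95% confidence interval to the intercept. It assumes independent Gaussian residuals, which is optimistic for ten related submissions, and CV = 0 lies well outside the observed CV range (0.0085 to 0.0123), so treat the interval as indicative only.

```python
import math

# (CV, LB) pairs copied from the submissions list above.
cv = [0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093, 0.0092, 0.0090, 0.0087, 0.0085]
lb = [0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932, 0.0936, 0.0913, 0.0893, 0.0887]

n = len(cv)
mean_cv = sum(cv) / n
mean_lb = sum(lb) / n
sxx = sum((x - mean_cv) ** 2 for x in cv)
sxy = sum((x - mean_cv) * (y - mean_lb) for x, y in zip(cv, lb))

# Ordinary least squares: LB = slope * CV + intercept.
slope = sxy / sxx
intercept = mean_lb - slope * mean_cv

# Residual variance with n - 2 degrees of freedom.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(cv, lb))
resid_var = sse / (n - 2)
intercept_se = math.sqrt(resid_var * (1.0 / n + mean_cv**2 / sxx))

# Two-sided 95% critical value of Student's t with 8 degrees of freedom.
T_CRIT_DF8 = 2.306
ci_low = intercept - T_CRIT_DF8 * intercept_se
ci_high = intercept + T_CRIT_DF8 * intercept_se

print(f"slope={slope:.2f}, intercept={intercept:.4f} (SE {intercept_se:.4f})")
print(f"95% CI for intercept: [{ci_low:.4f}, {ci_high:.4f}]")
```

On these ten points the lower bound of the interval stays above the 0.0347 target, which is consistent with the reading that improving CV alone is unlikely to be enough.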


In [4]:
# What approaches haven't been tried?
print('=== UNEXPLORED APPROACHES ===')
print('''
1. **Quantile Regression** - Predict median instead of mean
   - Could reduce impact of outliers
   - CatBoost supports this natively

2. **Beta Regression** - Model yields as Beta distribution
   - Natural for [0,1] bounded data
   - Handles heteroscedasticity

3. **Mixture of Experts** - Different models for different solvent types
   - Alcohols vs Esters vs Others
   - Could capture different kinetics

4. **Feature Selection via Permutation Importance**
   - Remove features that hurt generalization
   - Focus on most robust features

5. **Adversarial Validation**
   - Identify which features cause train/test shift
   - Remove or transform those features

6. **Ensemble with Different Feature Sets**
   - Model 1: Only Spange (13 features)
   - Model 2: Only DRFP (122 features)
   - Model 3: Only Kinetics (5 features)
   - Blend predictions

7. **Temperature/Time Stratified CV**
   - Ensure CV folds have similar T/t distributions
   - May reduce CV-LB gap

8. **Pseudo-Labeling**
   - Use model predictions on test set as soft labels
   - Retrain with augmented data
''')

=== UNEXPLORED APPROACHES ===

1. **Quantile Regression** - Predict median instead of mean
   - Could reduce impact of outliers
   - CatBoost supports this natively

2. **Beta Regression** - Model yields as Beta distribution
   - Natural for [0,1] bounded data
   - Handles heteroscedasticity

3. **Mixture of Experts** - Different models for different solvent types
   - Alcohols vs Esters vs Others
   - Could capture different kinetics

4. **Feature Selection via Permutation Importance**
   - Remove features that hurt generalization
   - Focus on most robust features

5. **Adversarial Validation**
   - Identify which features cause train/test shift
   - Remove or transform those features

6. **Ensemble with Different Feature Sets**
   - Model 1: Only Spange (13 features)
   - Model 2: Only DRFP (122 features)
   - Model 3: Only Kinetics (5 features)
   - Blend predictions

7. **Temperature/Time Stratified CV**
   - Ensure CV folds have similar T/t distributions
   - May reduce CV-LB gap
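
As a small illustration of item 1, the sketch below shows why a median-type (pinball) loss is less sensitive to outliers than squared error. The yields are toy values and the grid search stands in for a real model fit; neither comes from the notebook.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss; q = 0.5 is half the mean absolute error."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


# Toy yields with one outlier (illustrative values only).
yields = [0.10, 0.12, 0.11, 0.13, 0.90]
candidates = [i / 100 for i in range(101)]  # constant predictions 0.00 ... 1.00

# Best constant prediction under each loss.
best_under_pinball = min(
    candidates, key=lambda c: pinball_loss(yields, [c] * len(yields), 0.5)
)
best_under_squared = min(candidates, key=lambda c: sum((y - c) ** 2 for y in yields))

print(f"best constant under pinball loss (q=0.5): {best_under_pinball}")
print(f"best constant under squared error:        {best_under_squared}")
```

The pinball minimizer sits at the sample median, while the squared-error minimizer is pulled toward the outlier.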

In [5]:
# Compare the solvents in the single-solvent data with those in the
# Spange descriptor lookup (a first step toward finding hard-to-predict solvents)
SPANGE_DF = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv', index_col=0)

print('=== Solvent Analysis ===')
print(f'Solvents in Spange lookup: {len(SPANGE_DF)}')
print(f'Solvents in single data: {df_single["SOLVENT NAME"].nunique()}')
print(f'\nSolvents in single data:')
print(sorted(df_single['SOLVENT NAME'].unique()))
print(f'\nSolvents in Spange lookup:')
print(sorted(SPANGE_DF.index.tolist()))

=== Solvent Analysis ===
Solvents in Spange lookup: 26
Solvents in single data: 24

Solvents in single data:
['1,1,1,3,3,3-Hexafluoropropan-2-ol', '2,2,2-Trifluoroethanol', '2-Methyltetrahydrofuran [2-MeTHF]', 'Acetonitrile', 'Acetonitrile.Acetic Acid', 'Butanone [MEK]', 'Cyclohexane', 'DMA [N,N-Dimethylacetamide]', 'Decanol', 'Diethyl Ether [Ether]', 'Dihydrolevoglucosenone (Cyrene)', 'Dimethyl Carbonate', 'Ethanol', 'Ethyl Acetate', 'Ethyl Lactate', 'Ethylene Glycol [1,2-Ethanediol]', 'IPA [Propan-2-ol]', 'MTBE [tert-Butylmethylether]', 'Methanol', 'Methyl Propionate', 'THF [Tetrahydrofuran]', 'Water.2,2,2-Trifluoroethanol', 'Water.Acetonitrile', 'tert-Butanol [2-Methylpropan-2-ol]']

Solvents in Spange lookup:
['1,1,1,3,3,3-Hexafluoropropan-2-ol', '2,2,2-Trifluoroethanol', '2-Methyltetrahydrofuran [2-MeTHF]', 'Acetic Acid', 'Acetonitrile', 'Acetonitrile.Acetic Acid', 'Butanone [MEK]', 'Cyclohexane', 'DMA [N,N-Dimethylacetamide]', 'Decanol', 'Diethyl Ether [Ether]', 'Dihydrolevogluco
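
The lookup has 26 entries against 24 solvents in the single-solvent data, so a set difference shows which lookup entries never occur as a single solvent. The sketch below uses only the names visible in the output above; the lookup list is cut off there, so only its first 11 complete entries are included and the result is necessarily partial.

```python
# All 24 solvent names printed for the single-solvent data.
single_solvents = {
    '1,1,1,3,3,3-Hexafluoropropan-2-ol', '2,2,2-Trifluoroethanol',
    '2-Methyltetrahydrofuran [2-MeTHF]', 'Acetonitrile', 'Acetonitrile.Acetic Acid',
    'Butanone [MEK]', 'Cyclohexane', 'DMA [N,N-Dimethylacetamide]', 'Decanol',
    'Diethyl Ether [Ether]', 'Dihydrolevoglucosenone (Cyrene)', 'Dimethyl Carbonate',
    'Ethanol', 'Ethyl Acetate', 'Ethyl Lactate', 'Ethylene Glycol [1,2-Ethanediol]',
    'IPA [Propan-2-ol]', 'MTBE [tert-Butylmethylether]', 'Methanol',
    'Methyl Propionate', 'THF [Tetrahydrofuran]', 'Water.2,2,2-Trifluoroethanol',
    'Water.Acetonitrile', 'tert-Butanol [2-Methylpropan-2-ol]',
}

# Only the lookup entries that are fully visible in the truncated output.
lookup_visible = [
    '1,1,1,3,3,3-Hexafluoropropan-2-ol', '2,2,2-Trifluoroethanol',
    '2-Methyltetrahydrofuran [2-MeTHF]', 'Acetic Acid', 'Acetonitrile',
    'Acetonitrile.Acetic Acid', 'Butanone [MEK]', 'Cyclohexane',
    'DMA [N,N-Dimethylacetamide]', 'Decanol', 'Diethyl Ether [Ether]',
]

# Lookup entries with descriptors but no single-solvent measurements.
missing_from_single = sorted(set(lookup_visible) - single_solvents)
print(missing_from_single)
```

With the real frames the same check would be `sorted(set(SPANGE_DF.index) - set(df_single['SOLVENT NAME']))`, which would also reveal the second extra entry hidden by the truncation.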

In [6]:
# Check the variance of targets per solvent
print('=== Per-Solvent Target Variance ===')
solvent_stats = df_single.groupby('SOLVENT NAME')[['Product 2', 'Product 3', 'SM']].agg(['mean', 'std'])
print(solvent_stats.round(4))

=== Per-Solvent Target Variance ===
                                   Product 2         Product 3          \
                                        mean     std      mean     std   
SOLVENT NAME                                                             
1,1,1,3,3,3-Hexafluoropropan-2-ol     0.3197  0.0928    0.2854  0.0717   
2,2,2-Trifluoroethanol                0.1568  0.0736    0.0500  0.0559   
2-Methyltetrahydrofuran [2-MeTHF]     0.1506  0.1459    0.1006  0.0884   
Acetonitrile                          0.1564  0.1476    0.0890  0.0902   
Acetonitrile.Acetic Acid              0.0193  0.0113    0.0206  0.0080   
Butanone [MEK]                        0.0472  0.0184    0.0430  0.0165   
Cyclohexane                           0.0839  0.0893    0.0493  0.0500   
DMA [N,N-Dimethylacetamide]           0.1171  0.1265    0.0976  0.1026   
Decanol                               0.1948  0.1747    0.2080  0.1844   
Diethyl Ether [Ether]                 0.0811  0.0966    0.0631  0.0715   
Di
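
One way to read this table is to compare the spread between solvent means with the typical spread within a solvent. The sketch below does this for Product 2 using only the ten rows visible above (the table is cut off), so the numbers are indicative rather than a statement about all 24 solvents.

```python
import statistics

# Product 2 (mean, std) for the ten solvents visible in the truncated table.
product2 = {
    '1,1,1,3,3,3-Hexafluoropropan-2-ol': (0.3197, 0.0928),
    '2,2,2-Trifluoroethanol': (0.1568, 0.0736),
    '2-Methyltetrahydrofuran [2-MeTHF]': (0.1506, 0.1459),
    'Acetonitrile': (0.1564, 0.1476),
    'Acetonitrile.Acetic Acid': (0.0193, 0.0113),
    'Butanone [MEK]': (0.0472, 0.0184),
    'Cyclohexane': (0.0839, 0.0893),
    'DMA [N,N-Dimethylacetamide]': (0.1171, 0.1265),
    'Decanol': (0.1948, 0.1747),
    'Diethyl Ether [Ether]': (0.0811, 0.0966),
}

means = [m for m, _ in product2.values()]
stds = [s for _, s in product2.values()]

# Spread of the per-solvent means vs. the typical within-solvent spread.
between_solvent_std = statistics.stdev(means)
median_within_std = statistics.median(stds)

highest = max(product2, key=lambda name: product2[name][0])
lowest = min(product2, key=lambda name: product2[name][0])

print(f'between-solvent std of means: {between_solvent_std:.4f}')
print(f'median within-solvent std:    {median_within_std:.4f}')
print(f'highest mean: {highest}; lowest mean: {lowest}')
```

When the between-solvent spread is of the same order as the within-solvent spread, a model that cannot infer a new solvent's overall level from its descriptors will carry a large error on unseen solvents.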

In [7]:
# Key insight: The CV-LB gap is ~10x
# This is NOT normal. Typical gaps are 1.1-1.5x
# 
# Possible causes:
# 1. Test set has solvents NOT in training (out-of-distribution)
# 2. Test set has different T/t ranges
# 3. Test set has different mixture compositions
# 4. Our CV scheme is too optimistic (leakage?)

print('=== CRITICAL INSIGHT ===')
print('''
The CV-LB gap of ~10x is ABNORMAL.

Our CV (LOO/LORO) should be pessimistic, not optimistic.
Yet LB is 10x worse than CV.

This suggests:
1. The test set has FUNDAMENTALLY DIFFERENT characteristics
2. OR our model is overfitting to training solvents in a way CV doesn't catch

The linear fit shows:
- LB = 4.25 * CV + 0.0530
- Intercept = 0.0530 > Target = 0.0347

This means even with PERFECT CV (0.0), we'd still get LB = 0.0530 > 0.0347!

IMPLICATION: We need to REDUCE THE INTERCEPT, not just improve CV.
The intercept represents the "irreducible" gap - likely due to distribution shift.
''')

=== CRITICAL INSIGHT ===

The CV-LB gap of ~10x is ABNORMAL.

Our CV (LOO/LORO) should be pessimistic, not optimistic.
Yet LB is 10x worse than CV.

This suggests:
1. The test set has FUNDAMENTALLY DIFFERENT characteristics
2. OR our model is overfitting to training solvents in a way CV doesn't catch

The linear fit shows:
- LB = 4.25 * CV + 0.0530
- Intercept = 0.0530 > Target = 0.0347

This means even with PERFECT CV (0.0), we'd still get LB = 0.0530 > 0.0347!

IMPLICATION: We need to REDUCE THE INTERCEPT, not just improve CV.
The intercept represents the "irreducible" gap - likely due to distribution shift.
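
A positive intercept also means the "~10x" figure is not a constant: under LB = a * CV + b the ratio LB / CV equals a + b / CV, which grows as CV shrinks. The sketch below compares the observed ratio for each submission with the ratio implied by the fit from cell 3 (a = 4.25, b = 0.0530, taken from that cell's printed output).

```python
# (name, CV, LB) copied from the submissions list in cell 3.
submissions = [
    ('exp_000', 0.0111, 0.0982), ('exp_001', 0.0123, 0.1065),
    ('exp_003', 0.0105, 0.0972), ('exp_005', 0.0104, 0.0969),
    ('exp_006', 0.0097, 0.0946), ('exp_007', 0.0093, 0.0932),
    ('exp_009', 0.0092, 0.0936), ('exp_012', 0.0090, 0.0913),
    ('exp_024', 0.0087, 0.0893), ('exp_026', 0.0085, 0.0887),
]

SLOPE, INTERCEPT = 4.25, 0.0530  # rounded fit values printed in cell 3

observed_ratio = {name: lb / cv for name, cv, lb in submissions}
implied_ratio = {name: SLOPE + INTERCEPT / cv for name, cv, _ in submissions}

for name, cv, lb in submissions:
    print(f'{name}: CV={cv:.4f}  LB/CV observed={observed_ratio[name]:.2f}  '
          f'implied={implied_ratio[name]:.2f}')
```

The ratio rises from the worst-CV submission to the best-CV one, so a widening gap factor is an expected consequence of the intercept rather than a separate problem.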



In [8]:
# What could reduce the intercept?
print('=== STRATEGIES TO REDUCE CV-LB INTERCEPT ===')
print('''
1. **Simpler Models** - Less capacity = less overfitting to train distribution
   - Already tried: [32,16] MLP is best
   - Could try: Linear models with regularization

2. **Feature Engineering for Robustness**
   - Use only features that are stable across distributions
   - Remove features that are highly correlated with specific solvents

3. **Domain Adaptation**
   - If we knew which solvents are in test, we could adapt
   - Without that, use techniques like CORAL, MMD

4. **Ensemble Diversity**
   - Different models may have different biases
   - Averaging could cancel out some bias

5. **Calibration**
   - Post-hoc calibration using validation set
   - Isotonic regression or a linear rescaling of predictions (Platt scaling targets classifiers)

6. **Conservative Predictions**
   - Shrink predictions toward mean
   - Reduces variance at cost of bias
''')

=== STRATEGIES TO REDUCE CV-LB INTERCEPT ===

1. **Simpler Models** - Less capacity = less overfitting to train distribution
   - Already tried: [32,16] MLP is best
   - Could try: Linear models with regularization

2. **Feature Engineering for Robustness**
   - Use only features that are stable across distributions
   - Remove features that are highly correlated with specific solvents

3. **Domain Adaptation**
   - If we knew which solvents are in test, we could adapt
   - Without that, use techniques like CORAL, MMD

4. **Ensemble Diversity**
   - Different models may have different biases
   - Averaging could cancel out some bias

5. **Calibration**
   - Post-hoc calibration using validation set
   - Isotonic regression or a linear rescaling of predictions (Platt scaling targets classifiers)

6. **Conservative Predictions**
   - Shrink predictions toward mean
   - Reduces variance at cost of bias
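
Strategy 6 can be sketched in a few lines. The predictions, the center and the shrinkage factor `alpha` below are hypothetical; in practice the center might be the training mean of the target and `alpha` would be tuned on held-out solvents.

```python
import statistics


def shrink_toward_mean(preds, center, alpha):
    """Pull each prediction toward `center`; alpha=1 keeps it, alpha=0 returns center."""
    return [center + alpha * (p - center) for p in preds]


# Hypothetical raw predictions and center (roughly the Product 2 mean in the single data).
raw_preds = [0.05, 0.20, 0.45]
center = 0.15
alpha = 0.8

shrunk_preds = shrink_toward_mean(raw_preds, center, alpha)

# Shrinking by alpha scales the variance of the predictions by alpha squared.
variance_ratio = statistics.pvariance(shrunk_preds) / statistics.pvariance(raw_preds)

print([round(p, 4) for p in shrunk_preds])
print(f'variance ratio: {variance_ratio:.2f}')
```

The trade-off is the one stated above: extreme predictions are damped, which helps when they are unreliable on unseen solvents and hurts when they are correct.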



In [9]:
# Compare our best LB with the top leaderboard score (0.01727).
# Note: this is the top score, not the 0.0347 target used earlier.
top_lb = 0.01727
best_lb = 0.0887

print('=== GAP TO TOP SOLUTION ===')
print(f'Our best LB: {best_lb:.4f}')
print(f'Top LB: {top_lb:.4f}')
print(f'Gap: {best_lb - top_lb:.4f} ({(best_lb - top_lb) / top_lb * 100:.1f}% worse)')
print(f'\nTo match top LB, we need to reduce error by {(best_lb - top_lb) / best_lb * 100:.1f}%')
print(f'\nThis is a MASSIVE gap. The top solution must have:')
print('  1. A fundamentally different approach')
print('  2. Domain knowledge we are missing')
print('  3. Better handling of distribution shift')
print('  4. Or access to information we don\'t have')

=== GAP TO TOP SOLUTION ===
Our best LB: 0.0887
Top LB: 0.0173
Gap: 0.0714 (413.6% worse)

To match top LB, we need to reduce error by 80.5%

This is a MASSIVE gap. The top solution must have:
  1. A fundamentally different approach
  2. Domain knowledge we are missing
  3. Better handling of distribution shift
  4. Or access to information we don't have
