# Loop 86 Analysis: Critical Assessment

## Key Facts:
1. **86 experiments completed**; the 12 submissions used in the fit below all fall on the same CV-LB line (exp_073 is excluded as an outlier)
2. **Best CV**: 0.008092 (exp_049 CatBoost+XGBoost)
3. **Best LB**: 0.0877 (exp_030)
4. **Target**: 0.0347
5. **CV-LB relationship**: LB = 4.31 * CV + 0.0525 (R² = 0.95)
6. **CRITICAL**: Intercept (0.0525) > Target (0.0347)

## This Loop's Findings:
- Pseudo-labeling: Made things WORSE (CV=0.008853 vs baseline 0.008092)
- Self-training: 9.4% worse than baseline
- Conservative predictions: Also made things worse

## The Fundamental Problem:
The CV-LB intercept (0.0525) is HIGHER than the target (0.0347). If the fitted line keeps holding below the observed CV range (0.0083 to 0.0123), this means:
- Even with PERFECT CV=0, expected LB would be 0.0525
- Required CV to hit target = (0.0347 - 0.0525) / 4.31 = -0.0041 (IMPOSSIBLE!)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# CV-LB data from all submissions
submissions = [
    ('exp_000', 0.0111, 0.0982),
    ('exp_001', 0.0123, 0.1065),
    ('exp_003', 0.0105, 0.0972),
    ('exp_005', 0.0104, 0.0969),
    ('exp_006', 0.0097, 0.0946),
    ('exp_007', 0.0093, 0.0932),
    ('exp_009', 0.0092, 0.0936),
    ('exp_012', 0.0090, 0.0913),
    ('exp_024', 0.0087, 0.0893),
    ('exp_026', 0.0085, 0.0887),
    ('exp_030', 0.0083, 0.0877),
    ('exp_035', 0.0098, 0.0970),
    # exp_073 is an outlier (similarity weighting BACKFIRED)
    ('exp_073', 0.0084, 0.1451),
]

# Exclude exp_073 (outlier)
valid_submissions = [s for s in submissions if s[0] != 'exp_073']

cv_scores = [s[1] for s in valid_submissions]
lb_scores = [s[2] for s in valid_submissions]

# Fit linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(cv_scores, lb_scores)

print(f'CV-LB Relationship: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R-squared: {r_value**2:.4f}')
print(f'Intercept: {intercept:.4f}')
print(f'Target LB: 0.0347')
print(f'\nCRITICAL: Intercept ({intercept:.4f}) > Target (0.0347)')
print(f'Required CV for target: ({0.0347} - {intercept:.4f}) / {slope:.2f} = {(0.0347 - intercept) / slope:.4f}')
print(f'\nThis is IMPOSSIBLE - CV cannot be negative!')

CV-LB Relationship: LB = 4.31 * CV + 0.0525
R-squared: 0.9505
Intercept: 0.0525
Target LB: 0.0347

CRITICAL: Intercept (0.0525) > Target (0.0347)
Required CV for target: (0.0347 - 0.0525) / 4.31 = -0.0041

This is IMPOSSIBLE - CV cannot be negative!
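
The conclusion above rests on a point estimate of the intercept from 12 submissions. A minimal sketch of how to attach a 95% confidence interval to that intercept follows. It assumes SciPy 1.6 or later, where the `linregress` result object exposes `intercept_stderr`; the names `fit`, `t_crit`, `ci_low` and `ci_high` are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

# The 12 (CV, LB) pairs used in the fit above; exp_073 is excluded as an outlier
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])
target = 0.0347

# Keep the result object (no tuple unpacking) so intercept_stderr is available
fit = stats.linregress(cv, lb)

# Two-sided 95% t interval with n - 2 degrees of freedom
t_crit = stats.t.ppf(0.975, df=len(cv) - 2)
ci_low = fit.intercept - t_crit * fit.intercept_stderr
ci_high = fit.intercept + t_crit * fit.intercept_stderr

print(f'Intercept: {fit.intercept:.4f} (95% CI {ci_low:.4f} to {ci_high:.4f})')
print(f'Target below the whole interval: {target < ci_low}')
```

If the target lies below the whole interval, the gap is not explained by fit noise alone. The interval still describes an extrapolation to CV=0, well outside the observed CV range.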


In [2]:
# Analyze what approaches have been tried
approaches_tried = [
    ('MLP variants', '50+ experiments', 'All on same line'),
    ('LightGBM', 'Multiple configs', 'All on same line'),
    ('XGBoost', 'Multiple configs', 'All on same line'),
    ('CatBoost', 'Multiple configs', 'All on same line'),
    ('Gaussian Processes', 'Multiple configs', 'All on same line'),
    ('Ridge Regression', 'Multiple configs', 'All on same line'),
    ('GNN from scratch', 'CV=0.024', '3x worse than tabular'),
    ('ChemBERTa embeddings', 'CV=0.015', '2x worse than tabular'),
    ('ChemProp features', 'CV=0.012', '46% worse than tabular'),
    ('Pseudo-labeling', 'CV=0.0089', '9.4% worse'),
    ('Similarity weighting', 'LB=0.145', 'BACKFIRED'),
    ('Yield normalization', 'No effect', 'No improvement'),
    ('Conservative predictions', 'Made worse', 'No improvement'),
]

print('Approaches Tried:')
print('-' * 70)
for approach, result, notes in approaches_tried:
    print(f'{approach:30s} | {result:20s} | {notes}')
print('-' * 70)

Approaches Tried:
----------------------------------------------------------------------
MLP variants                   | 50+ experiments      | All on same line
LightGBM                       | Multiple configs     | All on same line
XGBoost                        | Multiple configs     | All on same line
CatBoost                       | Multiple configs     | All on same line
Gaussian Processes             | Multiple configs     | All on same line
Ridge Regression               | Multiple configs     | All on same line
GNN from scratch               | CV=0.024             | 3x worse than tabular
ChemBERTa embeddings           | CV=0.015             | 2x worse than tabular
ChemProp features              | CV=0.012             | 46% worse than tabular
Pseudo-labeling                | CV=0.0089            | 9.4% worse
Similarity weighting           | LB=0.145             | BACKFIRED
Yield normalization            | No effect            | No improvement
Conservative predictions       | Made worse           | No improvement
----------------------------------------------------------------------

## What Could Break the CV-LB Line?

The likely reason ALL approaches fall on the same CV-LB line is that they all:
1. Use the same features (Spange descriptors, DRFP, Arrhenius)
2. Use the same validation scheme (Leave-One-Out)
3. Make predictions in the same way (point predictions)

### Approaches That MIGHT Change the Relationship:

1. **Transductive Learning** - Use test set structure to inform predictions
2. **Physics-Based Constraints** - Enforce mass balance, monotonicity
3. **Calibration** - Isotonic regression, temperature scaling
4. **Different Prediction Strategy** - Predict ratios instead of absolute values
5. **Ensemble of Fundamentally Different Models** - Not just different hyperparameters

In [3]:
# Let's analyze what the benchmark paper might have done differently
# The benchmark achieved MSE 0.0039 (22x better than our best LB)

benchmark_mse = 0.0039
best_lb = 0.0877
target = 0.0347

print('Benchmark Analysis:')
print(f'Benchmark MSE: {benchmark_mse}')
print(f'Our Best LB: {best_lb}')
print(f'Target: {target}')
print(f'\nGap to benchmark: {best_lb / benchmark_mse:.1f}x worse')
print(f'Gap to target: {best_lb / target:.1f}x worse')

# If benchmark followed our CV-LB line, what would their CV be?
implied_cv = (benchmark_mse - intercept) / slope
print(f'\nIf benchmark followed our CV-LB line:')
print(f'  Implied CV: {implied_cv:.4f}')
print(f'  This is NEGATIVE, confirming they have a DIFFERENT CV-LB relationship')

# What intercept would they need?
print(f'\nTo achieve benchmark MSE 0.0039 with CV=0.008:')
required_intercept = benchmark_mse - slope * 0.008
print(f'  Required intercept: {required_intercept:.4f}')
print(f'  This is NEGATIVE, confirming they have a fundamentally different approach')

Benchmark Analysis:
Benchmark MSE: 0.0039
Our Best LB: 0.0877
Target: 0.0347

Gap to benchmark: 22.5x worse
Gap to target: 2.5x worse

If benchmark followed our CV-LB line:
  Implied CV: -0.0113
  This is NEGATIVE, confirming they have a DIFFERENT CV-LB relationship

To achieve benchmark MSE 0.0039 with CV=0.008:
  Required intercept: -0.0306
  This is NEGATIVE, confirming they have a fundamentally different approach


## Key Insight: The Benchmark Paper's Success

The benchmark paper achieved MSE 0.0039 using:
1. **Pre-trained GNN with Graph Attention Networks**
2. **DRFP features**
3. **Learned mixture-aware encodings**
4. **Pre-training on related reaction data**

The key difference is likely **pre-training on related data**. This allows the model to learn general chemistry knowledge that transfers to unseen solvents.

## What We Haven't Tried:

1. **Ratio-based predictions** - Predict Product2/SM and Product3/SM ratios instead of absolute values
2. **Hierarchical predictions** - First predict total conversion, then product distribution
3. **Physics-informed loss** - Penalize predictions that violate mass balance
4. **Adversarial validation** - Identify which features cause distribution shift
5. **Kernel-based similarity** - Use Tanimoto similarity for predictions
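
As an illustration of item 5, here is a minimal sketch of a Tanimoto kernel on binary fingerprints and a similarity-weighted average of training targets. The toy fingerprints, the targets and the function names `tanimoto_kernel` and `kernel_weighted_prediction` are hypothetical; real inputs would be binarized fingerprints such as DRFP bits. The handling of all-zero fingerprints is an assumption.

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Pairwise Tanimoto similarity between rows of two binary fingerprint matrices."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    inter = A @ B.T  # number of shared on-bits
    union = A.sum(axis=1)[:, None] + B.sum(axis=1)[None, :] - inter
    # Assumption: a pair of all-zero fingerprints gets similarity 0
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

def kernel_weighted_prediction(K, y_train):
    """Average the training targets, weighted by similarity to each test row."""
    y_train = np.asarray(y_train, dtype=float)
    totals = K.sum(axis=1, keepdims=True)
    safe_totals = np.where(totals > 0, totals, 1.0)
    # Fall back to uniform weights when a test row matches no training row
    weights = np.where(totals > 0, K / safe_totals, 1.0 / K.shape[1])
    return weights @ y_train

# Hypothetical toy data: two training solvents, three test solvents
train_fp = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
test_fp = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0]])
y_train = np.array([[0.2], [0.8]])

K = tanimoto_kernel(test_fp, train_fp)
preds = kernel_weighted_prediction(K, y_train)
print(K)
print(preds)
```

The same kernel matrix could instead be passed to a kernel method such as a Gaussian Process, which may behave differently from the plain weighted average that backfired in exp_073.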

In [4]:
# Let's think about what could REDUCE the intercept
# The intercept represents structural extrapolation error

print('Strategies to Reduce Intercept:')
print('=' * 70)
print()
print('1. RATIO-BASED PREDICTIONS')
print('   Instead of predicting SM, P2, P3 directly,')
print('   predict ratios: P2/SM, P3/SM, then derive SM from mass balance')
print('   Why: Ratios might be more stable across solvents')
print()
print('2. HIERARCHICAL PREDICTIONS')
print('   Step 1: Predict total conversion (1 - SM)')
print('   Step 2: Predict product distribution (P2/(P2+P3), P3/(P2+P3))')
print('   Why: Separates two different chemical phenomena')
print()
print('3. PHYSICS-INFORMED CONSTRAINTS')
print('   Enforce: SM + P2 + P3 = 1 (mass balance)')
print('   Enforce: Monotonicity with time')
print('   Why: Domain knowledge that holds for ALL solvents')
print()
print('4. ADVERSARIAL VALIDATION')
print('   Train classifier to distinguish train vs test solvents')
print('   Identify which features cause distribution shift')
print('   Why: Can guide feature selection/weighting')
print()
print('5. KERNEL-BASED SIMILARITY')
print('   Use Tanimoto similarity between test and train solvents')
print('   Weight predictions by similarity')
print('   Why: More principled than simple distance-based weighting')

Strategies to Reduce Intercept:
======================================================================

1. RATIO-BASED PREDICTIONS
   Instead of predicting SM, P2, P3 directly,
   predict ratios: P2/SM, P3/SM, then derive SM from mass balance
   Why: Ratios might be more stable across solvents

2. HIERARCHICAL PREDICTIONS
   Step 1: Predict total conversion (1 - SM)
   Step 2: Predict product distribution (P2/(P2+P3), P3/(P2+P3))
   Why: Separates two different chemical phenomena

3. PHYSICS-INFORMED CONSTRAINTS
   Enforce: SM + P2 + P3 = 1 (mass balance)
   Enforce: Monotonicity with time
   Why: Domain knowledge that holds for ALL solvents

4. ADVERSARIAL VALIDATION
   Train classifier to distinguish train vs test solvents
   Identify which features cause distribution shift
   Why: Can guide feature selection/weighting

5. KERNEL-BASED SIMILARITY
   Use Tanimoto similarity between test and train solvents
   Weight predictions by similarity
   Why: More principled than simple distance-based weighting


In [6]:
# Let's check what the actual data looks like
import sys
sys.path.append('/home/code/data')

single_df = pd.read_csv('/home/code/data/catechol_single_solvent_yields.csv')
full_df = pd.read_csv('/home/code/data/catechol_full_data_yields.csv')

print('Single Solvent Data:')
print(f'  Shape: {single_df.shape}')
print(f'  Solvents: {single_df["SOLVENT NAME"].nunique()}')
print(f'  Samples per solvent: {len(single_df) / single_df["SOLVENT NAME"].nunique():.1f}')

print('\nFull Data:')
print(f'  Shape: {full_df.shape}')
print(f'  Unique ramps: {full_df[["SOLVENT A NAME", "SOLVENT B NAME"]].drop_duplicates().shape[0]}')

# Check target distributions
print('\nTarget Statistics (Single Solvent):')
for col in ['SM', 'Product 2', 'Product 3']:
    print(f'  {col}: mean={single_df[col].mean():.4f}, std={single_df[col].std():.4f}, min={single_df[col].min():.4f}, max={single_df[col].max():.4f}')

# Check mass balance
single_df['sum'] = single_df['SM'] + single_df['Product 2'] + single_df['Product 3']
print(f'\nMass Balance Check (SM + P2 + P3):')
print(f'  Mean: {single_df["sum"].mean():.6f}')
print(f'  Std: {single_df["sum"].std():.6f}')
print(f'  Min: {single_df["sum"].min():.6f}')
print(f'  Max: {single_df["sum"].max():.6f}')

Single Solvent Data:
  Shape: (656, 13)
  Solvents: 24
  Samples per solvent: 27.3

Full Data:
  Shape: (1227, 19)
  Unique ramps: 13

Target Statistics (Single Solvent):
  SM: mean=0.5222, std=0.3602, min=0.0000, max=1.0000
  Product 2: mean=0.1499, std=0.1431, min=0.0000, max=0.4636
  Product 3: mean=0.1234, std=0.1315, min=0.0000, max=0.5338

Mass Balance Check (SM + P2 + P3):
  Mean: 0.795504
  Std: 0.194306
  Min: 0.028752
  Max: 1.000000


## Recommended Next Steps

### Priority 1: Ratio-Based Predictions
Instead of predicting SM, P2, P3 directly, predict:
- Conversion = 1 - SM
- Selectivity_P2 = P2 / (P2 + P3)

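Then derive (valid only if SM + P2 + P3 = 1):
- SM = 1 - Conversion
- P2 = Conversion * Selectivity_P2
- P3 = Conversion * (1 - Selectivity_P2)

**Why this might work:**
- Conversion and selectivity are more fundamental chemical quantities
- They might generalize better to unseen solvents
- Mass balance is satisfied by construction, but the check above shows SM + P2 + P3 averages 0.7955 rather than 1, so P2 and P3 would be overpredicted unless the fraction of converted material recovered as P2 + P3 is predicted as a further target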
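
The reparameterization above can be sketched as a pair of round-trip transforms. Because the mass balance check earlier in this notebook shows SM + P2 + P3 averaging 0.7955, this sketch adds a third quantity, `product_fraction` (the share of converted material recovered as P2 + P3). That quantity, the function names and the `EPS` guard are assumptions for illustration, not part of the original plan.

```python
import numpy as np

EPS = 1e-10  # small guard against division by zero

def to_ratios(sm, p2, p3):
    """Map yields to (conversion, selectivity_p2, product_fraction)."""
    sm, p2, p3 = (np.asarray(v, dtype=float) for v in (sm, p2, p3))
    conversion = 1.0 - sm
    products = p2 + p3
    selectivity_p2 = p2 / (products + EPS)
    # Share of converted starting material that shows up as P2 or P3
    product_fraction = products / (conversion + EPS)
    return conversion, selectivity_p2, product_fraction

def from_ratios(conversion, selectivity_p2, product_fraction):
    """Invert to_ratios and recover (SM, P2, P3)."""
    sm = 1.0 - conversion
    products = conversion * product_fraction
    p2 = products * selectivity_p2
    p3 = products * (1.0 - selectivity_p2)
    return sm, p2, p3

# Hypothetical rows: open mass balance, closed mass balance, no reaction
sm = np.array([0.4, 0.5, 1.0])
p2 = np.array([0.3, 0.3, 0.0])
p3 = np.array([0.1, 0.2, 0.0])

ratios = to_ratios(sm, p2, p3)
sm_back, p2_back, p3_back = from_ratios(*ratios)
print(ratios)
print(sm_back, p2_back, p3_back)
```

Fixing `product_fraction` at 1 recovers the three derivation formulas listed above, so the simpler scheme is a special case of this one.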

### Priority 2: Physics-Informed Loss
Add penalty terms to the loss function:
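- Mass balance penalty: max(0, SM + P2 + P3 - 1)^2, penalizing only sums above 1, because the observed sum ranges from 0.0288 to 1.0000 (mean 0.7955) and forcing it to 1 would contradict the data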
- Monotonicity penalty: penalize if yield decreases with time
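
The two penalty terms can be sketched in NumPy as below. The function names, the `one_sided` switch and the toy inputs are hypothetical. The one-sided variant is an assumption motivated by the mass balance check earlier in this notebook, where the observed sum never exceeds 1 but is often well below it. A training loop would add these terms, scaled by tunable weights, to the usual MSE.

```python
import numpy as np

def mass_balance_penalty(pred, one_sided=True):
    """Mean squared deviation of SM + P2 + P3 from 1.

    pred has shape (n_samples, 3). With one_sided=True only sums above 1
    are penalized; with one_sided=False any deviation from 1 is penalized.
    """
    excess = np.asarray(pred, dtype=float).sum(axis=1) - 1.0
    if one_sided:
        excess = np.maximum(excess, 0.0)
    return float(np.mean(excess ** 2))

def monotonicity_penalty(values, times):
    """Mean squared size of the decreases in values along increasing time."""
    order = np.argsort(times)
    diffs = np.diff(np.asarray(values, dtype=float)[order])
    if diffs.size == 0:
        return 0.0
    decreases = np.minimum(diffs, 0.0)  # zero wherever the series does not drop
    return float(np.mean(decreases ** 2))

# Hypothetical predictions: the first row overshoots, the second sums to 0.4
pred = np.array([[0.5, 0.3, 0.3], [0.2, 0.1, 0.1]])
mb_one_sided = mass_balance_penalty(pred)
mb_two_sided = mass_balance_penalty(pred, one_sided=False)

# Hypothetical product yield that dips at the last time point
mono = monotonicity_penalty([0.1, 0.3, 0.2], times=[0.0, 1.0, 2.0])
print(mb_one_sided, mb_two_sided, mono)
```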

### Priority 3: Adversarial Validation
Train a classifier to distinguish train vs test solvents:
- If classifier can distinguish, there's distribution shift
- Features with high importance in classifier are causing the shift
- Can guide feature selection or weighting
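
A minimal sketch of the adversarial validation steps above, assuming scikit-learn is available. The feature matrices are synthetic stand-ins with a deliberate shift in one column; the function name `adversarial_validation` and every parameter value are illustrative choices. For this dataset the rows of one solvent share descriptors, so a group-aware split would be a sensible refinement.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_validation(X_train, X_test, seed=0):
    """Return (AUC, feature importances) for a train-vs-test classifier."""
    X = np.vstack([X_train, X_test])
    y = np.r_[np.zeros(len(X_train), dtype=int), np.ones(len(X_test), dtype=int)]
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    # Out-of-fold probabilities, so the AUC is not inflated by memorization
    proba = cross_val_predict(clf, X, y, cv=5, method='predict_proba')[:, 1]
    auc = roc_auc_score(y, proba)
    clf.fit(X, y)  # refit on everything to read feature importances
    return auc, clf.feature_importances_

# Synthetic stand-in features: column 2 is shifted in the "test" set
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
X_test = rng.normal(size=(100, 4))
X_test[:, 2] += 2.0

auc, importances = adversarial_validation(X_train, X_test)
print(f'AUC: {auc:.3f}')
print('Importances:', np.round(importances, 3))
```

An AUC near 0.5 suggests no detectable shift. An AUC well above 0.5 points to the features with the largest importances as the source of the shift.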

In [7]:
# Interesting! Mass balance is NOT 1.0 - there are other products/losses
# Let's analyze this more

print('Mass Balance Analysis:')
print('=' * 70)
print(f'Mean sum (SM + P2 + P3): {single_df["sum"].mean():.4f}')
print(f'This means ~{(1 - single_df["sum"].mean()) * 100:.1f}% of material is unaccounted for')
print()

# Check if there's a pattern with conversion
single_df['conversion'] = 1 - single_df['SM']
print('Correlation between conversion and mass balance:')
print(f'  Correlation: {single_df["conversion"].corr(single_df["sum"]):.4f}')
print()

# Check the distribution of mass balance
print('Mass Balance Distribution:')
print(single_df['sum'].describe())
print()

# Check if mass balance varies by solvent
print('\nMass Balance by Solvent (top 5 highest and lowest):')
solvent_mass_balance = single_df.groupby('SOLVENT NAME')['sum'].mean().sort_values()
print('Lowest mass balance:')
print(solvent_mass_balance.head())
print('\nHighest mass balance:')
print(solvent_mass_balance.tail())

Mass Balance Analysis:
======================================================================
Mean sum (SM + P2 + P3): 0.7955
This means ~20.4% of material is unaccounted for

Correlation between conversion and mass balance:
  Correlation: -0.6784

Mass Balance Distribution:
count    656.000000
mean       0.795504
std        0.194306
min        0.028752
25%        0.708417
50%        0.849648
75%        0.927955
max        1.000000
Name: sum, dtype: float64


Mass Balance by Solvent (top 5 highest and lowest):
Lowest mass balance:
SOLVENT NAME
2,2,2-Trifluoroethanol         0.486005
Acetonitrile.Acetic Acid       0.518022
Cyclohexane                    0.678907
Methanol                       0.743034
DMA [N,N-Dimethylacetamide]    0.760036
Name: sum, dtype: float64

Highest mass balance:
SOLVENT NAME
Diethyl Ether [Ether]               0.948165
MTBE [tert-Butylmethylether]        0.959072
Dimethyl Carbonate                  0.961288
Ethylene Glycol [1,2-Ethanediol]    0.979044
IPA [Propan-2-ol]                   0.993506
Name: sum, dtype: float64


In [None]:
# Key insight: Mass balance varies by solvent
# This means we should NOT enforce mass balance = 1.0 as a constraint
# Instead, we could predict the "other products" fraction as a 4th target

# Let's check if predicting conversion and selectivity might be more stable
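single_df['conversion'] = 1 - single_df['SM']
product_total = single_df['Product 2'] + single_df['Product 3']
# Selectivity is undefined when no product has formed, so leave those rows as NaN
# instead of forcing them to 0 with an epsilon (pandas skips NaN in mean/std/min/max)
single_df['selectivity_P2'] = (single_df['Product 2'] / product_total).where(product_total > 0)
single_df['selectivity_P3'] = (single_df['Product 3'] / product_total).where(product_total > 0)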

print('Conversion Statistics:')
print(f'  Mean: {single_df["conversion"].mean():.4f}')
print(f'  Std: {single_df["conversion"].std():.4f}')
print(f'  Min: {single_df["conversion"].min():.4f}')
print(f'  Max: {single_df["conversion"].max():.4f}')

print('\nSelectivity P2 Statistics:')
print(f'  Mean: {single_df["selectivity_P2"].mean():.4f}')
print(f'  Std: {single_df["selectivity_P2"].std():.4f}')
print(f'  Min: {single_df["selectivity_P2"].min():.4f}')
print(f'  Max: {single_df["selectivity_P2"].max():.4f}')

# Check variance by solvent
print('\nConversion Variance by Solvent:')
conversion_by_solvent = single_df.groupby('SOLVENT NAME')['conversion'].agg(['mean', 'std'])
print(f'  Mean of means: {conversion_by_solvent["mean"].mean():.4f}')
print(f'  Std of means: {conversion_by_solvent["mean"].std():.4f}')

print('\nSelectivity P2 Variance by Solvent:')
selectivity_by_solvent = single_df.groupby('SOLVENT NAME')['selectivity_P2'].agg(['mean', 'std'])
print(f'  Mean of means: {selectivity_by_solvent["mean"].mean():.4f}')
print(f'  Std of means: {selectivity_by_solvent["mean"].std():.4f}')
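
The comment at the top of this cell suggests predicting the unaccounted-for material as a 4th target. A minimal sketch of that idea follows. The column name `Other`, the helper names and the toy frame are hypothetical. The softmax step is one assumed way to make four model outputs non-negative and sum to 1; it is not something the notebook has tested.

```python
import numpy as np
import pandas as pd

TARGETS = ['SM', 'Product 2', 'Product 3']

def add_other_fraction(df, targets=TARGETS, name='Other'):
    """Add a 4th target holding the material not accounted for by the other three."""
    out = df.copy()
    # Clip at 0 in case measurement noise pushes the sum slightly above 1
    out[name] = (1.0 - out[targets].sum(axis=1)).clip(lower=0.0)
    return out

def softmax_rows(logits):
    """Map unconstrained 4-column model outputs to fractions that sum to 1."""
    z = logits - logits.max(axis=1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical two-row frame using the notebook's column names
toy = pd.DataFrame({'SM': [0.5, 1.0], 'Product 2': [0.2, 0.0], 'Product 3': [0.1, 0.0]})
toy4 = add_other_fraction(toy)
fractions = softmax_rows(np.array([[2.0, 1.0, 0.5, 0.0], [0.0, 0.0, 0.0, 0.0]]))
print(toy4)
print(fractions.sum(axis=1))
```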