# Loop 39 Analysis: Critical Path Forward

## Current Situation
- **Best CV**: 0.008194 (exp_032/035/036)
- **Best LB**: 0.0877 (exp_030)
- **Target**: 0.0347
- **Gap**: 2.53x
- **Submissions remaining**: 4

## Critical Issue
The linear CV-LB fit has an intercept (0.0525) that exceeds the target (0.0347).
Extrapolating that fit, even CV = 0 would give LB ≈ 0.0525, which is still worse than the target.
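
As a quick worked check, the fitted line can be inverted to ask what CV the target would require. The sketch below plugs in the rounded coefficients reported by the linear fit in this notebook (slope 4.31, intercept 0.0525); the variable names are illustrative only.

```python
# Invert LB = slope * CV + intercept to find the CV needed for a given LB.
slope = 4.31        # rounded slope from the CV-LB fit
intercept = 0.0525  # rounded intercept from the CV-LB fit
target_lb = 0.0347

required_cv = (target_lb - intercept) / slope
print(f'CV required to reach LB={target_lb}: {required_cv:.5f}')

# A negative value means no achievable CV reaches the target along this line.
print(f'Reachable along the current fit: {required_cv >= 0}')
```

The required CV comes out negative, so under this fit the target is out of reach however much CV improves; only a change in the relationship itself could close the gap.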

## Key Question
How can we change the CV-LB relationship to reach the target?

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Submission history
submissions = [
    ('exp_000', 0.0111, 0.0982),
    ('exp_001', 0.0123, 0.1065),
    ('exp_003', 0.0105, 0.0972),
    ('exp_005', 0.0104, 0.0969),
    ('exp_006', 0.0097, 0.0946),
    ('exp_007', 0.0093, 0.0932),
    ('exp_009', 0.0092, 0.0936),
    ('exp_012', 0.0090, 0.0913),
    ('exp_024', 0.0087, 0.0893),
    ('exp_026', 0.0085, 0.0887),
    ('exp_030', 0.0083, 0.0877),
    ('exp_035', 0.0098, 0.0970),
]

cv_scores = np.array([s[1] for s in submissions])
lb_scores = np.array([s[2] for s in submissions])

# Linear fit
slope, intercept, r_value, p_value, std_err = stats.linregress(cv_scores, lb_scores)

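TARGET = 0.0347

print(f'CV-LB Relationship: LB = {slope:.2f}*CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'\nIntercept: {intercept:.4f}')
print(f'Target: {TARGET}')
# Ratio of the fitted intercept to the target (distinct from the 2.53x best-LB gap)
print(f'Gap: {intercept / TARGET:.2f}x')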

CV-LB Relationship: LB = 4.31*CV + 0.0525
R² = 0.9505

Intercept: 0.0525
Target: 0.0347
Gap: 1.51x
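
Because the intercept drives the whole argument, it may be worth checking how tightly twelve points pin it down. The following is a minimal sketch, assuming ordinary least squares with independent, roughly normal residuals; it recomputes the fit from the submission history and derives a 95% interval for the intercept by hand. The names `TARGET`, `ci_low` and `ci_high` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# (CV, LB) pairs copied from the submission history in this notebook.
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])
TARGET = 0.0347

n = len(cv)
fit = stats.linregress(cv, lb)

# Standard error of the intercept from the OLS residuals.
residuals = lb - (fit.slope * cv + fit.intercept)
resid_var = np.sum(residuals**2) / (n - 2)
sxx = np.sum((cv - cv.mean())**2)
intercept_se = np.sqrt(resid_var * (1.0 / n + cv.mean()**2 / sxx))

# Two-sided 95% interval using the t distribution with n - 2 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_low = fit.intercept - t_crit * intercept_se
ci_high = fit.intercept + t_crit * intercept_se

print(f'Intercept: {fit.intercept:.4f} +/- {t_crit * intercept_se:.4f} (95% CI)')
print(f'Interval: [{ci_low:.4f}, {ci_high:.4f}], target = {TARGET}')
print(f'Observed CV range: [{cv.min():.4f}, {cv.max():.4f}]')
```

If the whole interval sits above the target, the conclusion does not hinge on noise in the fit. Note that CV = 0 lies well outside the observed CV range, so the intercept is an extrapolation either way.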


In [2]:
# List submissions from best to worst LB, with the LB/CV ratio for each
print('=== Submission Analysis ===')
for name, cv, lb in sorted(submissions, key=lambda x: x[2]):
    ratio = lb / cv
    print(f'{name}: CV={cv:.4f}, LB={lb:.4f}, Ratio={ratio:.1f}x')

=== Submission Analysis ===
exp_030: CV=0.0083, LB=0.0877, Ratio=10.6x
exp_026: CV=0.0085, LB=0.0887, Ratio=10.4x
exp_024: CV=0.0087, LB=0.0893, Ratio=10.3x
exp_012: CV=0.0090, LB=0.0913, Ratio=10.1x
exp_007: CV=0.0093, LB=0.0932, Ratio=10.0x
exp_009: CV=0.0092, LB=0.0936, Ratio=10.2x
exp_006: CV=0.0097, LB=0.0946, Ratio=9.8x
exp_005: CV=0.0104, LB=0.0969, Ratio=9.3x
exp_035: CV=0.0098, LB=0.0970, Ratio=9.9x
exp_003: CV=0.0105, LB=0.0972, Ratio=9.3x
exp_000: CV=0.0111, LB=0.0982, Ratio=8.8x
exp_001: CV=0.0123, LB=0.1065, Ratio=8.7x


In [3]:
# Check what approaches have been tried
print('=== Approaches Tried ===')
approaches = {
    'MLP with Spange': ['exp_000', 'exp_006', 'exp_007'],
    'LightGBM': ['exp_001'],
    'DRFP features': ['exp_002', 'exp_003'],
    'Large ensemble': ['exp_005'],
    'Simpler models': ['exp_006', 'exp_007', 'exp_008', 'exp_010'],
    'Ridge/Kernel Ridge': ['exp_033', 'exp_034'],
    'GP ensemble': ['exp_030', 'exp_031', 'exp_032', 'exp_035'],
    'Learned embeddings': ['exp_039 - FAILED'],
    'Similarity weighting': ['exp_037 - FAILED'],
    'Minimal features': ['exp_038 - FAILED'],
}

for approach, exps in approaches.items():
    print(f'  {approach}: {exps}')

=== Approaches Tried ===
  MLP with Spange: ['exp_000', 'exp_006', 'exp_007']
  LightGBM: ['exp_001']
  DRFP features: ['exp_002', 'exp_003']
  Large ensemble: ['exp_005']
  Simpler models: ['exp_006', 'exp_007', 'exp_008', 'exp_010']
  Ridge/Kernel Ridge: ['exp_033', 'exp_034']
  GP ensemble: ['exp_030', 'exp_031', 'exp_032', 'exp_035']
  Learned embeddings: ['exp_039 - FAILED']
  Similarity weighting: ['exp_037 - FAILED']
  Minimal features: ['exp_038 - FAILED']


In [4]:
# What HASN'T been tried?
print('=== Approaches NOT Yet Tried ===')
print('''
1. GNN (Graph Neural Network) - PRIORITY 1
   - PyTorch Geometric and AttentiveFP are available
   - Can generalize to unseen solvents through molecular structure
   - GNN benchmark achieved MSE 0.0039
   
2. k-NN with molecular similarity - PRIORITY 2
   - Use Tanimoto similarity on DRFP fingerprints
   - Weight predictions by similarity to training solvents
   - Simple to implement
   
3. Meta-learning / Few-shot learning - PRIORITY 3
   - Pre-train on all data, fine-tune on similar solvents
   - May help with distribution shift
   
4. Target transformation - PRIORITY 4
   - Predict in logit space
   - Use compositional constraints (yields sum to ~1)
''')

=== Approaches NOT Yet Tried ===

1. GNN (Graph Neural Network) - PRIORITY 1
   - PyTorch Geometric and AttentiveFP are available
   - Can generalize to unseen solvents through molecular structure
   - GNN benchmark achieved MSE 0.0039
   
2. k-NN with molecular similarity - PRIORITY 2
   - Use Tanimoto similarity on DRFP fingerprints
   - Weight predictions by similarity to training solvents
   - Simple to implement
   
3. Meta-learning / Few-shot learning - PRIORITY 3
   - Pre-train on all data, fine-tune on similar solvents
   - May help with distribution shift
   
4. Target transformation - PRIORITY 4
   - Predict in logit space
   - Use compositional constraints (yields sum to ~1)
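
The k-NN idea in item 2 can be sketched in a few lines. This is a minimal illustration, assuming binary fingerprint vectors and a similarity-weighted average over the `k` nearest training solvents; the toy fingerprints and yields are made up, and `tanimoto` and `knn_predict` are hypothetical helper names, not functions from the project code.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprint vectors."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention for two empty fingerprints
    return np.logical_and(a, b).sum() / union

def knn_predict(query_fp, train_fps, train_targets, k=3):
    """Similarity-weighted average of the k most similar training solvents."""
    sims = np.array([tanimoto(query_fp, fp) for fp in train_fps])
    top = np.argsort(sims)[::-1][:k]
    weights = sims[top]
    if weights.sum() == 0:
        return np.mean(train_targets[top], axis=0)  # fall back to a plain mean
    return np.average(train_targets[top], axis=0, weights=weights)

# Toy data: 4 training solvents, 8-bit fingerprints, one yield each.
train_fps = np.array([
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 1],
])
train_targets = np.array([0.80, 0.70, 0.20, 0.30])
query_fp = np.array([1, 1, 0, 0, 1, 0, 1, 0])

prediction = knn_predict(query_fp, train_fps, train_targets, k=2)
print(f'Predicted yield: {prediction:.3f}')
```

With real data the rows of `train_fps` would be the fingerprints of the training solvents, and the prediction for a held-out solvent would lean on its structurally closest neighbours rather than on a global model.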



In [5]:
# Check that the GNN dependencies import and that RDKit can parse a test SMILES
print('=== GNN Implementation Check ===')

try:
    import torch
    import torch_geometric
    from torch_geometric.nn.models import AttentiveFP
    from rdkit import Chem
    from rdkit.Chem import AllChem
    
    print(f'PyTorch version: {torch.__version__}')
    print(f'PyTorch Geometric version: {torch_geometric.__version__}')
    print('AttentiveFP: Available')
    print('RDKit: Available')
    
    # Test SMILES to graph conversion
    smiles = 'C1CCCCC1'  # Cyclohexane
    mol = Chem.MolFromSmiles(smiles)
    print(f'\nTest molecule: {smiles}')
    print(f'  Atoms: {mol.GetNumAtoms()}')
    print(f'  Bonds: {mol.GetNumBonds()}')
    
    print('\nGNN implementation is FEASIBLE!')
except Exception as e:
    print(f'GNN implementation NOT feasible: {e}')

=== GNN Implementation Check ===


PyTorch version: 2.2.0+cu118
PyTorch Geometric version: 2.7.0
AttentiveFP: Available
RDKit: Available

Test molecule: C1CCCCC1
  Atoms: 6
  Bonds: 6

GNN implementation is FEASIBLE!
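
The check above only confirms that the libraries import. As a rough sketch of the next step, the function below turns a SMILES string into node-feature and edge-index arrays using RDKit. The three atom features are an illustrative choice rather than a tested featurization, and `smiles_to_graph` is a hypothetical helper name.

```python
import numpy as np

try:
    from rdkit import Chem
except ImportError:  # keep the sketch importable without RDKit
    Chem = None

def smiles_to_graph(smiles):
    """Return (node_features, edge_index) arrays for one molecule."""
    if Chem is None:
        raise ImportError('RDKit is required for SMILES parsing')
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f'Could not parse SMILES: {smiles!r}')

    # Illustrative atom features: atomic number, explicit degree, aromatic flag.
    node_features = np.array(
        [[atom.GetAtomicNum(), atom.GetDegree(), int(atom.GetIsAromatic())]
         for atom in mol.GetAtoms()],
        dtype=np.float32,
    )

    # Store each bond in both directions, as message passing expects.
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges += [(i, j), (j, i)]
    edge_index = np.array(edges, dtype=np.int64).T.reshape(2, -1)
    return node_features, edge_index

if Chem is not None:
    x, edge_index = smiles_to_graph('C1CCCCC1')  # cyclohexane
    print(x.shape, edge_index.shape)
```

These arrays could then be converted with `torch.from_numpy` and wrapped in a `torch_geometric.data.Data` object. AttentiveFP also consumes bond features through `edge_attr`, so a per-bond feature array would need to be added before the model can be trained.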


In [6]:
# Load SMILES data
DATA_PATH = '/home/data'
smiles_df = pd.read_csv(f'{DATA_PATH}/smiles_lookup.csv')
print('=== SMILES Lookup ===')
print(smiles_df.head(10))
print(f'\nTotal solvents with SMILES: {len(smiles_df)}')

=== SMILES Lookup ===
                        SOLVENT NAME          solvent smiles
0                        Cyclohexane                C1CCCCC1
1                      Ethyl Acetate               O=C(OCC)C
2                        Acetic Acid                 CC(=O)O
3  2-Methyltetrahydrofuran [2-MeTHF]              O1C(C)CCC1
4  1,1,1,3,3,3-Hexafluoropropan-2-ol  C(C(F)(F)F)(C(F)(F)F)O
5                  IPA [Propan-2-ol]                  CC(O)C
6                            Ethanol                     CCO
7                           Methanol                      CO
8   Ethylene Glycol [1,2-Ethanediol]                    OCCO
9                       Acetonitrile                    CC#N

Total solvents with SMILES: 26


In [7]:
# Check if all solvents in the data have SMILES
X_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
X_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

single_solvents = set(X_single['SOLVENT NAME'].unique())
full_solvents_a = set(X_full['SOLVENT A NAME'].unique())
full_solvents_b = set(X_full['SOLVENT B NAME'].unique())
all_solvents = single_solvents | full_solvents_a | full_solvents_b

smiles_solvents = set(smiles_df['SOLVENT NAME'].values)

missing = all_solvents - smiles_solvents
print(f'Total unique solvents in data: {len(all_solvents)}')
print(f'Solvents with SMILES: {len(smiles_solvents)}')
print(f'Missing SMILES: {len(missing)}')
if missing:
    print(f'Missing solvents: {missing}')

Total unique solvents in data: 24
Solvents with SMILES: 26
Missing SMILES: 0


In [8]:
# Analyze the CV-LB gap more carefully
print('=== CV-LB Gap Analysis ===')

# The absolute gap is not constant: it grows with CV, while the LB/CV ratio shrinks
for name, cv, lb in sorted(submissions, key=lambda x: x[1]):
    gap = lb - cv
    ratio = lb / cv
    print(f'{name}: CV={cv:.4f}, LB={lb:.4f}, Gap={gap:.4f}, Ratio={ratio:.1f}x')

print(f'\nAverage gap: {np.mean(lb_scores - cv_scores):.4f}')
print(f'Average ratio: {np.mean(lb_scores / cv_scores):.1f}x')

=== CV-LB Gap Analysis ===
exp_030: CV=0.0083, LB=0.0877, Gap=0.0794, Ratio=10.6x
exp_026: CV=0.0085, LB=0.0887, Gap=0.0802, Ratio=10.4x
exp_024: CV=0.0087, LB=0.0893, Gap=0.0806, Ratio=10.3x
exp_012: CV=0.0090, LB=0.0913, Gap=0.0823, Ratio=10.1x
exp_009: CV=0.0092, LB=0.0936, Gap=0.0844, Ratio=10.2x
exp_007: CV=0.0093, LB=0.0932, Gap=0.0839, Ratio=10.0x
exp_006: CV=0.0097, LB=0.0946, Gap=0.0849, Ratio=9.8x
exp_035: CV=0.0098, LB=0.0970, Gap=0.0872, Ratio=9.9x
exp_005: CV=0.0104, LB=0.0969, Gap=0.0865, Ratio=9.3x
exp_003: CV=0.0105, LB=0.0972, Gap=0.0867, Ratio=9.3x
exp_000: CV=0.0111, LB=0.0982, Gap=0.0871, Ratio=8.8x
exp_001: CV=0.0123, LB=0.1065, Gap=0.0942, Ratio=8.7x

Average gap: 0.0848
Average ratio: 9.8x


In [9]:
# Key insight: the CV-LB relationship is linear with high R² (0.95),
# so LB ranks our models consistently with CV.
# The problem is that the intercept is too high.
#
# What could cause a high intercept?
# 1. Systematic bias in predictions (always over/under-predicting)
# 2. Different data distribution on LB
# 3. Different evaluation metric on LB
#
# Caveat: CV = 0 lies far outside the observed CV range (0.0083-0.0123),
# so the intercept is an extrapolation. The linear fit is consistent with
# a systematic bias, but it does not rule out causes 2 and 3.

print('=== Hypothesis: Systematic Bias ===')
print('''
The CV-LB relationship is LB = 4.31*CV + 0.0525

The intercept (0.0525) suggests systematic bias.
This bias is INDEPENDENT of model quality (CV).

Possible causes:
1. Our predictions are systematically biased
2. The LB uses different solvents than our CV
3. The LB has different data distribution

To reduce the intercept, we need to:
1. Change the approach fundamentally (GNN)
2. Calibrate predictions to reduce bias
3. Use a different feature representation
''')

=== Hypothesis: Systematic Bias ===

The CV-LB relationship is LB = 4.31*CV + 0.0525

The intercept (0.0525) suggests systematic bias.
This bias is INDEPENDENT of model quality (CV).

Possible causes:
1. Our predictions are systematically biased
2. The LB uses different solvents than our CV
3. The LB has different data distribution

To reduce the intercept, we need to:
1. Change the approach fundamentally (GNN)
2. Calibrate predictions to reduce bias
3. Use a different feature representation
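
One way to probe the systematic-bias hypothesis locally, assuming the score behaves like a mean squared error (an assumption; the LB metric is not stated here), is to split out-of-fold error into a squared mean error and an error variance. The arrays below are synthetic placeholders, not results from this project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for out-of-fold targets and predictions.
y_true = rng.uniform(0.0, 1.0, size=200)
# An offset of 0.05 is injected on purpose to mimic a systematic bias.
y_pred = y_true + 0.05 + rng.normal(0.0, 0.08, size=200)

errors = y_pred - y_true
mse = np.mean(errors**2)
bias = np.mean(errors)      # signed mean error: the systematic part
variance = np.var(errors)   # spread of the errors around that mean

print(f'MSE          : {mse:.5f}')
print(f'bias^2       : {bias**2:.5f}')
print(f'error var    : {variance:.5f}')
print(f'bias^2 share : {bias**2 / mse:.1%}')
```

If the squared bias turns out to be a small share of the out-of-fold MSE on held-out solvents, a constant offset correction is unlikely to move the score much, which would point toward distribution shift rather than miscalibration.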



In [10]:
# Final recommendation
print('=== FINAL RECOMMENDATION ===')
print('''
**PRIORITY 1: Implement GNN with AttentiveFP**

Rationale:
- GNN benchmark achieved MSE 0.0039 on this exact dataset
- GNN learns from molecular STRUCTURE, not IDENTITY
- Can generalize to unseen solvents
- May have a DIFFERENT CV-LB relationship

Implementation:
1. Convert SMILES to molecular graphs using RDKit
2. Use AttentiveFP from PyTorch Geometric
3. Combine with kinetics features (1/T, ln(t))
4. Train with leave-one-solvent-out CV

**PRIORITY 2: k-NN with Tanimoto Similarity**

Rationale:
- Simple to implement
- Uses molecular fingerprints for similarity
- May help with distribution shift

**DO NOT SUBMIT** the current submission file.
It's from exp_038 (minimal features) with CV 0.009825,
which is WORSE than our best CV (0.008194).

**Submissions remaining:** 4
**Best LB:** 0.0877
**Target:** 0.0347
**Gap:** 2.53x
''')

=== FINAL RECOMMENDATION ===

**PRIORITY 1: Implement GNN with AttentiveFP**

Rationale:
- GNN benchmark achieved MSE 0.0039 on this exact dataset
- GNN learns from molecular STRUCTURE, not IDENTITY
- Can generalize to unseen solvents
- May have a DIFFERENT CV-LB relationship

Implementation:
1. Convert SMILES to molecular graphs using RDKit
2. Use AttentiveFP from PyTorch Geometric
3. Combine with kinetics features (1/T, ln(t))
4. Train with leave-one-solvent-out CV

**PRIORITY 2: k-NN with Tanimoto Similarity**

Rationale:
- Simple to implement
- Uses molecular fingerprints for similarity
- May help with distribution shift

**DO NOT SUBMIT** the current submission file.
It's from exp_038 (minimal features) with CV 0.009825,
which is WORSE than our best CV (0.008194).

**Submissions remaining:** 4
**Best LB:** 0.0877
**Target:** 0.0347
**Gap:** 2.53x
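
The leave-one-solvent-out loop from implementation step 4 can be prototyped independently of the model. The sketch below uses a plain least-squares fit on the kinetics features (1/T, ln(t)) as a stand-in for the GNN. The data are synthetic, and the units (Kelvin, minutes) and helper names are assumptions made for illustration.

```python
import numpy as np

def kinetics_features(temp_kelvin, time_min):
    """Design matrix with a bias column plus 1/T and ln(t)."""
    return np.column_stack([
        np.ones_like(temp_kelvin),
        1.0 / temp_kelvin,
        np.log(time_min),
    ])

def leave_one_solvent_out(X, y, solvents):
    """Fit on all solvents but one, score MSE on the held-out solvent."""
    scores = {}
    for held_out in np.unique(solvents):
        test = solvents == held_out
        coef, *_ = np.linalg.lstsq(X[~test], y[~test], rcond=None)
        scores[held_out] = float(np.mean((X[test] @ coef - y[test]) ** 2))
    return scores

# Synthetic placeholder data: 4 solvents, 10 runs each.
rng = np.random.default_rng(0)
solvents = np.repeat(['Ethanol', 'Methanol', 'Acetonitrile', 'Cyclohexane'], 10)
temp_kelvin = rng.uniform(300.0, 400.0, size=solvents.size)
time_min = rng.uniform(1.0, 60.0, size=solvents.size)
y = rng.uniform(0.0, 1.0, size=solvents.size)

X = kinetics_features(temp_kelvin, time_min)
scores = leave_one_solvent_out(X, y, solvents)
for name, mse in scores.items():
    print(f'{name}: MSE={mse:.4f}')
print(f'Mean LOSO MSE: {np.mean(list(scores.values())):.4f}')
```

Keeping the per-solvent scores, rather than only their mean, makes it easier to see whether a few hard solvents dominate the error, which is the situation in which a structure-aware model would be expected to help most.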

