# Loop 40 Analysis: Critical Path to Target

## Current Situation
- **Best CV**: 0.008194 (exp_032/035/036)
- **Best LB**: 0.0877 (exp_030)
- **Target**: 0.0347
- **Gap**: 2.53x (best LB / target)
- **Submissions remaining**: 4

## Critical Issue
A linear fit of LB against CV across the 12 submissions has an intercept of 0.0525, which is above the target of 0.0347.
Extrapolating that fit, even CV = 0 would give LB ≈ 0.0525, which is WORSE than the target.

## Key Questions
1. Why is the intercept so high?
2. What fundamentally different approach could change the CV-LB relationship?
3. What have we NOT tried that could work?

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Submission history
submissions = [
    ('exp_000', 0.0111, 0.0982),
    ('exp_001', 0.0123, 0.1065),
    ('exp_003', 0.0105, 0.0972),
    ('exp_005', 0.0104, 0.0969),
    ('exp_006', 0.0097, 0.0946),
    ('exp_007', 0.0093, 0.0932),
    ('exp_009', 0.0092, 0.0936),
    ('exp_012', 0.0090, 0.0913),
    ('exp_024', 0.0087, 0.0893),
    ('exp_026', 0.0085, 0.0887),
    ('exp_030', 0.0083, 0.0877),
    ('exp_035', 0.0098, 0.0970),
]

cv_scores = np.array([s[1] for s in submissions])
lb_scores = np.array([s[2] for s in submissions])

# Linear fit
slope, intercept, r_value, p_value, std_err = stats.linregress(cv_scores, lb_scores)

print('=== CV-LB Relationship Analysis ===')
print(f'Linear fit: LB = {slope:.2f}*CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
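TARGET = 0.0347  # target LB score
required_cv = (TARGET - intercept) / slope  # CV at which the fitted line reaches the target

print(f'\nIntercept: {intercept:.4f}')
print(f'Target: {TARGET}')
print(f'Gap: {intercept / TARGET:.2f}x')
print(f'\nTo reach target {TARGET}, we need CV = {required_cv:.4f}')
# Only draw the conclusion when the required CV really is negative
if required_cv < 0:
    print('This is NEGATIVE, meaning current approach CANNOT reach target!')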

=== CV-LB Relationship Analysis ===
Linear fit: LB = 4.31*CV + 0.0525
R² = 0.9505

Intercept: 0.0525
Target: 0.0347
Gap: 1.51x

To reach target 0.0347, we need CV = -0.0041
This is NEGATIVE, meaning current approach CANNOT reach target!
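
The conclusion above depends on extrapolating the fit down to CV = 0, well below the smallest CV in the submission history (0.0083). As a rough check on how much weight that extrapolation can bear, the sketch below refits the same 12 points by ordinary least squares and also reports the standard error of the intercept. The helper name `fit_with_intercept_se` is illustrative and not part of the original analysis, and the standard-error formula assumes independent residuals with equal variance, which may not hold for leaderboard scores.

```python
import math

# (CV, LB) pairs copied from the submission history above
cv = [0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
      0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098]
lb = [0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
      0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970]


def fit_with_intercept_se(x, y):
    """Ordinary least squares y = slope*x + intercept, plus the intercept's standard error."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance with n - 2 degrees of freedom
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    resid_var = sse / (n - 2)
    intercept_se = math.sqrt(resid_var * (1.0 / n + x_mean ** 2 / sxx))
    return slope, intercept, intercept_se


slope, intercept, intercept_se = fit_with_intercept_se(cv, lb)
print(f'slope = {slope:.2f}, intercept = {intercept:.4f} +/- {intercept_se:.4f} (1 SE)')
```

If the target sits many standard errors below the intercept, the "cannot reach target" reading is fairly robust to noise in the fit; if not, the conclusion should be treated as tentative. Either way, it still assumes the relationship stays linear outside the observed CV range.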


In [2]:
# What's the BEST possible LB we could achieve with current approach?
# If we somehow got CV = 0 (impossible), LB would still be 0.0525

print('=== Best Possible LB with Current Approach ===')
print(f'If CV = 0.000: LB = {intercept:.4f}')
print(f'If CV = 0.005: LB = {slope * 0.005 + intercept:.4f}')
print(f'If CV = 0.008: LB = {slope * 0.008 + intercept:.4f}')
print(f'\nTarget: 0.0347')
print(f'\nCONCLUSION: Current approach family CANNOT reach target.')
print('We need a fundamentally different approach that changes the CV-LB relationship.')

=== Best Possible LB with Current Approach ===
If CV = 0.000: LB = 0.0525
If CV = 0.005: LB = 0.0741
If CV = 0.008: LB = 0.0870

Target: 0.0347

CONCLUSION: Current approach family CANNOT reach target.
We need a fundamentally different approach that changes the CV-LB relationship.


In [3]:
# What approaches have been tried?
print('=== Approaches Tried (40 experiments) ===')
approaches = {
    'MLP with Spange': 'exp_000, exp_006, exp_007 - BASELINE',
    'LightGBM': 'exp_001 - Worse than MLP',
    'DRFP features': 'exp_002, exp_003 - Combined with Spange helps',
    'Large ensemble (15 models)': 'exp_005 - Marginal improvement',
    'Simpler models': 'exp_006, exp_007, exp_008, exp_010 - Best CV',
    'Ridge/Kernel Ridge': 'exp_033, exp_034 - Much worse',
    'GP ensemble': 'exp_030, exp_031, exp_032, exp_035 - Best LB',
    'Learned embeddings': 'exp_039 - FAILED (test solvent never seen)',
    'Similarity weighting': 'exp_037 - FAILED (implementation bug)',
    'Minimal features': 'exp_038 - FAILED (DRFP features ARE valuable)',
    'Weighted loss': 'exp_026 - Helps slightly',
    'Per-target models': 'exp_025 - Helps slightly',
    'ACS PCA features': 'exp_023, exp_024 - Helps slightly',
}

for approach, result in approaches.items():
    print(f'  {approach}: {result}')

=== Approaches Tried (40 experiments) ===
  MLP with Spange: exp_000, exp_006, exp_007 - BASELINE
  LightGBM: exp_001 - Worse than MLP
  DRFP features: exp_002, exp_003 - Combined with Spange helps
  Large ensemble (15 models): exp_005 - Marginal improvement
  Simpler models: exp_006, exp_007, exp_008, exp_010 - Best CV
  Ridge/Kernel Ridge: exp_033, exp_034 - Much worse
  GP ensemble: exp_030, exp_031, exp_032, exp_035 - Best LB
  Learned embeddings: exp_039 - FAILED (test solvent never seen)
  Similarity weighting: exp_037 - FAILED (implementation bug)
  Minimal features: exp_038 - FAILED (DRFP features ARE valuable)
  Weighted loss: exp_026 - Helps slightly
  Per-target models: exp_025 - Helps slightly
  ACS PCA features: exp_023, exp_024 - Helps slightly


In [4]:
# What approaches have NOT been tried?
print('=== Approaches NOT Yet Tried ===')
print('''
1. GNN (Graph Neural Network) - HIGHEST PRIORITY
   - GNN benchmark achieved MSE 0.0039 on this exact dataset
   - Can generalize to unseen solvents through molecular structure
   - PyTorch Geometric and AttentiveFP are available
   - May have a DIFFERENT CV-LB relationship

2. k-NN with Tanimoto Similarity - MEDIUM PRIORITY
   - Use molecular fingerprints for similarity
   - Weight predictions by similarity to training solvents
   - Simple to implement

3. GroupKFold(5) instead of Leave-One-Out - INTERESTING
   - One public kernel (lishellliang) uses GroupKFold(5)
   - May be what Kaggle evaluation uses
   - Could explain part of the CV-LB gap

4. Calibration / Post-processing - LOW PRIORITY
   - Isotonic regression
   - Temperature scaling
   - May reduce intercept
''')

=== Approaches NOT Yet Tried ===

1. GNN (Graph Neural Network) - HIGHEST PRIORITY
   - GNN benchmark achieved MSE 0.0039 on this exact dataset
   - Can generalize to unseen solvents through molecular structure
   - PyTorch Geometric and AttentiveFP are available
   - May have a DIFFERENT CV-LB relationship

2. k-NN with Tanimoto Similarity - MEDIUM PRIORITY
   - Use molecular fingerprints for similarity
   - Weight predictions by similarity to training solvents
   - Simple to implement

3. GroupKFold(5) instead of Leave-One-Out - INTERESTING
   - One public kernel (lishellliang) uses GroupKFold(5)
   - May be what Kaggle evaluation uses
   - Could explain part of the CV-LB gap

4. Calibration / Post-processing - LOW PRIORITY
   - Isotonic regression
   - Temperature scaling
   - May reduce intercept
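
To make option 2 concrete, here is a minimal sketch of similarity-weighted k-NN using Tanimoto similarity. It is written against plain Python sets of on-bit indices so that it runs without RDKit; in practice the sets would presumably come from a real fingerprint of each solvent's SMILES. The function names `tanimoto` and `knn_predict`, the value of `k`, and the toy fingerprints are all assumptions made for illustration, not something taken from the experiments above.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        # Two empty fingerprints: treated as dissimilar by convention in this sketch
        return 0.0
    return len(a & b) / len(a | b)


def knn_predict(query_fp, train_fps, train_targets, k=3):
    """Similarity-weighted mean of the targets of the k most similar training solvents."""
    scored = sorted(
        ((tanimoto(query_fp, fp), y) for fp, y in zip(train_fps, train_targets)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0.0:
        # No structural overlap with any neighbour: fall back to an unweighted mean
        return sum(y for _, y in scored) / len(scored)
    return sum(sim * y for sim, y in scored) / total


# Toy fingerprints (hypothetical on-bit sets, NOT real solvent data)
train_fps = [{1, 2, 3}, {2, 3, 4}, {7, 8, 9}]
train_targets = [0.10, 0.30, 0.90]
prediction = knn_predict({1, 2, 3, 4}, train_fps, train_targets, k=2)
print(f'{prediction:.3f}')
```

Weighting by similarity means a held-out solvent borrows mostly from its closest structural neighbours, which is the property that might help under leave-one-solvent-out evaluation.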



In [5]:
# Check if GNN can be implemented
print('=== GNN Implementation Check ===')

try:
    import torch
    import torch_geometric
    from torch_geometric.nn.models import AttentiveFP
    from rdkit import Chem
    from rdkit.Chem import AllChem
    
    print(f'PyTorch version: {torch.__version__}')
    print(f'PyTorch Geometric version: {torch_geometric.__version__}')
    print('AttentiveFP: Available')
    print('RDKit: Available')
    
    # Test SMILES to graph conversion
    smiles = 'C1CCCCC1'  # Cyclohexane
    mol = Chem.MolFromSmiles(smiles)
    print(f'\nTest molecule: {smiles}')
    print(f'  Atoms: {mol.GetNumAtoms()}')
    print(f'  Bonds: {mol.GetNumBonds()}')
    
    print('\nGNN implementation is FEASIBLE!')
except Exception as e:
    print(f'GNN implementation NOT feasible: {e}')

=== GNN Implementation Check ===


PyTorch version: 2.2.0+cu118
PyTorch Geometric version: 2.7.0
AttentiveFP: Available
RDKit: Available

Test molecule: C1CCCCC1
  Atoms: 6
  Bonds: 6

GNN implementation is FEASIBLE!


In [6]:
# Check SMILES data availability
DATA_PATH = '/home/data'
smiles_df = pd.read_csv(f'{DATA_PATH}/smiles_lookup.csv')
print('=== SMILES Lookup ===')
print(smiles_df.head(10))
print(f'\nTotal solvents with SMILES: {len(smiles_df)}')

=== SMILES Lookup ===
                        SOLVENT NAME          solvent smiles
0                        Cyclohexane                C1CCCCC1
1                      Ethyl Acetate               O=C(OCC)C
2                        Acetic Acid                 CC(=O)O
3  2-Methyltetrahydrofuran [2-MeTHF]              O1C(C)CCC1
4  1,1,1,3,3,3-Hexafluoropropan-2-ol  C(C(F)(F)F)(C(F)(F)F)O
5                  IPA [Propan-2-ol]                  CC(O)C
6                            Ethanol                     CCO
7                           Methanol                      CO
8   Ethylene Glycol [1,2-Ethanediol]                    OCCO
9                       Acetonitrile                    CC#N

Total solvents with SMILES: 26


In [7]:
# Check if all solvents in the data have SMILES
X_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
X_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

single_solvents = set(X_single['SOLVENT NAME'].unique())
full_solvents_a = set(X_full['SOLVENT A NAME'].unique())
full_solvents_b = set(X_full['SOLVENT B NAME'].unique())
all_solvents = single_solvents | full_solvents_a | full_solvents_b

smiles_solvents = set(smiles_df['SOLVENT NAME'].values)

missing = all_solvents - smiles_solvents
print(f'Total unique solvents in data: {len(all_solvents)}')
print(f'Solvents with SMILES: {len(smiles_solvents)}')
print(f'Missing SMILES: {len(missing)}')
if missing:
    print(f'Missing solvents: {missing}')

Total unique solvents in data: 24
Solvents with SMILES: 26
Missing SMILES: 0


In [8]:
# Analyze the GroupKFold(5) approach from lishellliang kernel
print('=== GroupKFold(5) Analysis ===')
print('''
The lishellliang kernel overwrites the leave-one-out splits to use GroupKFold(5).
This is significant because:

1. GroupKFold(5) means 5 folds instead of 24 (single solvent) or 13 (full data)
2. Each fold contains ~5 solvents instead of 1
3. This holds out more solvents per fold, so it is a HARDER test than leave-one-out
4. The model sees fewer solvents during training (about 19 of 24 instead of 23 of 24 for single-solvent data)

Possible implications:
- If Kaggle evaluation uses GroupKFold(5), our leave-one-out CV is TOO optimistic
- This could explain part of the CV-LB gap
- We should try GroupKFold(5) to see if it changes the CV-LB relationship

However, the competition description says:
"leaving-out (a) full experiments in the case of mixture solvents, and 
(b) a single solvent out in the case of no mixture solvents"

This suggests leave-one-out is the correct approach.
''')

=== GroupKFold(5) Analysis ===

The lishellliang kernel overwrites the leave-one-out splits to use GroupKFold(5).
This is significant because:

1. GroupKFold(5) means 5 folds instead of 24 (single solvent) or 13 (full data)
2. Each fold contains ~5 solvents instead of 1
3. This holds out more solvents per fold, so it is a HARDER test than leave-one-out
4. The model sees fewer solvents during training (about 19 of 24 instead of 23 of 24 for single-solvent data)

Possible implications:
- If Kaggle evaluation uses GroupKFold(5), our leave-one-out CV is TOO optimistic
- This could explain part of the CV-LB gap
- We should try GroupKFold(5) to see if it changes the CV-LB relationship

However, the competition description says:
"leaving-out (a) full experiments in the case of mixture solvents, and 
(b) a single solvent out in the case of no mixture solvents"

This suggests leave-one-out is the correct approach.
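
To make the comparison between the two validation schemes concrete, the sketch below counts how many distinct solvents end up in the training set of each fold. It is a dependency-free simplification: groups are dealt into folds round-robin, whereas scikit-learn's `GroupKFold` balances folds by sample count and may assign solvents differently (`LeaveOneGroupOut` is the library counterpart of the first helper). The helper names and the placeholder solvent labels are assumptions for illustration only.

```python
def leave_one_group_out(groups):
    """Yield (train_groups, test_groups) with exactly one group held out per fold."""
    unique = sorted(set(groups))
    for held_out in unique:
        yield [g for g in unique if g != held_out], [held_out]


def group_k_fold(groups, n_splits):
    """Yield (train_groups, test_groups) with groups dealt round-robin into n_splits folds."""
    unique = sorted(set(groups))
    folds = [unique[i::n_splits] for i in range(n_splits)]
    for test in folds:
        yield [g for g in unique if g not in test], test


# 24 placeholder labels, matching the solvent count reported for the data above
solvents = [f'solvent_{i:02d}' for i in range(24)]

loo_train_sizes = [len(train) for train, _ in leave_one_group_out(solvents)]
gkf_train_sizes = [len(train) for train, _ in group_k_fold(solvents, n_splits=5)]

print(f'Leave-one-out: {len(loo_train_sizes)} folds, '
      f'{min(loo_train_sizes)} training solvents per fold')
print(f'Group 5-fold:  {len(gkf_train_sizes)} folds, '
      f'{min(gkf_train_sizes)}-{max(gkf_train_sizes)} training solvents per fold')
```

Running both schemes on the same model and comparing the resulting CV scores against the existing LB history would show whether the choice of split accounts for any of the CV-LB gap.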



In [9]:
# Final recommendation
print('=== FINAL RECOMMENDATION ===')
print('''
**PRIORITY 1: Implement GNN with AttentiveFP**

Rationale:
- GNN benchmark achieved MSE 0.0039 on this exact dataset
- GNN learns from molecular STRUCTURE, not IDENTITY
- Can generalize to unseen solvents through graph structure
- May have a DIFFERENT CV-LB relationship (which is what we need)

Implementation:
1. Convert SMILES to molecular graphs using RDKit
2. Use AttentiveFP from PyTorch Geometric
3. Combine with kinetics features (1/T, ln(t))
4. Train with leave-one-solvent-out CV

**PRIORITY 2: k-NN with Tanimoto Similarity**

Rationale:
- Simple to implement
- Uses molecular fingerprints for similarity
- May help with distribution shift

**PRIORITY 3: Try GroupKFold(5) validation**

Rationale:
- One public kernel uses this approach
- May be closer to Kaggle evaluation
- Could explain part of the CV-LB gap

**Submissions remaining:** 4
**Best LB:** 0.0877
**Target:** 0.0347
**Gap:** 2.53x

The target IS achievable - GNN benchmark achieved 0.0039.
We need a fundamentally different approach.
''')

=== FINAL RECOMMENDATION ===

**PRIORITY 1: Implement GNN with AttentiveFP**

Rationale:
- GNN benchmark achieved MSE 0.0039 on this exact dataset
- GNN learns from molecular STRUCTURE, not IDENTITY
- Can generalize to unseen solvents through graph structure
- May have a DIFFERENT CV-LB relationship (which is what we need)

Implementation:
1. Convert SMILES to molecular graphs using RDKit
2. Use AttentiveFP from PyTorch Geometric
3. Combine with kinetics features (1/T, ln(t))
4. Train with leave-one-solvent-out CV

**PRIORITY 2: k-NN with Tanimoto Similarity**

Rationale:
- Simple to implement
- Uses molecular fingerprints for similarity
- May help with distribution shift

**PRIORITY 3: Try GroupKFold(5) validation**

Rationale:
- One public kernel uses this approach
- May be closer to Kaggle evaluation
- Could explain part of the CV-LB gap

**Submissions remaining:** 4
**Best LB:** 0.0877
**Target:** 0.0347
**Gap:** 2.53x

The target IS achievable - GNN benchmark achieved 0.0039.
We n