# Loop 38 Analysis: Learned Embeddings Failure & Path Forward

**Key Question**: Why did learned embeddings fail, and what approach CAN work for leave-one-solvent-out CV?

## The Fundamental Problem

In leave-one-solvent-out CV:
- The test solvent is NEVER seen during training
- The learned embedding for an unseen solvent never receives a gradient update, so it stays at its random initialization
- This is why exp_037 got MSE 0.080438 (9.8x worse than baseline)
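
The split itself can be sketched as follows. This is a minimal illustration, not the competition's official splitter: the function name `leave_one_solvent_out` and the toy solvent labels are assumptions made for the example.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_solvent_out(
    solvents: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out, train_idx, test_idx), holding out one solvent per fold."""
    for held_out in sorted(set(solvents)):
        train_idx = [i for i, s in enumerate(solvents) if s != held_out]
        test_idx = [i for i, s in enumerate(solvents) if s == held_out]
        yield held_out, train_idx, test_idx


# Toy rows: one solvent label per experiment (labels are illustrative only).
rows = ["Methanol", "Methanol", "Ethanol", "Water", "Ethanol", "Water"]

for held_out, train_idx, test_idx in leave_one_solvent_out(rows):
    train_solvents = {rows[i] for i in train_idx}
    # The held-out solvent never appears among the training rows.
    print(held_out, "in training set:", held_out in train_solvents)
```

Every fold prints `False` for membership, which is exactly the condition that breaks identity-based embeddings.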

## Why GNN Would Work (But Learned Embeddings Don't)

1. **GNN**: Learns from molecular STRUCTURE (atoms, bonds, graph topology)
   - Even for unseen solvents, the GNN can process the molecular graph
   - The model learns general patterns about how molecular structure affects yield

2. **Learned Embeddings**: Learn from solvent IDENTITY
   - For unseen solvents, there's no trained embedding to look up
   - The embedding stays at its random initialization
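
A toy version of the failure mode, assuming a deliberately simplified model in which the prediction is just a one-dimensional embedding lookup. The solvent labels and yields are placeholders, not values from the dataset.

```python
import random

random.seed(0)

solvents = ["Methanol", "Ethanol", "Water"]  # illustrative labels
held_out = "Water"

# One learnable scalar "embedding" per solvent identity, randomly initialised.
embedding = {s: random.gauss(0.0, 1.0) for s in solvents}
initial = dict(embedding)

# Toy training rows (solvent, yield); the held-out solvent is excluded.
train = [("Methanol", 0.80), ("Ethanol", 0.60), ("Methanol", 0.78)]

lr = 0.1
for _ in range(200):
    for solvent, y in train:
        pred = embedding[solvent]        # model: prediction = embedding lookup
        grad = 2.0 * (pred - y)          # derivative of the squared error
        embedding[solvent] -= lr * grad  # only the looked-up row is updated

for s in solvents:
    print(f"{s}: init={initial[s]:+.3f} -> trained={embedding[s]:+.3f}")
```

The rows for the training solvents move towards their observed yields, while the held-out row is untouched, so any prediction for it depends only on the random seed.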

## The Real Question

Can we implement a GNN that generalizes to unseen solvents?

In [None]:
import pandas as pd
import numpy as np

# Load data to understand the problem
DATA_PATH = '/home/data'

# Check SMILES availability
smiles_df = pd.read_csv(f'{DATA_PATH}/smiles_lookup.csv')
print('SMILES lookup:')
print(smiles_df.head())
print(f'\nTotal solvents with SMILES: {len(smiles_df)}')

In [None]:
# Check what solvents are in the data
X_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
X_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print('Single solvent data solvents:')
print(sorted(X_single['SOLVENT NAME'].unique()))
print(f'\nTotal: {len(X_single["SOLVENT NAME"].unique())}')

print('\nFull data solvents A:')
print(sorted(X_full['SOLVENT A NAME'].unique()))
print(f'\nTotal A: {len(X_full["SOLVENT A NAME"].unique())}')

print('\nFull data solvents B:')
print(sorted(X_full['SOLVENT B NAME'].unique()))
print(f'\nTotal B: {len(X_full["SOLVENT B NAME"].unique())}')

In [None]:
# Analyze the CV-LB relationship
submissions = [
    ('exp_000', 0.0111, 0.0982),
    ('exp_001', 0.0123, 0.1065),
    ('exp_003', 0.0105, 0.0972),
    ('exp_005', 0.0104, 0.0969),
    ('exp_006', 0.0097, 0.0946),
    ('exp_007', 0.0093, 0.0932),
    ('exp_009', 0.0092, 0.0936),
    ('exp_012', 0.0090, 0.0913),
    ('exp_024', 0.0087, 0.0893),
    ('exp_026', 0.0085, 0.0887),
    ('exp_030', 0.0083, 0.0877),
    ('exp_035', 0.0098, 0.0970),
]

cv_scores = np.array([s[1] for s in submissions])
lb_scores = np.array([s[2] for s in submissions])

# Linear fit
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(cv_scores, lb_scores)

print(f'CV-LB Relationship: LB = {slope:.2f}*CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'\nIntercept: {intercept:.4f}')
print(f'Target: 0.0347')
print(f'\nIntercept > Target: {intercept > 0.0347}')
print(f'\nTo reach target with current relationship:')
required_cv = (0.0347 - intercept) / slope
print(f'Required CV: {required_cv:.6f}')
if required_cv < 0:
    print('IMPOSSIBLE - would require negative CV!')
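
Because CV = 0 lies well outside the observed CV range (0.0083 to 0.0123), the intercept is an extrapolation. One way to gauge how firmly the twelve points pin it down is the standard error of the intercept. The sketch below uses the textbook formula for simple linear regression in plain Python; the names `slope_hat`, `intercept_hat` and `intercept_se` are chosen for this example. Recent SciPy versions should expose the same quantity as the `intercept_stderr` attribute of the `linregress` result, but treat that as an assumption to check against the installed version.

```python
import math

# (CV, LB) pairs copied from the `submissions` list above.
points = [
    (0.0111, 0.0982), (0.0123, 0.1065), (0.0105, 0.0972), (0.0104, 0.0969),
    (0.0097, 0.0946), (0.0093, 0.0932), (0.0092, 0.0936), (0.0090, 0.0913),
    (0.0087, 0.0893), (0.0085, 0.0887), (0.0083, 0.0877), (0.0098, 0.0970),
]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
sxx = sum((x - mean_x) ** 2 for x, _ in points)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

# Ordinary least-squares estimates.
slope_hat = sxy / sxx
intercept_hat = mean_y - slope_hat * mean_x

# Residual variance with n - 2 degrees of freedom (two fitted parameters).
residual_ss = sum((y - (intercept_hat + slope_hat * x)) ** 2 for x, y in points)
sigma2 = residual_ss / (n - 2)

# Standard error of the intercept for simple linear regression.
intercept_se = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))

print(f"intercept = {intercept_hat:.4f} +/- {intercept_se:.4f} (1 s.e.)")
print(f"target    = 0.0347")
```

If the intercept stays above the target by several standard errors, the "cannot be fixed by improving CV" reading is well supported; if not, the conclusion is weaker than the point estimate suggests.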

In [None]:
# Key insight: The CV-LB relationship has a large positive intercept
# This means even CV=0 would give LB=0.0527 > target 0.0347
# 
# This suggests a SYSTEMATIC BIAS in our approach that cannot be fixed by improving CV
#
# What could cause this?
# 1. Our features don't capture something important about the test solvents
# 2. The LB evaluation uses a different distribution than our local CV
# 3. Our models systematically overfit to training solvents

print('=== ANALYSIS OF THE CV-LB GAP ===')
print(f'\nBest CV: 0.0083 (exp_030)')
print(f'Best LB: 0.0877 (exp_030)')
print(f'Gap: {0.0877 / 0.0083:.1f}x')
print(f'\nTarget: 0.0347')
print(f'Gap to target: {0.0877 / 0.0347:.1f}x')
print(f'\nGNN benchmark: 0.0039')
print(f'Gap GNN to target: {0.0347 / 0.0039:.1f}x (target is 8.9x WORSE than GNN)')
print(f'\nThis suggests the target is achievable.')

In [None]:
# What's the difference between our approach and GNN?
#
# Our approach:
# - Fixed features (Spange, DRFP, ACS PCA)
# - Linear mixture interpolation
# - MLP/LGBM/GP ensemble
#
# GNN approach:
# - Learned features from molecular structure
# - Non-linear mixture handling
# - Graph attention for message passing
#
# The key difference: GNN learns GENERAL patterns about molecular structure
# that can transfer to unseen solvents. Our fixed features cannot.

print('=== KEY INSIGHT ===')
print('''\nThe learned embeddings approach failed because it learns IDENTITY, not STRUCTURE.

For leave-one-solvent-out CV:
- Learned embeddings: Test solvent has random embedding (FAILS)
- GNN: Test solvent has meaningful embedding from molecular structure (WORKS)

The GNN can generalize because it learns:
1. How atom types affect yield
2. How bond types affect yield
3. How molecular topology affects yield

These patterns transfer to unseen solvents because the atoms and bonds
are the SAME - just arranged differently.''')
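
The transfer argument can be illustrated with a hand-rolled, single round of message passing. This is a sketch only: it is not AttentiveFP, the parameter values in `ATOM_WEIGHTS` are arbitrary placeholders standing in for weights a GNN would learn, and `embed_molecule` is a name invented for the example.

```python
from typing import Dict, List, Tuple

# Shared per-atom-type parameters (placeholder values, not learned weights).
ATOM_WEIGHTS: Dict[str, List[float]] = {
    "C": [1.0, 0.0],
    "O": [0.0, 1.0],
}


def embed_molecule(atoms: List[str], bonds: List[Tuple[int, int]]) -> List[float]:
    """One round of sum-aggregation message passing, then sum pooling."""
    # Initial node states come from the shared atom-type table.
    h = [list(ATOM_WEIGHTS[a]) for a in atoms]

    # Build an undirected neighbour list from the bond list.
    neighbours: List[List[int]] = [[] for _ in atoms]
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # Each atom adds the previous states of its neighbours to its own state.
    updated = []
    for i, state in enumerate(h):
        msg = [sum(h[j][k] for j in neighbours[i]) for k in range(len(state))]
        updated.append([s + m for s, m in zip(state, msg)])

    # Graph-level readout: sum over atoms.
    dim = len(updated[0])
    return [sum(node[k] for node in updated) for k in range(dim)]


# Heavy-atom graphs (hydrogens omitted): methanol C-O and ethanol C-C-O.
methanol = embed_molecule(["C", "O"], [(0, 1)])
ethanol = embed_molecule(["C", "C", "O"], [(0, 1), (1, 2)])
print("methanol:", methanol)
print("ethanol: ", ethanol)
```

Both molecules receive distinct embeddings from the same parameter table, so a solvent absent from training still gets a structure-dependent representation as long as its atom types were seen.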

In [None]:
# Can we implement a simpler version that captures some of this?
#
# Option 1: Full GNN with AttentiveFP
# - Complex but proven to work
# - Requires SMILES -> molecular graph conversion
# - PyTorch Geometric is available
#
# Option 2: Solvent similarity-based prediction
# - For each test solvent, find k most similar training solvents
# - Weight predictions by similarity
# - Uses Spange/DRFP similarity (which captures molecular structure)
#
# Option 3: Pre-computed molecular fingerprints as features
# - Instead of learned embeddings, use ECFP/Morgan fingerprints
# - These capture molecular structure and transfer to unseen solvents

print('=== VIABLE APPROACHES FOR UNSEEN SOLVENTS ===')
print('''\n1. GNN (AttentiveFP)
   - Learns from molecular structure
   - Can generalize to unseen solvents
   - Complex to implement

2. k-NN with Spange/DRFP similarity
   - For test solvent, find k most similar training solvents
   - Weight predictions by similarity
   - Simple to implement
   - May not change CV-LB relationship

3. Morgan fingerprints (ECFP)
   - Pre-computed molecular fingerprints
   - Capture molecular structure
   - Can transfer to unseen solvents
   - Already have DRFP which is similar

4. Hybrid: Fixed features + GNN embedding
   - Use GNN to get solvent embedding
   - Combine with kinetics features
   - Best of both worlds''')
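
Option 2 is the cheapest to prototype. A minimal sketch, assuming each solvent has a fixed descriptor vector (for example a Spange row) and that cosine similarity is an acceptable similarity measure. The function names, descriptor values and yields below are placeholders invented for the example.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def knn_predict(
    query: Sequence[float],
    train_desc: Dict[str, Sequence[float]],
    train_yield: Dict[str, float],
    k: int = 2,
) -> float:
    """Similarity-weighted mean yield of the k most similar training solvents."""
    scored = sorted(
        ((cosine(query, d), name) for name, d in train_desc.items()),
        reverse=True,
    )[:k]
    weights = [max(sim, 0.0) for sim, _ in scored]
    total = sum(weights)
    if total == 0:
        # Fall back to an unweighted mean if nothing is positively similar.
        return sum(train_yield[name] for _, name in scored) / len(scored)
    return sum(w * train_yield[name] for w, (_, name) in zip(weights, scored)) / total


# Placeholder descriptors and yields for three training solvents.
train_desc = {"solvent_a": [1.0, 0.0], "solvent_b": [0.0, 1.0], "solvent_c": [1.0, 1.0]}
train_yield = {"solvent_a": 0.2, "solvent_b": 0.8, "solvent_c": 0.5}

# Descriptor of the held-out solvent (also a placeholder).
pred = knn_predict([1.0, 0.0], train_desc, train_yield, k=2)
print(f"predicted yield: {pred:.4f}")
```

Since the prediction is a convex combination of training yields, it cannot extrapolate beyond the yields of the chosen neighbours, which is one reason this option may leave the CV-LB relationship unchanged.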

In [None]:
# Let's check if we have the tools for GNN
import torch
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')

try:
    import torch_geometric
    print(f'PyTorch Geometric version: {torch_geometric.__version__}')
    from torch_geometric.nn import AttentiveFP
    print('AttentiveFP available: YES')
except ImportError as e:
    print(f'PyTorch Geometric error: {e}')

try:
    from rdkit import Chem
    print('RDKit available: YES')
except ImportError as e:
    print(f'RDKit error: {e}')

In [None]:
# Test SMILES to molecular graph conversion
from rdkit import Chem
import torch
from torch_geometric.data import Data

def smiles_to_graph(smiles):
    """Convert SMILES to PyTorch Geometric Data object."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    
    # Atom features
    atom_features = []
    for atom in mol.GetAtoms():
        features = [
            atom.GetAtomicNum(),
            atom.GetDegree(),
            atom.GetFormalCharge(),
            atom.GetNumRadicalElectrons(),
            int(atom.GetHybridization()),
            int(atom.GetIsAromatic()),
            atom.GetTotalNumHs(),
            atom.GetNumImplicitHs(),
            int(atom.IsInRing()),
        ]
        atom_features.append(features)
    
    x = torch.tensor(atom_features, dtype=torch.float)
    
    # Edge features
    edge_index = []
    edge_attr = []
    for bond in mol.GetBonds():
        i = bond.GetBeginAtomIdx()
        j = bond.GetEndAtomIdx()
        edge_index.extend([[i, j], [j, i]])
        
        bond_features = [
            int(bond.GetBondType()),
            int(bond.GetIsAromatic()),
            int(bond.IsInRing()),
        ]
        edge_attr.extend([bond_features, bond_features])
    
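    if edge_index:
        edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()
        edge_attr = torch.tensor(edge_attr, dtype=torch.float)
    else:
        # Bond-free molecules (e.g. water, SMILES "O") need explicit empty shapes;
        # otherwise edge_index is 1-D and edge_index.shape[1] raises IndexError.
        edge_index = torch.empty((2, 0), dtype=torch.long)
        edge_attr = torch.empty((0, 3), dtype=torch.float)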
    
    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)

# Test with a few solvents
test_smiles = smiles_df.head(5)
for _, row in test_smiles.iterrows():
    name = row['SOLVENT NAME']
    smiles = row['solvent smiles']
    graph = smiles_to_graph(smiles)
    if graph is not None:
        print(f'{name}: {graph.x.shape[0]} atoms, {graph.edge_index.shape[1]} edges')
    else:
        print(f'{name}: FAILED to parse SMILES')

In [None]:
# The GNN approach looks viable: the solvents tested above convert to
# molecular graphs (only the first five rows of smiles_df were checked).
#
# However, implementing a full GNN is complex and time-consuming.
# Let me think about what's the SIMPLEST approach that could work.
#
# Key insight: The problem is that our features don't capture
# something important about the test solvents.
#
# What if we use a DIFFERENT validation strategy?
# - Instead of leave-one-solvent-out, use random splits
# - This would allow learned embeddings to work
# - But this violates the competition rules!
#
# So we MUST use leave-one-solvent-out CV.
# This means we MUST use features that generalize to unseen solvents.

print('=== CONCLUSION ===')
print('''\nFor leave-one-solvent-out CV, we MUST use features that generalize to unseen solvents.

Options:
1. Fixed molecular descriptors (Spange, DRFP, ACS PCA) - ALREADY DOING THIS
2. GNN features from SMILES - WOULD WORK but complex
3. Learned embeddings - DO NOT WORK (test solvent never seen)

The CV-LB gap is NOT due to feature limitations that learned embeddings could fix.
It's due to something else - possibly:
1. Distribution shift between train and test
2. Systematic bias in our predictions
3. Different evaluation procedure on LB

Since learned embeddings don't work, and GNN is complex,
let's focus on what we CAN do:
1. Better feature engineering with existing descriptors
2. Calibration / post-processing to reduce systematic bias
3. Ensemble diversity to capture different patterns''')

In [None]:
# Let's analyze what's different about the LB evaluation
#
# The CV-LB relationship: LB = 4.27*CV + 0.0527
# This means:
# - LB rises ~4.3 units per unit of CV (the slope); observed LB/CV ratios are ~10x
# - There's a constant offset of 0.0527
#
# The offset suggests systematic bias that doesn't depend on model quality.
# This could be due to:
# 1. Different solvents in LB vs local CV
# 2. Different data distribution
# 3. Different evaluation metric

print('=== ANALYZING THE SYSTEMATIC BIAS ===')
print(f'\nCV-LB relationship: LB = 4.27*CV + 0.0527')
print(f'\nThe intercept (0.0527) represents systematic bias.')
print(f'This is {0.0527 / 0.0347:.1f}x larger than the target!')
print(f'\nPossible causes:')
print('1. LB uses different solvents than local CV')
print('2. LB has different data distribution')
print('3. Our models have systematic prediction bias')
print(f'\nTo reach target (0.0347), we need to either:')
print('1. Reduce the intercept (fix systematic bias)')
print('2. Change the CV-LB relationship entirely (different approach)')
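
Why improving CV alone cannot work follows directly from the fitted line. Writing the fit as $\mathrm{LB} = a \cdot \mathrm{CV} + b$ with target $T$, and taking the rounded values printed above ($a = 4.27$, $b = 0.0527$, $T = 0.0347$), the CV needed to hit the target is:

$$
\mathrm{CV}^{*} = \frac{T - b}{a} = \frac{0.0347 - 0.0527}{4.27} \approx -0.0042 < 0
$$

CV is a mean squared error and cannot be negative, so no point on this line reaches the target. The target becomes reachable only if $b < T$, or if a different approach changes both $a$ and $b$.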

In [None]:
# Final recommendation
print('=== FINAL RECOMMENDATION ===')
print('''\n**The learned embeddings approach FAILED because:**
- Test solvent is never seen during training
- Its embedding is just random initialization
- This is a FUNDAMENTAL flaw for leave-one-solvent-out CV

**What we should try next:**

1. **GNN with AttentiveFP** (PRIORITY 1)
   - Learns from molecular structure, not identity
   - Can generalize to unseen solvents
   - Proven to achieve MSE 0.0039 on this dataset
   - Complex but worth trying

2. **k-NN with Spange similarity** (PRIORITY 2)
   - For test solvent, find k most similar training solvents
   - Weight predictions by similarity
   - Simple to implement
   - May help with systematic bias

3. **Calibration / Post-processing** (PRIORITY 3)
   - Adjust predictions based on solvent similarity to training set
   - Temperature scaling or isotonic regression
   - May reduce the intercept

**What NOT to try:**
- Learned embeddings (PROVEN TO FAIL)
- More regularization (already tried, doesn't help)
- Simpler features (exp_038 proved it hurts)

**Submissions remaining:** 4
**Best LB:** 0.0877
**Target:** 0.0347
**Gap:** 2.53x''')

In [None]:
# Save key findings
print('=== KEY FINDINGS TO RECORD ===')
print('''\n1. Learned embeddings FAIL for leave-one-solvent-out CV because test solvent is never seen during training.

2. GNN works because it learns from molecular STRUCTURE, not IDENTITY.

3. The CV-LB relationship has intercept (0.0527) > target (0.0347), meaning we need to change the relationship, not just improve CV.

4. The solvents tested so far (first five SMILES) convert to molecular graphs using RDKit.

5. PyTorch Geometric and AttentiveFP are available for GNN implementation.''')