# Loop 28 Analysis: Understanding the Gap to Target

## Key Questions:
1. What is the actual target score (0.01727) and how was it achieved?
2. Why is there a 5.5x gap between our best LB (0.0956) and target?
3. What fundamentally different approaches might work?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data
DATA_PATH = '/home/data'
single_df = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
full_df = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print(f"Single solvent: {len(single_df)} rows, {single_df['SOLVENT NAME'].nunique()} solvents")
print(f"Full data: {len(full_df)} rows")
print(f"\nSingle solvent columns: {list(single_df.columns)}")

Single solvent: 656 rows, 24 solvents
Full data: 1227 rows

Single solvent columns: ['EXP NUM', 'Residence Time', 'Temperature', 'SM', 'Product 2', 'Product 3', 'SM SMILES', 'Product 2 SMILES', 'Product 3 SMILES', 'SOLVENT NAME', 'SOLVENT SMILES', 'SOLVENT Ratio', 'Reaction SMILES']


In [2]:
# Analyze the target distribution
print("Target distributions:")
for col in ['Product 2', 'Product 3', 'SM']:
    print(f"\n{col}:")
    print(f"  Single: mean={single_df[col].mean():.4f}, std={single_df[col].std():.4f}")
    print(f"  Full:   mean={full_df[col].mean():.4f}, std={full_df[col].std():.4f}")

Target distributions:

Product 2:
  Single: mean=0.1499, std=0.1431
  Full:   mean=0.1646, std=0.1535

Product 3:
  Single: mean=0.1234, std=0.1315
  Full:   mean=0.1437, std=0.1458

SM:
  Single: mean=0.5222, std=0.3602
  Full:   mean=0.4952, std=0.3794


In [3]:
# Baseline check: MAE of a constant prediction (the per-target mean).
# This is a naive baseline that any useful model must beat, not a theoretical minimum.
print("\nTheoretical analysis:")
print("If we predict the mean for each target:")

for df, name in [(single_df, 'Single'), (full_df, 'Full')]:
    mae_if_mean = 0
    for col in ['Product 2', 'Product 3', 'SM']:
        mae_col = np.mean(np.abs(df[col] - df[col].mean()))
        mae_if_mean += mae_col
    mae_if_mean /= 3
    print(f"  {name}: MAE if predicting mean = {mae_if_mean:.4f}")


Theoretical analysis:
If we predict the mean for each target:
  Single: MAE if predicting mean = 0.1884
  Full: MAE if predicting mean = 0.2070
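
The mean-prediction figure is a naive baseline rather than a lower bound, and for absolute error the best constant predictor is the median, not the mean. A minimal sketch on toy data (the `constant_baseline_mae` helper and the `toy_yields` values are illustrative assumptions, not part of this notebook or the catechol data):

```python
import numpy as np

def constant_baseline_mae(y, stat="mean"):
    """MAE of predicting one constant (mean or median) for every row."""
    y = np.asarray(y, dtype=float)
    center = np.mean(y) if stat == "mean" else np.median(y)
    return float(np.mean(np.abs(y - center)))

# Toy, right-skewed yields (hypothetical values, not the catechol data)
toy_yields = np.array([0.0, 0.0, 0.05, 0.1, 0.9])

mae_mean = constant_baseline_mae(toy_yields, "mean")
mae_median = constant_baseline_mae(toy_yields, "median")
print(f"mean baseline MAE:   {mae_mean:.4f}")
print(f"median baseline MAE: {mae_median:.4f}")
```

On skewed targets the median baseline can be noticeably lower, so it may be the fairer reference point when the metric is MAE.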


In [4]:
# Analyze per-solvent variance
print("\nPer-solvent analysis (single solvent data):")
solvent_stats = []
for solvent in single_df['SOLVENT NAME'].unique():
    subset = single_df[single_df['SOLVENT NAME'] == solvent]
    for col in ['Product 2', 'Product 3', 'SM']:
        solvent_stats.append({
            'solvent': solvent,
            'target': col,
            'mean': subset[col].mean(),
            'std': subset[col].std(),
            'count': len(subset)
        })

stats_df = pd.DataFrame(solvent_stats)
print(f"\nAverage within-solvent std: {stats_df['std'].mean():.4f}")
print("Note: this spread still includes temperature and residence-time effects, so it is not irreducible noise")


Per-solvent analysis (single solvent data):

Average within-solvent std: 0.1466
Note: this spread still includes temperature and residence-time effects, so it is not irreducible noise
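
The within-solvent spread still contains variation driven by temperature and residence time, so it helps to see how the total variance splits into a between-solvent part (explained by solvent identity alone) and a within-solvent part (everything else). A minimal sketch of that decomposition on toy data; `variance_split` and the two-solvent example are hypothetical, not taken from this notebook:

```python
import numpy as np

def variance_split(values, groups):
    """Split total (population) variance into between-group and within-group parts."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand_mean = values.mean()
    between = 0.0
    within = 0.0
    for g in np.unique(groups):
        v = values[groups == g]
        between += len(v) * (v.mean() - grand_mean) ** 2  # spread of group means
        within += np.sum((v - v.mean()) ** 2)             # spread inside each group
    n = len(values)
    return between / n, within / n

# Toy data: two hypothetical solvents with three runs each
y = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
solvent = np.array(["A", "A", "A", "B", "B", "B"])

between, within = variance_split(y, solvent)
total = np.var(y)  # population variance (ddof=0), equals between + within
print(f"between={between:.4f}, within={within:.4f}, total={total:.4f}")
```

Applied per target column, a large within-solvent share would suggest that process conditions, not just solvent identity, carry much of the signal.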


In [5]:
# Compare our best scores with the target (0.01727)
# The target is the BEST score on the leaderboard
# This means someone achieved MAE of 0.01727 on the CV procedure
TARGET = 0.01727
BEST_CV = 0.0623
BEST_LB = 0.0956

print("\n=== TARGET ANALYSIS ===")
print(f"Target score: {TARGET}")
print(f"Our best CV: {BEST_CV}")
print(f"Our best LB: {BEST_LB}")
print("\nGap analysis:")
print(f"  CV to target: {BEST_CV / TARGET:.1f}x worse")
print(f"  LB to target: {BEST_LB / TARGET:.1f}x worse")

print("\n=== WHAT THIS MEANS ===")
print("The target IS achievable - someone achieved it.")
print("Our models are learning SOLVENT-SPECIFIC patterns that don't generalize.")
print("The winning approach must learn GENERALIZABLE chemical principles.")


=== TARGET ANALYSIS ===
Target score: 0.01727
Our best CV: 0.0623
Our best LB: 0.0956

Gap analysis:
  CV to target: 3.6x worse
  LB to target: 5.5x worse

=== WHAT THIS MEANS ===
The target IS achievable - someone achieved it.
Our models are learning SOLVENT-SPECIFIC patterns that don't generalize.
The winning approach must learn GENERALIZABLE chemical principles.
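
One way to test the claim that the models learn solvent-specific patterns is to hold out entire solvents during validation, so every test fold contains only unseen solvents. The sketch below is an assumption about how such a split could be built; the notebook does not show its CV scheme, and `leave_one_group_out` and the solvent labels are hypothetical:

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx), holding out one group at a time."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test_mask = groups == g
        yield g, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical solvent labels, one per row of a training table
solvents = np.array(["MeOH", "MeOH", "THF", "THF", "DMF"])

splits = list(leave_one_group_out(solvents))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx.tolist(), test_idx.tolist())
```

If scores under a split like this track the leaderboard more closely than the current CV does, that would support the generalization explanation.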


In [6]:
# Analyze what features might help generalization
print("\n=== FEATURE ANALYSIS ===")

# Load all feature sets
spange = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv', index_col=0)
acs_pca = pd.read_csv(f'{DATA_PATH}/acs_pca_descriptors_lookup.csv', index_col=0)
drfp = pd.read_csv(f'{DATA_PATH}/drfps_catechol_lookup.csv', index_col=0)
smiles = pd.read_csv(f'{DATA_PATH}/smiles_lookup.csv', index_col=0)

print(f"Spange descriptors: {spange.shape}")
print(f"ACS PCA descriptors: {acs_pca.shape}")
print(f"DRFP: {drfp.shape}")
print(f"SMILES: {smiles.shape}")

print(f"\nSpange columns: {list(spange.columns)}")


=== FEATURE ANALYSIS ===
Spange descriptors: (26, 13)
ACS PCA descriptors: (24, 5)
DRFP: (24, 2048)
SMILES: (26, 1)

Spange columns: ['dielectric constant', 'ET(30)', 'alpha', 'beta', 'pi*', 'SA', 'SB', 'SP', 'SdP', 'N', 'n', 'f(n)', 'delta']


In [7]:
# Check correlation between solvent features and target values
print("\n=== SOLVENT-TARGET CORRELATIONS ===")

# For each solvent, compute mean target values
solvent_means = single_df.groupby('SOLVENT NAME')[['Product 2', 'Product 3', 'SM']].mean()

# Merge with spange features
merged = solvent_means.join(spange)

print("Correlation between Spange features and mean targets:")
for target in ['Product 2', 'Product 3', 'SM']:
    print(f"\n{target}:")
    corrs = []
    for feat in spange.columns:
        corr = merged[target].corr(merged[feat])
        corrs.append((feat, corr))
    corrs.sort(key=lambda x: abs(x[1]), reverse=True)
    for feat, corr in corrs[:5]:
        print(f"  {feat}: {corr:.3f}")


=== SOLVENT-TARGET CORRELATIONS ===
Correlation between Spange features and mean targets:

Product 2:
  ET(30): 0.727
  SA: 0.715
  alpha: 0.690
  delta: 0.620
  dielectric constant: 0.585

Product 3:
  ET(30): 0.667
  SA: 0.638
  alpha: 0.620
  delta: 0.584
  dielectric constant: 0.570

SM:
  alpha: -0.847
  ET(30): -0.844
  SA: -0.842
  SdP: -0.610
  delta: -0.595
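
These correlations are computed over only 24 solvent means, so each coefficient carries substantial sampling uncertainty. A minimal sketch of a percentile bootstrap interval for a Pearson correlation, run on synthetic data; `bootstrap_corr_ci`, the seeds, and the synthetic `x`/`y` are assumptions for illustration only:

```python
import numpy as np

def bootstrap_corr_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the Pearson correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    corrs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        xs, ys = x[idx], y[idx]
        if xs.std() == 0 or ys.std() == 0:
            continue  # correlation is undefined for a constant resample
        corrs.append(np.corrcoef(xs, ys)[0, 1])
    lo, hi = np.percentile(corrs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Synthetic stand-in for 24 solvent-level points (not the real descriptors)
rng = np.random.default_rng(1)
x = rng.normal(size=24)
y = 0.8 * x + rng.normal(scale=0.6, size=24)

r = float(np.corrcoef(x, y)[0, 1])
lo, hi = bootstrap_corr_ci(x, y)
print(f"r = {r:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

Note also that ET(30), SA, and alpha all reflect hydrogen-bond donor ability to some degree, so the three top-ranked features may be largely redundant with each other.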


In [8]:
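# Analyze the CV-LB relationship
# Caveat: only five submissions, and exp_004 and exp_016 have identical scores,
# so the fit rests on four distinct points and extrapolates far below the observed CV range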
print("\n=== CV-LB RELATIONSHIP ===")
submissions = [
    ('exp_004', 0.0623, 0.0956),
    ('exp_006', 0.0688, 0.0991),
    ('exp_016', 0.0623, 0.0956),
    ('exp_021', 0.0901, 0.1231),
    ('exp_026', 0.0810, 0.1124),
]

cvs = [s[1] for s in submissions]
lbs = [s[2] for s in submissions]

from scipy import stats
slope, intercept, r, p, se = stats.linregress(cvs, lbs)

print(f"Linear fit: LB = {slope:.3f} * CV + {intercept:.4f}")
print(f"R² = {r**2:.3f}")
print(f"\nTo achieve target LB 0.01727:")
print(f"  Required CV = (0.01727 - {intercept:.4f}) / {slope:.3f} = {(0.01727 - intercept) / slope:.4f}")
print(f"\nThis is NEGATIVE, meaning our current approach CANNOT reach the target.")
print(f"We need an approach with a DIFFERENT CV-LB relationship.")


=== CV-LB RELATIONSHIP ===


Linear fit: LB = 0.986 * CV + 0.0333
R² = 0.988

To achieve target LB 0.01727:
  Required CV = (0.01727 - 0.0333) / 0.986 = -0.0162

This is NEGATIVE, meaning our current approach CANNOT reach the target.
We need an approach with a DIFFERENT CV-LB relationship.
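
The negative required CV follows directly from the intercept: at CV = 0 the fitted line already predicts LB = 0.0333, which is above the target of 0.01727. Because the fit uses five submissions, two of them with identical scores, it is worth checking how stable that intercept is. A minimal sketch using the (CV, LB) pairs from the `submissions` list above, with `np.polyfit` swapped in for `stats.linregress`:

```python
import numpy as np

# (CV, LB) pairs copied from the submissions list above
cv = np.array([0.0623, 0.0688, 0.0623, 0.0901, 0.0810])
lb = np.array([0.0956, 0.0991, 0.0956, 0.1231, 0.1124])
target = 0.01727

# Ordinary least-squares line through all five points
slope, intercept = np.polyfit(cv, lb, 1)
print(f"all points: LB = {slope:.3f} * CV + {intercept:.4f}")

# Leave-one-out refits: how far does the intercept move when one point is dropped?
loo_intercepts = []
for i in range(len(cv)):
    keep = np.arange(len(cv)) != i
    _, b = np.polyfit(cv[keep], lb[keep], 1)
    loo_intercepts.append(float(b))

print("leave-one-out intercepts:", np.round(loo_intercepts, 4))
print("intercept above target in every refit:",
      all(b > target for b in loo_intercepts))
```

The observed CV values span only 0.0623 to 0.0901, so any statement about CV near zero is an extrapolation and should be read as indicative rather than conclusive.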


In [9]:
# What approaches might have different CV-LB relationship?
print("\n=== APPROACHES THAT MIGHT WORK ===")
print("""
1. GRAPH NEURAL NETWORKS (GNN)
   - Learn molecular structure patterns that generalize
   - Can capture solvent-solute interactions
   - Research shows GNNs achieve best results on molecular property prediction
   
2. PRETRAINED MOLECULAR EMBEDDINGS
   - Use embeddings from models trained on millions of molecules
   - Transfer learning from large chemical databases
   - ChemBERTa, MolBERT, etc.
   
3. PHYSICS-INFORMED MODELS
   - Incorporate known chemical principles
   - Arrhenius kinetics (already tried, helped)
   - Solvent polarity effects
   - Transition state theory
   
4. MULTI-TASK LEARNING
   - Learn shared representations across targets
   - SM, Product 2, Product 3 are chemically related
   
5. DOMAIN ADAPTATION
   - Explicitly model the distribution shift
   - Adversarial training to learn solvent-invariant features
""")


=== APPROACHES THAT MIGHT WORK ===

1. GRAPH NEURAL NETWORKS (GNN)
   - Learn molecular structure patterns that generalize
   - Can capture solvent-solute interactions
   - Research shows GNNs achieve best results on molecular property prediction
   
2. PRETRAINED MOLECULAR EMBEDDINGS
   - Use embeddings from models trained on millions of molecules
   - Transfer learning from large chemical databases
   - ChemBERTa, MolBERT, etc.
   
3. PHYSICS-INFORMED MODELS
   - Incorporate known chemical principles
   - Arrhenius kinetics (already tried, helped)
   - Solvent polarity effects
   - Transition state theory
   
4. MULTI-TASK LEARNING
   - Learn shared representations across targets
   - SM, Product 2, Product 3 are chemically related
   
5. DOMAIN ADAPTATION
   - Explicitly model the distribution shift
   - Adversarial training to learn solvent-invariant features
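
For the physics-informed option, the Arrhenius relation ln k = ln A - Ea/(RT) makes the log rate constant linear in 1/T, which motivates inverse-temperature and log-time features. The sketch below is an assumption about how such features could be built, not the code used in the earlier Arrhenius experiment; it assumes `Temperature` is in degrees Celsius and `Residence Time` is strictly positive, neither of which this notebook confirms:

```python
import numpy as np

def arrhenius_features(temp_celsius, residence_time):
    """Return columns [1/T in 1/K, ln(residence time)] for each row.

    Assumes temperatures are in degrees Celsius and times are > 0.
    """
    temp_kelvin = np.asarray(temp_celsius, dtype=float) + 273.15
    time = np.asarray(residence_time, dtype=float)
    return np.column_stack([1.0 / temp_kelvin, np.log(time)])

# Hypothetical process conditions, not rows from the dataset
feats = arrhenius_features([25.0, 100.0], [1.0, 10.0])
print(feats)
```

These two columns could be appended to the solvent descriptors so that a model sees temperature and time on the scale where simple kinetics is approximately linear.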



In [None]:
# Check what the GNN experiment (exp_020) achieved
print("\n=== GNN EXPERIMENT ANALYSIS ===")
print("exp_020 (GNN) achieved CV 0.099")
print("This is WORSE than tree-based models (0.0623)")
print("\nPossible reasons:")
print("1. GNN architecture was too simple (3 GCN layers)")
print("2. No pretrained embeddings")
print("3. Insufficient training (only 100 epochs)")
print("4. Wrong hyperparameters")
print("\nWhat to try next:")
print("1. Use a pretrained molecular encoder (e.g., ChemBERTa, which is a SMILES transformer, not a GNN)")
print("2. Graph Attention Networks (GAT) instead of GCN")
print("3. More sophisticated molecular featurization")
print("4. Longer training with proper learning rate schedule")
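
As an illustration of what a "proper learning rate schedule" could mean for item 4, here is a minimal sketch of linear warmup followed by cosine decay. The function name and every hyperparameter value are hypothetical and are not the settings used in exp_020:

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr=1e-3, warmup_steps=10, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# One learning rate per training step for a hypothetical 100-step run
schedule = [warmup_cosine_lr(s, total_steps=100) for s in range(100)]
print(f"first={schedule[0]:.6f}, peak={max(schedule):.6f}, last={schedule[-1]:.6f}")
```

In a training loop the returned value would be assigned to the optimizer's learning rate before each step.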