# Loop 19 Strategic Analysis

## Key Situation
- **Best CV**: 0.0623 (exp_004/016/017; exp_019 is 0.0624)
- **Best LB**: 0.0956 (exp_004/016)
- **Target**: 0.01727 (5.5x away from best LB)
- **Submissions remaining**: 2
- **CV-LB gap**: ~53% (0.0623 → 0.0956)

## Critical Question
Can we reach the target (0.01727) with 2 submissions remaining?

## Analysis Goals
1. Understand why CV-LB gap is so large
2. Identify what approaches could bridge the gap
3. Decide: Submit exp_019 or try something fundamentally different?

In [1]:
import pandas as pd
import numpy as np

# Load data to understand the problem
DATA_PATH = '/home/data'

df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
df_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print("=== DATA OVERVIEW ===")
print(f"Single solvent: {len(df_single)} samples, {df_single['SOLVENT NAME'].nunique()} solvents")
print(f"Full data: {len(df_full)} samples, {df_full[['SOLVENT A NAME', 'SOLVENT B NAME']].drop_duplicates().shape[0]} ramps")

# Check solvent overlap
single_solvents = set(df_single['SOLVENT NAME'].unique())
full_solvents_a = set(df_full['SOLVENT A NAME'].unique())
full_solvents_b = set(df_full['SOLVENT B NAME'].unique())
full_solvents = full_solvents_a | full_solvents_b

print("\n=== SOLVENT ANALYSIS ===")
print(f"Single solvent unique: {len(single_solvents)}")
print(f"Full data unique solvents: {len(full_solvents)}")
print(f"Overlap: {len(single_solvents & full_solvents)}")

=== DATA OVERVIEW ===
Single solvent: 656 samples, 24 solvents
Full data: 1227 samples, 13 ramps

=== SOLVENT ANALYSIS ===
Single solvent unique: 24
Full data unique solvents: 24
Overlap: 24


In [2]:
# Analyze the CV-LB gap
print("=== CV-LB GAP ANALYSIS ===")
print("\nSubmission History:")
print("exp_004: CV 0.0623 → LB 0.0956 (53% worse)")
print("exp_006: CV 0.0689 → LB 0.0991 (44% worse)")
print("exp_016: CV 0.0623 → LB 0.0956 (53% worse)")
print("exp_019: CV 0.0624 → LB ??? (not submitted)")

print("\n=== KEY INSIGHT ===")
print("The CV-LB gap is CONSISTENT (44-53%) across different models.")
print("One plausible cause: the test set contains solvents unlike those in training.")
print("\nTo reach target 0.01727:")
print("  - If CV-LB gap is 50%, need CV = 0.01727 / 1.5 = 0.0115")
print("  - Current best CV is 0.0623, need 5.4x improvement")
print("  - If CV-LB gap is 0%, need CV = 0.01727")
print("  - Current best CV is 0.0623, need 3.6x improvement")

=== CV-LB GAP ANALYSIS ===

Submission History:
exp_004: CV 0.0623 → LB 0.0956 (53% worse)
exp_006: CV 0.0689 → LB 0.0991 (44% worse)
exp_016: CV 0.0623 → LB 0.0956 (53% worse)
exp_019: CV 0.0624 → LB ??? (not submitted)

=== KEY INSIGHT ===
The CV-LB gap is CONSISTENT (44-53%) across different models.
One plausible cause: the test set contains solvents unlike those in training.

To reach target 0.01727:
  - If CV-LB gap is 50%, need CV = 0.01727 / 1.5 = 0.0115
  - Current best CV is 0.0623, need 5.4x improvement
  - If CV-LB gap is 0%, need CV = 0.01727
  - Current best CV is 0.0623, need 3.6x improvement
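
The gap figures printed above are typed in by hand. A minimal sketch of how they could be derived from the submission history instead is given here. The `history` dict and the helper name `required_cv` are illustrative, and the assumption that the LB/CV ratio carries over to future submissions is the hypothesis under discussion, not an established fact.

```python
# Submission history copied from the cell above: experiment -> (CV, LB).
history = {
    "exp_004": (0.0623, 0.0956),
    "exp_006": (0.0689, 0.0991),
    "exp_016": (0.0623, 0.0956),
}
TARGET = 0.01727
BEST_CV = 0.0623

# LB/CV ratio per submission; 1.0 would mean no gap at all.
ratios = {name: lb / cv for name, (cv, lb) in history.items()}
for name, ratio in ratios.items():
    print(f"{name}: LB/CV = {ratio:.3f} ({(ratio - 1) * 100:.0f}% worse)")


def required_cv(target, ratio):
    """CV needed to hit `target` on the LB, assuming LB = ratio * CV."""
    return target / ratio


# Two scenarios from the cell above: a 50% gap and no gap.
for ratio in (1.5, 1.0):
    cv_needed = required_cv(TARGET, ratio)
    print(f"ratio {ratio}: need CV {cv_needed:.4f}, "
          f"{BEST_CV / cv_needed:.1f}x below current best")
```

Keeping the history in one structure means a new submission only needs one added row for the ratios and the required CV to update.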


In [3]:
# Analyze what the top kernels are doing
print("=== TOP KERNEL ANALYSIS ===")
print("\n1. 'mixall' kernel (lishellliang):")
print("   - Uses GroupKFold (5-fold) instead of LOO")
print("   - MLP + XGBoost + RF + LightGBM ensemble")
print("   - Claims 'good CV/LB' but doesn't specify scores")
print("   - Key: OVERWRITES generate_leave_one_out_splits!")

print("\n2. 'System Malfunction V1' kernel (omarafik):")
print("   - Simple MLP with 300 epochs")
print("   - Learning rate scheduling (ReduceLROnPlateau)")
print("   - Gradient clipping (1.0)")
print("   - Uses MSE loss")

print("\n3. 'Arrhenius Kinetics + TTA' kernel:")
print("   - LB score: 0.09831")
print("   - Uses TTA (Test Time Augmentation)")
print("   - Similar to our exp_001")

print("\n=== CONCLUSION ===")
print("Top public kernels achieve LB ~0.09-0.10")
print("Our best LB (0.0956) is COMPETITIVE with public kernels")
print("The target (0.01727) is ~5.5x lower than our best LB and ~5.7x lower than the 0.09831 kernel")
print("This suggests the target requires a FUNDAMENTALLY DIFFERENT approach")

=== TOP KERNEL ANALYSIS ===

1. 'mixall' kernel (lishellliang):
   - Uses GroupKFold (5-fold) instead of LOO
   - MLP + XGBoost + RF + LightGBM ensemble
   - Claims 'good CV/LB' but doesn't specify scores
   - Key: OVERWRITES generate_leave_one_out_splits!

2. 'System Malfunction V1' kernel (omarafik):
   - Simple MLP with 300 epochs
   - Learning rate scheduling (ReduceLROnPlateau)
   - Gradient clipping (1.0)
   - Uses MSE loss

3. 'Arrhenius Kinetics + TTA' kernel:
   - LB score: 0.09831
   - Uses TTA (Test Time Augmentation)
   - Similar to our exp_001

=== CONCLUSION ===
Top public kernels achieve LB ~0.09-0.10
Our best LB (0.0956) is COMPETITIVE with public kernels
The target (0.01727) is ~5.5x lower than our best LB and ~5.7x lower than the 0.09831 kernel
This suggests the target requires a FUNDAMENTALLY DIFFERENT approach
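
The 'mixall' note above contrasts leave-one-out splits with a 5-fold grouped split. A minimal pure-Python sketch of the difference follows, assuming the groups are solvent names. The function names and toy rows are hypothetical. This is neither the competition's `generate_leave_one_out_splits` nor scikit-learn's `GroupKFold`, only the same idea in a few lines.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (train_idx, test_idx), holding out one whole group at a time."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    for g in sorted(by_group):
        test = by_group[g]
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        yield train, test


def grouped_k_fold(groups, k):
    """Yield (train_idx, test_idx) with whole groups assigned to k folds.

    Groups are dealt round-robin in sorted order; a fuller implementation
    would also balance the fold sizes.
    """
    unique = sorted(set(groups))
    if k > len(unique):
        raise ValueError("k cannot exceed the number of groups")
    fold_of = {g: i % k for i, g in enumerate(unique)}
    for fold in range(k):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


# Toy example: 6 rows from 3 hypothetical solvents.
solvents = ["water", "water", "ethanol", "ethanol", "toluene", "toluene"]
loo_splits = list(leave_one_group_out(solvents))
gkf_splits = list(grouped_k_fold(solvents, k=2))
print(len(loo_splits), "leave-one-solvent-out splits")
print(len(gkf_splits), "grouped 2-fold splits")
```

Both schemes keep every solvent entirely on one side of the split, but they hold out different amounts of data per fold, so CV scores from the two are not directly comparable.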


In [4]:
# What approaches haven't been tried?
print("=== APPROACHES NOT YET TRIED ===")
print("\n1. Graph Neural Networks (GNN)")
print("   - Research shows GNN with molecular graphs achieves best results")
print("   - Can encode solvent-solute interactions explicitly")
print("   - Pre-training on large reaction corpora helps")
print("   - MMGNN framework for solvation free energy prediction")

print("\n2. Transformer-based models")
print("   - BERT/RoBERTa on reaction SMILES")
print("   - Pre-trained chemical transformers (ChemBERTa)")
print("   - Can capture long-range dependencies")

print("\n3. Transfer learning")
print("   - Pre-train on large reaction databases (USPTO)")
print("   - Fine-tune on Catechol data")
print("   - Can help with OOD generalization")

print("\n4. Physics-informed features")
print("   - Quantum chemical descriptors")
print("   - Transition state modeling")
print("   - Solvent-solute interaction energies")

print("\n=== CRITICAL INSIGHT ===")
print("The paper arxiv:2512.19530 achieved MSE 0.0039 using GNN")
print("Numerically ~25x lower than our best LB (0.0956), but like-for-like only if the LB metric is MSE")
print("GNN is the ONLY approach here with a reported error below the target value")

=== APPROACHES NOT YET TRIED ===

1. Graph Neural Networks (GNN)
   - Research shows GNN with molecular graphs achieves best results
   - Can encode solvent-solute interactions explicitly
   - Pre-training on large reaction corpora helps
   - MMGNN framework for solvation free energy prediction

2. Transformer-based models
   - BERT/RoBERTa on reaction SMILES
   - Pre-trained chemical transformers (ChemBERTa)
   - Can capture long-range dependencies

3. Transfer learning
   - Pre-train on large reaction databases (USPTO)
   - Fine-tune on Catechol data
   - Can help with OOD generalization

4. Physics-informed features
   - Quantum chemical descriptors
   - Transition state modeling
   - Solvent-solute interaction energies

=== CRITICAL INSIGHT ===
The paper arxiv:2512.19530 achieved MSE 0.0039 using GNN
Numerically ~25x lower than our best LB (0.0956), but like-for-like only if the LB metric is MSE
GNN is the ONLY approach here with a reported error below the target value


In [5]:
# Strategic decision
print("=== STRATEGIC DECISION ===")
print("\nOption A: Submit exp_019")
print("  - CV 0.0624 (same as exp_004/017)")
print("  - Expected LB: ~0.095 (same as exp_004/016)")
print("  - Risk: Uses 1 of 2 remaining submissions for no new information")
print("  - Benefit: MLP component MAY reduce CV-LB gap")

print("\nOption B: Try GNN")
print("  - Only approach with demonstrated target-level performance")
print("  - Complex to implement correctly")
print("  - May not work in remaining time")
print("  - Risk: High complexity, uncertain outcome")

print("\nOption C: Try simpler MLP with better training")
print("  - 'System Malfunction' kernel uses 300 epochs + LR scheduling")
print("  - Our MLP uses 200 epochs, no scheduling")
print("  - Could improve generalization")
print("  - Risk: Incremental improvement, unlikely to reach target")

print("\n=== RECOMMENDATION ===")
print("Given only 2 submissions remaining:")
print("1. DO NOT submit exp_019 - same CV as exp_004, likely same LB")
print("2. Try MLP with better training (300 epochs, LR scheduling)")
print("3. If that doesn't work, try GNN as last resort")
print("4. Save submissions for approaches that have a chance of reaching target")

=== STRATEGIC DECISION ===

Option A: Submit exp_019
  - CV 0.0624 (same as exp_004/017)
  - Expected LB: ~0.095 (same as exp_004/016)
  - Risk: Uses 1 of 2 remaining submissions for no new information
  - Benefit: MLP component MAY reduce CV-LB gap

Option B: Try GNN
  - Only approach with demonstrated target-level performance
  - Complex to implement correctly
  - May not work in remaining time
  - Risk: High complexity, uncertain outcome

Option C: Try simpler MLP with better training
  - 'System Malfunction' kernel uses 300 epochs + LR scheduling
  - Our MLP uses 200 epochs, no scheduling
  - Could improve generalization
  - Risk: Incremental improvement, unlikely to reach target

=== RECOMMENDATION ===
Given only 2 submissions remaining:
1. DO NOT submit exp_019 - same CV as exp_004, likely same LB
2. Try MLP with better training (300 epochs, LR scheduling)
3. If that doesn't work, try GNN as last resort
4. Save submissions for approaches that have a chance of reaching target
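
Option C rests on reduce-on-plateau learning-rate scheduling and gradient clipping. A framework-free sketch of both rules makes the mechanics concrete. It is a simplified stand-in for PyTorch's `ReduceLROnPlateau` and `clip_grad_norm_`, not a reimplementation. The `factor`, `patience`, and loss values are made-up illustration values rather than settings taken from the 'System Malfunction V1' kernel, and clipping by global L2 norm is an assumption about what "Gradient clipping (1.0)" means there.

```python
import math


class PlateauLR:
    """Cut the learning rate by `factor` once more than `patience`
    consecutive epochs pass without a new best validation loss."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale `grads` so their combined L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# Made-up validation losses: improvement stalls after the third epoch.
val_losses = [0.30, 0.20, 0.15, 0.15, 0.16, 0.15, 0.14]
sched = PlateauLR(lr=1e-3)
lrs = [sched.step(v) for v in val_losses]
print(lrs)
print(clip_by_global_norm([3.0, 4.0]))
```

With these toy losses the rate stays at its initial value until three stalled epochs have accumulated, then halves. Extending training from 200 to 300 epochs only helps if the schedule has room to make such cuts.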


In [6]:
# What would it take to reach the target?
print("=== PATH TO TARGET ===")
print("\nTarget: 0.01727 (MAE)")
print("Best LB: 0.0956")
print("Gap: 5.5x")

print("\nTo reach target, we need:")
print("1. MUCH better generalization to unseen solvents")
print("2. Model that captures solvent-solute interactions")
print("3. Possibly physics-informed features")

print("\n=== REALISTIC ASSESSMENT ===")
print("With tree-based models + MLP:")
print("  - Best achievable LB: ~0.09 (based on public kernels)")
print("  - This is 5x away from target")

print("\nWith GNN (based on paper):")
print("  - Achievable: MSE 0.0039 (RMSE ~0.062, an upper bound on MAE)")
print("  - Still ~3.6x away from target if MAE is close to RMSE")

print("\n=== CONCLUSION ===")
print("The target (0.01727) may require:")
print("1. GNN with pre-training on large reaction databases")
print("2. Physics-informed features (quantum descriptors)")
print("3. Ensemble of multiple GNN architectures")
print("4. Possibly data augmentation or transfer learning")
print("\nThis is a SIGNIFICANT undertaking that may not be achievable")
print("in the remaining time with 2 submissions.")

=== PATH TO TARGET ===

Target: 0.01727 (MAE)
Best LB: 0.0956
Gap: 5.5x

To reach target, we need:
1. MUCH better generalization to unseen solvents
2. Model that captures solvent-solute interactions
3. Possibly physics-informed features

=== REALISTIC ASSESSMENT ===
With tree-based models + MLP:
  - Best achievable LB: ~0.09 (based on public kernels)
  - This is 5x away from target

With GNN (based on paper):
  - Achievable: MSE 0.0039 (RMSE ~0.062, an upper bound on MAE)
  - Still ~3.6x away from target if MAE is close to RMSE

=== CONCLUSION ===
The target (0.01727) may require:
1. GNN with pre-training on large reaction databases
2. Physics-informed features (quantum descriptors)
3. Ensemble of multiple GNN architectures
4. Possibly data augmentation or transfer learning

This is a SIGNIFICANT undertaking that may not be achievable
in the remaining time with 2 submissions.
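
The square root of the paper's MSE of 0.0039 is its RMSE, about 0.062. For any set of residuals RMSE is at least as large as MAE, so the paper's MAE could be lower than this figure but cannot be recovered from the MSE alone. A small sketch of the relationship follows. The residual lists are invented purely for illustration and come from neither the paper nor this dataset.

```python
import math


def mae(residuals):
    """Mean absolute error of a list of residuals."""
    return sum(abs(r) for r in residuals) / len(residuals)


def rmse(residuals):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


TARGET = 0.01727
paper_mse = 0.0039  # value quoted in the analysis above
paper_rmse = math.sqrt(paper_mse)
print(f"RMSE implied by MSE {paper_mse}: {paper_rmse:.4f}")
print(f"RMSE / target: {paper_rmse / TARGET:.1f}x")

# Invented residuals: RMSE >= MAE always, with equality only when
# every residual has the same magnitude.
uneven = [0.01, 0.01, 0.01, 0.12]
even = [0.05, -0.05, 0.05, -0.05]
print(f"uneven: MAE {mae(uneven):.4f}, RMSE {rmse(uneven):.4f}")
print(f"even:   MAE {mae(even):.4f}, RMSE {rmse(even):.4f}")
```

The more the errors are concentrated in a few hard solvents, the further MAE falls below RMSE, which is why the distance to an MAE target cannot be read directly off a published MSE.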
