# Loop 37 Analysis: Critical Strategic Assessment

**Situation:**
- 37 experiments completed
- 11 submissions made, 5 remaining
- Best LB: 0.0877 (exp_030)
- Target: 0.0347
- Gap: 2.53x (153% worse)

**Key Research Finding:**
A GNN benchmark paper (arXiv:2512.19530) achieved **MSE of 0.0039** on the Catechol benchmark using:
- Graph Attention Networks (GATs)
- Differential Reaction Fingerprints (DRFP)
- Learned mixture-aware solvent encodings

This is about 22.5x lower than our best LB (0.0877). The target (0.0347) is 8.9x higher than this GNN result.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# All submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
]

df = pd.DataFrame(submissions)
print('All submissions:')
print(df.to_string(index=False))

# Linear regression on CV-LB relationship
cv_vals = df['cv'].values
lb_vals = df['lb'].values

slope, intercept, r_value, p_value, std_err = stats.linregress(cv_vals, lb_vals)

print('\n=== CV-LB Linear Fit ===')
print(f'LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'Intercept: {intercept:.4f}')
print('Target: 0.0347')

All submissions:
    exp     cv     lb
exp_000 0.0111 0.0982
exp_001 0.0123 0.1065
exp_003 0.0105 0.0972
exp_005 0.0104 0.0969
exp_006 0.0097 0.0946
exp_007 0.0093 0.0932
exp_009 0.0092 0.0936
exp_012 0.0090 0.0913
exp_024 0.0087 0.0893
exp_026 0.0085 0.0887
exp_030 0.0083 0.0877

=== CV-LB Linear Fit ===
LB = 4.30 * CV + 0.0524
R² = 0.9675
Intercept: 0.0524
Target: 0.0347


In [2]:
# Key insight: What would it take to reach the target?
TARGET = 0.0347   # target score
BEST_LB = 0.0877  # best leaderboard score so far (exp_030)
GNN_MSE = 0.0039  # MSE reported by the GNN benchmark paper (arXiv:2512.19530)

print('\n=== CRITICAL ANALYSIS ===')
print(f'\nCurrent CV-LB relationship: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'Intercept ({intercept:.4f}) > Target ({TARGET})')
print(f'\nThis means with our current approach, even CV=0 would give LB={intercept:.4f}')
print('\nBUT: The GNN benchmark achieved MSE 0.0039 on this exact dataset!')
print(f'This is {BEST_LB / GNN_MSE:.1f}x better than our best LB ({BEST_LB})')
print(f'\nThe target ({TARGET}) is {TARGET / GNN_MSE:.1f}x worse than the GNN result')
print('This suggests the target is VERY achievable with the right approach!')


=== CRITICAL ANALYSIS ===

Current CV-LB relationship: LB = 4.30 * CV + 0.0524
Intercept (0.0524) > Target (0.0347)

This means with our current approach, even CV=0 would give LB=0.0524

BUT: The GNN benchmark achieved MSE 0.0039 on this exact dataset!
This is 22.5x better than our best LB (0.0877)

The target (0.0347) is 8.9x worse than the GNN result
This suggests the target is VERY achievable with the right approach!
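
The intercept argument can be made concrete by inverting the fitted line to find the CV score that would map onto the target. The sketch below refits the line with `np.polyfit` (used here as a stand-in for the `stats.linregress` call in cell 1) and assumes the linear CV-LB relationship continues to hold outside the observed CV range of 0.0083 to 0.0123, which the 11 submissions cannot confirm.

```python
import numpy as np

# CV and LB scores copied from the submission table in cell 1
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877])
TARGET = 0.0347

# Degree-1 least-squares fit: lb ≈ slope * cv + intercept
slope, intercept = np.polyfit(cv, lb, 1)

# Invert the line to get the CV score that would correspond to the target LB
required_cv = (TARGET - intercept) / slope

print(f'slope={slope:.2f}, intercept={intercept:.4f}')
print(f'CV needed for LB={TARGET}: {required_cv:.4f}')
```

A negative `required_cv` means no model sitting on this line can reach the target by improving CV alone; the line itself has to move, that is, the intercept has to drop.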


In [3]:
# What's different about the GNN approach?
print('\n=== WHAT THE GNN BENCHMARK DID DIFFERENTLY ===')
print('''
1. **Graph Attention Networks (GATs)**: 
   - Message-passing on molecular graphs
   - Learns solvent-reactant interactions
   - Captures structural information that tabular methods miss

2. **Continuous Mixture Encoding**:
   - Learned embeddings for solvent mixtures
   - Not just linear interpolation of features
   - Captures non-linear mixture effects

3. **DRFP Integration**:
   - Differential Reaction Fingerprints
   - We use DRFP but in a tabular way
   - GNN uses it with graph structure

4. **Key Insight**:
   - Our approach treats solvents as feature vectors
   - GNN approach treats solvents as GRAPHS
   - The graph structure captures information we're missing
''')

print('\n=== WHY OUR APPROACH HAS A CEILING ===')
print('''
Our models (MLP, LGBM, GP) all use tabular features:
- Spange descriptors (13 features)
- DRFP (122 features)
- ACS PCA (5 features)
- Arrhenius kinetics (5 features)

These features are FIXED representations of solvents.
They cannot capture:
- Solvent-reactant interactions
- Non-linear mixture effects
- Structural patterns in molecular graphs

The CV-LB gap (intercept 0.0524) represents the information
our features CANNOT capture about unseen solvents.
''')

print('\n=== WHAT WE CAN DO ===')
print('''
1. **We CANNOT implement a full GNN** (too complex, time-limited)

2. **We CAN try to approximate GNN benefits:**
   a. Better mixture encoding (not just linear interpolation)
   b. Per-solvent learned embeddings
   c. Attention mechanisms on features
   d. More sophisticated feature interactions

3. **We CAN try fundamentally different approaches:**
   a. k-NN with learned distance metrics
   b. Prototype-based learning
   c. Meta-learning across solvents
''')


=== WHAT THE GNN BENCHMARK DID DIFFERENTLY ===

1. **Graph Attention Networks (GATs)**: 
   - Message-passing on molecular graphs
   - Learns solvent-reactant interactions
   - Captures structural information that tabular methods miss

2. **Continuous Mixture Encoding**:
   - Learned embeddings for solvent mixtures
   - Not just linear interpolation of features
   - Captures non-linear mixture effects

3. **DRFP Integration**:
   - Differential Reaction Fingerprints
   - We use DRFP but in a tabular way
   - GNN uses it with graph structure

4. **Key Insight**:
   - Our approach treats solvents as feature vectors
   - GNN approach treats solvents as GRAPHS
   - The graph structure captures information we're missing


=== WHY OUR APPROACH HAS A CEILING ===

Our models (MLP, LGBM, GP) all use tabular features:
- Spange descriptors (13 features)
- DRFP (122 features)
- ACS PCA (5 features)
- Arrhenius kinetics (5 features)

These features are FIXED representations of solvents.
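They cannot capture:
- Solvent-reactant interactions
- Non-linear mixture effects
- Structural patterns in molecular graphs

The CV-LB gap (intercept 0.0524) represents the information
our features CANNOT capture about unseen solvents.


=== WHAT WE CAN DO ===

1. **We CANNOT implement a full GNN** (too complex, time-limited)

2. **We CAN try to approximate GNN benefits:**
   a. Better mixture encoding (not just linear interpolation)
   b. Per-solvent learned embeddings
   c. Attention mechanisms on features
   d. More sophisticated feature interactions

3. **We CAN try fundamentally different approaches:**
   a. k-NN with learned distance metrics
   b. Prototype-based learning
   c. Meta-learning across solvents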
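
As a starting point for the k-NN idea listed above, here is a minimal sketch of distance-weighted k-NN regression in NumPy. It uses a fixed standardized Euclidean distance, not a learned metric; a learned metric would replace the standardization step with trained per-feature weights. The function name `knn_predict`, the toy data, and the choice of `k=3` are illustrative assumptions, not taken from any existing experiment.

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k=3, eps=1e-8):
    """Distance-weighted k-NN regression on standardized features."""
    # Standardize with training statistics so no single feature dominates
    mu = train_X.mean(axis=0)
    sigma = train_X.std(axis=0) + eps
    tr = (train_X - mu) / sigma
    qu = (query_X - mu) / sigma

    # Pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(qu[:, None, :] - tr[None, :, :], axis=2)

    # Indices and distances of the k nearest training rows per query
    nn_idx = np.argsort(dists, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dists, nn_idx, axis=1)

    # Inverse-distance weights, normalized to sum to 1 per query
    weights = 1.0 / (nn_dist + eps)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted average of neighbor targets; train_y is (n_train, n_targets)
    return np.einsum('qk,qkt->qt', weights, train_y[nn_idx])

# Toy data (hypothetical): 20 solvents, 5 descriptors, 3 targets
rng = np.random.default_rng(0)
train_X = rng.normal(size=(20, 5))
train_y = rng.uniform(size=(20, 3))

preds = knn_predict(train_X, train_y, train_X[:2], k=3)
print(preds.shape)
```

Because each prediction is a convex combination of training targets, k-NN cannot extrapolate beyond the observed yield range, which is one reason it might show a different CV-LB relationship from the MLP, LGBM, and GP models.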

In [4]:
# What experiments haven't we tried?
print('\n=== UNEXPLORED APPROACHES ===')
print('''
1. **Learned Solvent Embeddings** (NOT TRIED)
   - Instead of fixed features, learn embeddings during training
   - Similar to word embeddings in NLP
   - Could capture solvent-specific patterns

2. **Non-linear Mixture Encoding** (NOT TRIED)
   - Current: mixture_feat = (1-pct)*A + pct*B
   - Try: mixture_feat = f(A, B, pct) where f is learned
   - Could capture non-linear mixture effects

3. **Attention-based Feature Weighting** (TRIED BUT FAILED)
   - exp_017 tried attention but failed
   - Maybe simpler attention could work

4. **k-NN with Solvent Similarity** (NOT TRIED PROPERLY)
   - exp_037 tried similarity weighting but had bugs
   - k-NN might have different CV-LB relationship

5. **Per-Target Specialized Models** (NOT TRIED)
   - SM, Product 2, Product 3 might need different models
   - Could reduce overall error

6. **Prediction Calibration** (NOT TRIED)
   - Scale predictions toward training mean
   - Could reduce systematic bias
''')

print('\n=== PRIORITY RANKING ===')
print('''
1. **HIGHEST PRIORITY: Learned Solvent Embeddings**
   - Most similar to what GNN does
   - Could capture solvent-specific patterns
   - Relatively simple to implement

2. **HIGH PRIORITY: Non-linear Mixture Encoding**
   - Current linear interpolation is too simple
   - Could capture mixture-specific effects

3. **MEDIUM PRIORITY: k-NN with Proper Implementation**
   - Different inductive bias
   - Might have different CV-LB relationship

4. **LOW PRIORITY: Per-Target Models**
   - Incremental improvement
   - Won't change CV-LB relationship
''')


=== UNEXPLORED APPROACHES ===

1. **Learned Solvent Embeddings** (NOT TRIED)
   - Instead of fixed features, learn embeddings during training
   - Similar to word embeddings in NLP
   - Could capture solvent-specific patterns

2. **Non-linear Mixture Encoding** (NOT TRIED)
   - Current: mixture_feat = (1-pct)*A + pct*B
   - Try: mixture_feat = f(A, B, pct) where f is learned
   - Could capture non-linear mixture effects

3. **Attention-based Feature Weighting** (TRIED BUT FAILED)
   - exp_017 tried attention but failed
   - Maybe simpler attention could work

4. **k-NN with Solvent Similarity** (NOT TRIED PROPERLY)
   - exp_037 tried similarity weighting but had bugs
   - k-NN might have different CV-LB relationship

5. **Per-Target Specialized Models** (NOT TRIED)
   - SM, Product 2, Product 3 might need different models
   - Could reduce overall error

6. **Prediction Calibration** (NOT TRIED)
   - Scale predictions toward training mean
   - Could reduce systematic bias


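=== PRIORITY RANKING ===

1. **HIGHEST PRIORITY: Learned Solvent Embeddings**
   - Most similar to what GNN does
   - Could capture solvent-specific patterns
   - Relatively simple to implement

2. **HIGH PRIORITY: Non-linear Mixture Encoding**
   - Current linear interpolation is too simple
   - Could capture mixture-specific effects

3. **MEDIUM PRIORITY: k-NN with Proper Implementation**
   - Different inductive bias
   - Might have different CV-LB relationship

4. **LOW PRIORITY: Per-Target Models**
   - Incremental improvement
   - Won't change CV-LB relationship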
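
The learned-embedding idea ranked first above can be sketched in plain NumPy as a lookup table trained jointly with a prediction head. This is only an illustration under stated assumptions: the data is synthetic, a linear head stands in for the MLP, the gradients are written by hand, and the names `E`, `W`, and `solvent_id` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_solvents, emb_dim, n_targets, n_rows = 6, 4, 3, 200

# Synthetic data: each row is one experiment run in solvent `solvent_id`
solvent_id = rng.integers(0, n_solvents, size=n_rows)
true_effect = rng.normal(size=(n_solvents, n_targets))
y = true_effect[solvent_id] + 0.01 * rng.normal(size=(n_rows, n_targets))

# Trainable parameters: one embedding row per solvent, plus a linear head
E = 0.1 * rng.normal(size=(n_solvents, emb_dim))
W = 0.1 * rng.normal(size=(emb_dim, n_targets))

def mse(E, W):
    """Mean squared error of the embedding model on the toy data."""
    return np.mean((E[solvent_id] @ W - y) ** 2)

loss_before = mse(E, W)
lr = 0.3
for _ in range(2000):
    emb = E[solvent_id]                      # (n_rows, emb_dim) lookup
    err = emb @ W - y                        # (n_rows, n_targets) residuals
    grad_out = 2.0 * err / err.size          # dLoss/dPrediction for mean squared error
    grad_W = emb.T @ grad_out
    grad_emb = grad_out @ W.T
    grad_E = np.zeros_like(E)
    np.add.at(grad_E, solvent_id, grad_emb)  # scatter-add rows that share a solvent
    W -= lr * grad_W
    E -= lr * grad_E
loss_after = mse(E, W)

print(f'MSE before: {loss_before:.4f}, after: {loss_after:.4f}')
```

One limitation to keep in mind: a pure lookup table has no trained row for a solvent that never appears in training, so for unseen solvents the embedding would have to be produced from descriptors (for example, a learned projection of the Spange features) instead of looked up by index.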

In [5]:
# Final recommendation
print('\n' + '='*70)
print('FINAL RECOMMENDATION FOR LOOP 37')
print('='*70)
print(f'''
**SITUATION:**
- 5 submissions remaining
- Best LB: 0.0877 (exp_030)
- Target: 0.0347 (2.53x gap)
- GNN benchmark achieved 0.0039 (22x better than us!)

**KEY INSIGHT:**
The GNN benchmark proves that MSE < 0.01 is achievable on this dataset.
Our CV-LB gap is NOT a fundamental limit - it's a limitation of our approach.

**WHAT TO TRY NEXT:**

1. **Learned Solvent Embeddings + MLP**
   - Create learnable embedding layer for each solvent
   - Train embeddings jointly with MLP
   - This approximates what GNN does with graph structure

2. **Non-linear Mixture Encoding**
   - Instead of linear interpolation, use a small network
   - mixture_feat = MLP([A_feat, B_feat, pct])
   - Captures non-linear mixture effects

3. **Submit exp_035 (Best CV Model)**
   - CV 0.008194 is our best
   - Predicted LB ~0.0876 (marginal improvement)
   - Use this as baseline for comparison

**SUBMISSION STRATEGY:**
- Submit exp_035 first (best CV, verify CV-LB relationship)
- Try learned embeddings approach
- If CV improves significantly, submit that
- Reserve 2-3 submissions for best models
''')

print('\n=== THE TARGET IS REACHABLE ===')
print('''
The GNN benchmark (MSE 0.0039) proves that excellent performance is possible.
Our target (0.0347) is 8.9x worse than the GNN result.
This means the target is VERY conservative and DEFINITELY achievable.

We need to find an approach that:
1. Captures solvent-specific patterns (like GNN does)
2. Handles mixture effects non-linearly
3. Generalizes to unseen solvents

Learned embeddings are our best bet for approximating GNN benefits
without implementing a full graph neural network.
''')


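======================================================================
FINAL RECOMMENDATION FOR LOOP 37
======================================================================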

**SITUATION:**
- 5 submissions remaining
- Best LB: 0.0877 (exp_030)
- Target: 0.0347 (2.53x gap)
- GNN benchmark achieved 0.0039 (22x better than us!)

**KEY INSIGHT:**
The GNN benchmark proves that MSE < 0.01 is achievable on this dataset.
Our CV-LB gap is NOT a fundamental limit - it's a limitation of our approach.

**WHAT TO TRY NEXT:**

1. **Learned Solvent Embeddings + MLP**
   - Create learnable embedding layer for each solvent
   - Train embeddings jointly with MLP
   - This approximates what GNN does with graph structure

2. **Non-linear Mixture Encoding**
   - Instead of linear interpolation, use a small network
   - mixture_feat = MLP([A_feat, B_feat, pct])
   - Captures non-linear mixture effects

3. **Submit exp_035 (Best CV Model)**
   - CV 0.008194 is our best
   - Predicted LB ~0.0876 (marginal improvement)
   - Use this as baseline for comparison

**SUBMISSION STRATEGY:**
- Submit exp_035 first (best CV, verify CV-LB relationship)
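- Try learned embeddings approach
- If CV improves significantly, submit that
- Reserve 2-3 submissions for best models


=== THE TARGET IS REACHABLE ===

The GNN benchmark (MSE 0.0039) proves that excellent performance is possible.
Our target (0.0347) is 8.9x worse than the GNN result.
This means the target is VERY conservative and DEFINITELY achievable.

We need to find an approach that:
1. Captures solvent-specific patterns (like GNN does)
2. Handles mixture effects non-linearly
3. Generalizes to unseen solvents

Learned embeddings are our best bet for approximating GNN benefits
without implementing a full graph neural network.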
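
The `mixture_feat = MLP([A_feat, B_feat, pct])` proposal can be sketched as follows. This version makes one design choice that is an assumption, not part of the recommendation above: the MLP output is added to the existing linear interpolation as a correction gated by `pct * (1 - pct)`, so pure solvents (`pct` of 0 or 1) keep their original features exactly. The weights are random and untrained, and the helper names `init_mlp` and `nonlinear_mix` are hypothetical.

```python
import numpy as np

def linear_mix(a_feat, b_feat, pct):
    """Current encoding: linear interpolation between the two solvents."""
    return (1.0 - pct) * a_feat + pct * b_feat

def init_mlp(feat_dim, hidden_dim, rng):
    """Random (untrained) weights for a one-hidden-layer MLP."""
    in_dim = 2 * feat_dim + 1  # input is [A_feat, B_feat, pct]
    return {
        'W1': rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
        'b1': np.zeros(hidden_dim),
        'W2': rng.normal(scale=0.1, size=(hidden_dim, feat_dim)),
        'b2': np.zeros(feat_dim),
    }

def nonlinear_mix(a_feat, b_feat, pct, params):
    """Linear interpolation plus an MLP correction that vanishes at pct=0 and pct=1."""
    x = np.concatenate([a_feat, b_feat, [pct]])
    hidden = np.tanh(x @ params['W1'] + params['b1'])
    correction = hidden @ params['W2'] + params['b2']
    return linear_mix(a_feat, b_feat, pct) + pct * (1.0 - pct) * correction

# Toy example with 13-dimensional features (the size of the Spange descriptors)
rng = np.random.default_rng(0)
a_feat = rng.normal(size=13)
b_feat = rng.normal(size=13)
params = init_mlp(feat_dim=13, hidden_dim=16, rng=rng)

mixed = nonlinear_mix(a_feat, b_feat, 0.5, params)
print(mixed.shape)
```

In practice the MLP weights would be trained jointly with the downstream model. The gating keeps the encoding consistent with the single-solvent data while still allowing the mid-range mixtures to deviate from a straight line.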