# Loop 5 Strategic Analysis

## Situation Summary
- **Target**: 0.023 (lower is better)
- **Best LB**: 0.0972 (exp_003)
- **Best CV**: 0.0105 (exp_003)
- **CV-LB Gap**: LB is ~9x CV on every submission (8.7x to 9.3x)
- **Submissions used**: 3/5 (2 remaining)

## Key Observations
1. The Deep Residual MLP (exp_004) failed badly: CV 0.0519 vs. 0.0105 for the best model (exp_003). The extra complexity hurt.
2. CV improvements translate poorly to LB: from exp_000 to exp_003, CV improved by 5.4% but LB improved by only 1.0%.
3. LB is roughly 9x CV on all three submissions (8.7x to 9.3x).
4. Reaching the 0.023 target requires cutting LB by 76% from the current best of 0.0972.

In [1]:
import pandas as pd
import numpy as np

# Submission history analysis
submissions = [
    {'exp': 'exp_000', 'name': 'MLP Baseline (Spange + Arrhenius)', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'name': 'LightGBM (Spange + Arrhenius)', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'name': 'Combined Spange + DRFP + Arrhenius', 'cv': 0.0105, 'lb': 0.0972},
]

df = pd.DataFrame(submissions)
df['cv_lb_ratio'] = df['lb'] / df['cv']
print('=== SUBMISSION HISTORY ===')
print(df.to_string(index=False))
print(f'\nAverage CV-LB ratio: {df["cv_lb_ratio"].mean():.2f}x')

=== SUBMISSION HISTORY ===
    exp                               name     cv     lb  cv_lb_ratio
exp_000  MLP Baseline (Spange + Arrhenius) 0.0111 0.0982     8.846847
exp_001      LightGBM (Spange + Arrhenius) 0.0123 0.1065     8.658537
exp_003 Combined Spange + DRFP + Arrhenius 0.0105 0.0972     9.257143

Average CV-LB ratio: 8.92x
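
The ratio above assumes LB is proportional to CV. A minimal sketch for checking that assumption against an affine alternative (LB = slope * CV + intercept) follows. The names `k`, `slope`, and `intercept` are illustrative, and with only three points neither fit should be treated as reliable.

```python
import numpy as np

# CV and LB scores of the three submitted experiments (from the table above).
cv = np.array([0.0111, 0.0123, 0.0105])
lb = np.array([0.0982, 0.1065, 0.0972])

# Proportional model: lb ~ k * cv (least squares through the origin).
k = float(cv @ lb / (cv @ cv))

# Affine model: lb ~ slope * cv + intercept (ordinary least squares).
slope, intercept = np.polyfit(cv, lb, 1)

print(f"through-origin fit: k = {k:.2f}")
print(f"affine fit: slope = {slope:.2f}, intercept = {intercept:.4f}")
```

If the affine intercept turns out to be clearly above zero, part of the LB error would not shrink with CV at all, which would make any "required CV" extrapolation based on a pure ratio too optimistic.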


In [2]:
# Target analysis
target = 0.023
best_lb = 0.0972
best_cv = 0.0105
avg_ratio = 8.9  # mean CV-LB ratio from In [1] (8.92x), rounded

print('=== TARGET ANALYSIS ===')
print(f'Target LB: {target}')
print(f'Best LB: {best_lb}')
print(f'Gap to target: {best_lb - target:.4f} ({(best_lb - target) / target * 100:.1f}% above target)')
print(f'Improvement needed: {(best_lb - target) / best_lb * 100:.1f}%')

print('\n=== WHAT CV WOULD WE NEED? ===')
required_cv = target / avg_ratio
print(f'Using avg ratio ({avg_ratio}x): CV = {required_cv:.5f}')
print(f'Current best CV: {best_cv}')
print(f'CV improvement needed: {(best_cv - required_cv) / best_cv * 100:.1f}%')

=== TARGET ANALYSIS ===
Target LB: 0.023
Best LB: 0.0972
Gap to target: 0.0742 (322.6% above target)
Improvement needed: 76.3%

=== WHAT CV WOULD WE NEED? ===
Using avg ratio (8.9x): CV = 0.00258
Current best CV: 0.0105
CV improvement needed: 75.4%
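
The required CV depends on which ratio is used. The sketch below repeats the calculation with each submission's own ratio instead of the rounded mean, to show how sensitive the estimate is. The helper `required_cv` is introduced here for illustration, and it assumes the proportional relation LB = ratio * CV continues to hold at much lower CV.

```python
target_lb = 0.023

# Observed LB/CV ratios from the three submissions.
ratios = {
    "exp_000": 0.0982 / 0.0111,
    "exp_001": 0.1065 / 0.0123,
    "exp_003": 0.0972 / 0.0105,
}


def required_cv(target: float, ratio: float) -> float:
    """CV score needed to reach `target` on the LB if LB = ratio * CV holds."""
    return target / ratio


needed = {exp: required_cv(target_lb, r) for exp, r in ratios.items()}
for exp, cv_needed in needed.items():
    print(f"{exp}: ratio {ratios[exp]:.2f}x -> required CV {cv_needed:.5f}")
```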


In [3]:
# Experiment history
experiments = [
    {'id': 'exp_000', 'name': 'Baseline MLP', 'cv': 0.011081, 'lb': 0.0982, 'status': 'submitted'},
    {'id': 'exp_001', 'name': 'LightGBM', 'cv': 0.012297, 'lb': 0.1065, 'status': 'submitted'},
    {'id': 'exp_002', 'name': 'DRFP MLP with PCA', 'cv': 0.016948, 'lb': None, 'status': 'not submitted'},
    {'id': 'exp_003', 'name': 'Combined Spange+DRFP', 'cv': 0.010501, 'lb': 0.0972, 'status': 'submitted'},
    {'id': 'exp_004', 'name': 'Deep Residual MLP', 'cv': 0.051912, 'lb': None, 'status': 'FAILED'},
]

print('=== EXPERIMENT HISTORY ===')
for exp in experiments:
    status = exp['status']
    lb_str = f"{exp['lb']:.4f}" if exp['lb'] is not None else 'N/A'
    print(f"{exp['id']}: {exp['name']:25s} CV={exp['cv']:.6f} LB={lb_str:8s} [{status}]")

=== EXPERIMENT HISTORY ===
exp_000: Baseline MLP              CV=0.011081 LB=0.0982   [submitted]
exp_001: LightGBM                  CV=0.012297 LB=0.1065   [submitted]
exp_002: DRFP MLP with PCA         CV=0.016948 LB=N/A      [not submitted]
exp_003: Combined Spange+DRFP      CV=0.010501 LB=0.0972   [submitted]
exp_004: Deep Residual MLP         CV=0.051912 LB=N/A      [FAILED]


In [4]:
# What approaches have been tried?
print('=== APPROACHES TRIED ===')
print('\n1. FEATURES:')
print('   ✓ Spange descriptors (13 features) - WORKS WELL')
print('   ✓ DRFP with PCA (100 components) - WORSE than Spange')
print('   ✓ DRFP with variance selection (122 features) - SLIGHT IMPROVEMENT')
print('   ✓ Combined Spange + DRFP - BEST SO FAR')
print('   ✓ Arrhenius kinetics (1/T, ln(t), interaction) - ESSENTIAL')

print('\n2. MODELS:')
print('   ✓ MLP [128, 128, 64] with dropout 0.2 - BASELINE')
print('   ✓ MLP [256, 128, 64] with dropout 0.3 - SLIGHTLY BETTER')
print('   ✓ LightGBM - WORSE than MLP')
print('   ✗ Deep Residual MLP [512, 256, 128, 64] - FAILED BADLY')

print('\n3. TECHNIQUES:')
print('   ✓ Data augmentation (flip A/B for mixtures) - ESSENTIAL')
print('   ✓ TTA (average both orderings) - ESSENTIAL')
print('   ✓ Bagging (3-5 models) - HELPS')
print('   ✓ HuberLoss - HELPS')
print('   ✓ Gradient clipping - HELPS')

=== APPROACHES TRIED ===

1. FEATURES:
   ✓ Spange descriptors (13 features) - WORKS WELL
   ✓ DRFP with PCA (100 components) - WORSE than Spange
   ✓ DRFP with variance selection (122 features) - SLIGHT IMPROVEMENT
   ✓ Combined Spange + DRFP - BEST SO FAR
   ✓ Arrhenius kinetics (1/T, ln(t), interaction) - ESSENTIAL

2. MODELS:
   ✓ MLP [128, 128, 64] with dropout 0.2 - BASELINE
   ✓ MLP [256, 128, 64] with dropout 0.3 - SLIGHTLY BETTER
   ✓ LightGBM - WORSE than MLP
   ✗ Deep Residual MLP [512, 256, 128, 64] - FAILED BADLY

3. TECHNIQUES:
   ✓ Data augmentation (flip A/B for mixtures) - ESSENTIAL
   ✓ TTA (average both orderings) - ESSENTIAL
   ✓ Bagging (3-5 models) - HELPS
   ✓ HuberLoss - HELPS
   ✓ Gradient clipping - HELPS
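
A minimal sketch of the Arrhenius features and the A/B flip handling listed above. Several details are assumptions, not taken from the experiments: temperature is taken to be in Kelvin, the "interaction" term is taken to be the product of 1/T and ln(t), and the function names and the mixture layout (`desc_a`, `desc_b`, `frac_b`) are hypothetical.

```python
import numpy as np


def arrhenius_features(temp_kelvin, time_min):
    """Return columns [1/T, ln(t), ln(t)/T]; assumes T in Kelvin and t > 0."""
    inv_t = 1.0 / np.asarray(temp_kelvin, dtype=float)
    log_time = np.log(np.asarray(time_min, dtype=float))
    return np.column_stack([inv_t, log_time, inv_t * log_time])


def flip_mixture(desc_a, desc_b, frac_b):
    """Swap solvents A and B; the fraction of B becomes 1 - frac_b."""
    return desc_b, desc_a, 1.0 - frac_b


def predict_symmetric(predict_fn, desc_a, desc_b, frac_b):
    """TTA: average the predictions for both orderings of the mixture."""
    forward = predict_fn(desc_a, desc_b, frac_b)
    swapped = predict_fn(*flip_mixture(desc_a, desc_b, frac_b))
    return 0.5 * (forward + swapped)


def toy_predict(desc_a, desc_b, frac_b):
    """Deliberately order-dependent stand-in for a trained model."""
    return (2.0 * desc_a + desc_b).sum(axis=1) * (1.0 + frac_b)


rng = np.random.default_rng(0)
desc_a = rng.normal(size=(4, 13))  # 13 columns, matching the Spange descriptor count
desc_b = rng.normal(size=(4, 13))
frac_b = np.array([0.1, 0.3, 0.5, 0.9])

pred_ab = predict_symmetric(toy_predict, desc_a, desc_b, frac_b)
pred_ba = predict_symmetric(toy_predict, desc_b, desc_a, 1.0 - frac_b)
print(np.allclose(pred_ab, pred_ba))
print(arrhenius_features([300.0, 350.0], [10.0, 60.0]).shape)
```

For training-time augmentation, the same `flip_mixture` call can be applied to each mixture row and the flipped copy appended with an unchanged target.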


In [5]:
# What hasn't been tried?
print('=== APPROACHES NOT YET TRIED ===')
print('\n1. FEATURES:')
print('   - ACS PCA descriptors (alternative to Spange)')
print('   - Fragprints (concatenation of fragments + fingerprints)')
print('   - Per-target feature selection')
print('   - Polynomial features / interactions')

print('\n2. MODELS:')
print('   - Gaussian Processes with domain-specific kernels')
print('   - Per-target models (different model for SM vs Products)')
print('   - Attention mechanisms (without full GNN)')
print('   - Simpler architectures (even smaller networks)')

print('\n3. TECHNIQUES:')
print('   - Larger ensembles (10-20 models with SAME architecture)')
print('   - Different loss functions (MSE, MAE, quantile)')
print('   - Learning rate warmup')
print('   - Stochastic Weight Averaging (SWA)')
print('   - Snapshot ensembles')

=== APPROACHES NOT YET TRIED ===

1. FEATURES:
   - ACS PCA descriptors (alternative to Spange)
   - Fragprints (concatenation of fragments + fingerprints)
   - Per-target feature selection
   - Polynomial features / interactions

2. MODELS:
   - Gaussian Processes with domain-specific kernels
   - Per-target models (different model for SM vs Products)
   - Attention mechanisms (without full GNN)
   - Simpler architectures (even smaller networks)

3. TECHNIQUES:
   - Larger ensembles (10-20 models with SAME architecture)
   - Different loss functions (MSE, MAE, quantile)
   - Learning rate warmup
   - Stochastic Weight Averaging (SWA)
   - Snapshot ensembles
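
For the untried loss functions, a NumPy sketch of MSE, MAE, and quantile (pinball) loss next to the standard Huber form may help when comparing them. The sample arrays and the `delta` and `q` values are illustrative and are not settings from the experiments.

```python
import numpy as np


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    r = np.abs(y_true - y_pred)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return float(np.mean(np.where(r <= delta, quadratic, linear)))


def pinball(y_true, y_pred, q=0.5):
    """Quantile loss; q = 0.5 gives half the MAE."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))


y_true = np.array([0.10, 0.50, 0.90])
y_pred = np.array([0.12, 0.45, 0.40])

print("MSE    ", mse(y_true, y_pred))
print("MAE    ", mae(y_true, y_pred))
print("Huber  ", huber(y_true, y_pred, delta=0.1))
print("Pinball", pinball(y_true, y_pred, q=0.9))
```

When every residual is smaller than `delta`, Huber loss equals half the squared error, so `delta` only changes training if it is set relative to the scale of the targets.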


In [6]:
# Key insight: The CV-LB gap
print('=== CRITICAL INSIGHT: CV-LB GAP ===')
print('\nThe 9x CV-LB gap is the fundamental problem.')
print('\nPossible causes:')
print('1. Model variance: Different random seeds on Kaggle produce different results')
print('2. Distribution shift: Test solvents have different characteristics')
print('3. Overfitting to CV: Model memorizes training patterns that don\'t generalize')
print('\nEvidence:')
print('- LightGBM (deterministic) had WORSE LB than MLP (stochastic)')
print('- This suggests the gap is NOT just about variance')
print('- The gap may be inherent to the leave-one-solvent-out problem')
print('\nImplication:')
print('- Improving local CV may not help much')
print('- Need to focus on approaches that generalize better to unseen solvents')
print('- Or accept that the target (0.023) may be unrealistic for MLP-based approaches')

=== CRITICAL INSIGHT: CV-LB GAP ===

The 9x CV-LB gap is the fundamental problem.

Possible causes:
1. Model variance: Different random seeds on Kaggle produce different results
2. Distribution shift: Test solvents have different characteristics
3. Overfitting to CV: Model memorizes training patterns that don't generalize

Evidence:
- LightGBM (deterministic) had WORSE LB than MLP (stochastic)
- This suggests the gap is NOT just about variance
- The gap may be inherent to the leave-one-solvent-out problem

Implication:
- Improving local CV may not help much
- Need to focus on approaches that generalize better to unseen solvents
- Or accept that the target (0.023) may be unrealistic for MLP-based approaches
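
If the leaderboard does score unseen solvents, local CV should hold out whole solvents as well. The sketch below shows one way to generate such folds. The solvent labels are hypothetical, and nothing here says whether the existing CV already splits this way.

```python
import numpy as np


def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx), one fold per unique group."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test_mask = groups == g
        yield g, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Hypothetical solvent label for each training row.
solvents = np.array(["A", "A", "B", "C", "C", "C"])

folds = list(leave_one_group_out(solvents))
for held_out, train_idx, test_idx in folds:
    # No solvent may appear on both sides of a fold.
    assert not set(solvents[train_idx]) & set(solvents[test_idx])
    print(held_out, train_idx, test_idx)
```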


In [7]:
# What the GNN benchmark achieved
print('=== GNN BENCHMARK REFERENCE ===')
print('\nThe GNN benchmark (arXiv 2512.19530) achieved MSE 0.0039')
print('\nKey differences from our approach:')
print('1. Used Graph Attention Networks with molecular graph message-passing')
print('2. DRFP features 2048-dim (we use 122 high-variance)')
print('3. Mixture-aware continuous solvent representation')
print('4. Transfer learning from larger chemical datasets')
print('\nWhat this tells us:')
print('- The problem IS solvable with much better accuracy')
print('- But requires fundamentally different architecture (GNN)')
print('- Simple MLP may have a ceiling around 0.01 CV / 0.09 LB')

=== GNN BENCHMARK REFERENCE ===

The GNN benchmark (arXiv 2512.19530) achieved MSE 0.0039

Key differences from our approach:
1. Used Graph Attention Networks with molecular graph message-passing
2. DRFP features 2048-dim (we use 122 high-variance)
3. Mixture-aware continuous solvent representation
4. Transfer learning from larger chemical datasets

What this tells us:
- The problem IS solvable with much better accuracy
- But requires fundamentally different architecture (GNN)
- Simple MLP may have a ceiling around 0.01 CV / 0.09 LB


In [8]:
# Strategic options
print('=== STRATEGIC OPTIONS ===')
print('\nOption A: Continue with MLP improvements')
print('  - Try larger ensembles (10-20 models)')
print('  - Try per-target models')
print('  - Try different feature combinations')
print('  - Expected improvement: 5-10% (still far from target)')

print('\nOption B: Implement simplified GNN/attention')
print('  - Add attention layer to MLP')
print('  - Use graph-based features')
print('  - Risk: May not fit competition template')
print('  - Expected improvement: Unknown, high variance')

print('\nOption C: Focus on variance reduction')
print('  - Larger ensembles with same architecture')
print('  - Multiple random seeds, average predictions')
print('  - Snapshot ensembles')
print('  - Expected improvement: May reduce CV-LB gap')

print('\nOption D: Accept current best and optimize submission')
print('  - Our best LB (0.0972) is already competitive')
print('  - Focus on ensuring reproducibility')
print('  - Risk: May not beat target')

=== STRATEGIC OPTIONS ===

Option A: Continue with MLP improvements
  - Try larger ensembles (10-20 models)
  - Try per-target models
  - Try different feature combinations
  - Expected improvement: 5-10% (still far from target)

Option B: Implement simplified GNN/attention
  - Add attention layer to MLP
  - Use graph-based features
  - Risk: May not fit competition template
  - Expected improvement: Unknown, high variance

Option C: Focus on variance reduction
  - Larger ensembles with same architecture
  - Multiple random seeds, average predictions
  - Snapshot ensembles
  - Expected improvement: May reduce CV-LB gap

Option D: Accept current best and optimize submission
  - Our best LB (0.0972) is already competitive
  - Focus on ensuring reproducibility
  - Risk: May not beat target
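
Option C only pays off if seed-to-seed variance is a meaningful share of the error. The toy simulation below uses synthetic numbers, not competition data, to illustrate why: averaging `n_models` predictions shrinks the noise term roughly as 1/n_models but leaves a shared systematic error untouched. The `bias` and `seed_noise` values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_points = 2000, 50
truth = np.zeros(n_points)
bias = 0.05        # systematic error shared by all models (hypothetical)
seed_noise = 0.05  # per-model noise standard deviation (hypothetical)


def ensemble_mse(n_models: int) -> float:
    """MSE of the averaged prediction of `n_models` independently noisy models."""
    # preds has shape (trials, models, points).
    noise = seed_noise * rng.standard_normal((n_trials, n_models, n_points))
    preds = truth + bias + noise
    averaged = preds.mean(axis=1)
    return float(np.mean((averaged - truth) ** 2))


results = {k: ensemble_mse(k) for k in (1, 5, 15)}
for k, value in results.items():
    print(f"{k:2d} models: MSE = {value:.5f} (floor from bias = {bias ** 2:.5f})")
```

In this toy setting the MSE approaches `bias ** 2` and no further, so a larger ensemble would narrow the CV-LB gap only to the extent that the gap is driven by variance.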


In [9]:
# Recommendation
print('=== RECOMMENDATION ===')
print('\nGiven the evaluator feedback and experiment results:')
print('\n1. ABANDON deep/complex architectures - they hurt, not help')
print('\n2. RETURN to the best working approach (exp_003):')
print('   - Combined Spange + DRFP + Arrhenius features')
print('   - Simple MLP [256, 128, 64]')
print('   - 5 models bagged')
print('\n3. TRY variance reduction:')
print('   - Increase ensemble size to 10-15 models (same architecture)')
print('   - Use more epochs (300-400)')
print('   - This may reduce CV-LB gap')
print('\n4. TRY per-target models:')
print('   - Different models for SM vs Product 2 vs Product 3')
print('   - May capture different patterns')
print('\n5. CONSIDER alternative features:')
print('   - ACS PCA descriptors (mentioned in competition data)')
print('   - Weighted combination of Spange + ACS')
print('\n6. SUBMIT to validate:')
print('   - With 2 submissions remaining, use one to test variance reduction')
print('   - Save one for final best model')

=== RECOMMENDATION ===

Given the evaluator feedback and experiment results:

1. ABANDON deep/complex architectures - they hurt, not help

2. RETURN to the best working approach (exp_003):
   - Combined Spange + DRFP + Arrhenius features
   - Simple MLP [256, 128, 64]
   - 5 models bagged

3. TRY variance reduction:
   - Increase ensemble size to 10-15 models (same architecture)
   - Use more epochs (300-400)
   - This may reduce CV-LB gap

4. TRY per-target models:
   - Different models for SM vs Product 2 vs Product 3
   - May capture different patterns

5. CONSIDER alternative features:
   - ACS PCA descriptors (mentioned in competition data)
   - Weighted combination of Spange + ACS

6. SUBMIT to validate:
   - With 2 submissions remaining, use one to test variance reduction
   - Save one for final best model
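
A sketch of the per-target idea in item 4: fit one model per output column and let each pick its own hyperparameter on a validation split. Ridge regression is used here only as a lightweight stand-in for the MLP, the data are synthetic, and all function names are hypothetical.

```python
import numpy as np


def fit_ridge(X, y, lam):
    """Closed-form ridge regression weights (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def fit_per_target(X_tr, Y_tr, X_val, Y_val, lams=(1e-3, 1e-1, 10.0)):
    """Fit one ridge model per target column, choosing lam by validation MSE."""
    models = []
    for j in range(Y_tr.shape[1]):
        scored = []
        for lam in lams:
            w = fit_ridge(X_tr, Y_tr[:, j], lam)
            val_mse = float(np.mean((X_val @ w - Y_val[:, j]) ** 2))
            scored.append((val_mse, lam, w))
        val_mse, lam, w = min(scored, key=lambda s: s[0])
        models.append({"lam": lam, "weights": w, "val_mse": val_mse})
    return models


def predict_per_target(models, X):
    return np.column_stack([X @ m["weights"] for m in models])


# Synthetic data: 3 target columns standing in for SM, Product 2, Product 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
true_w = rng.normal(size=(5, 3))
Y = X @ true_w + 0.1 * rng.normal(size=(80, 3))

models = fit_per_target(X[:60], Y[:60], X[60:], Y[60:])
preds = predict_per_target(models, X[60:])
for name, m in zip(["SM", "Product 2", "Product 3"], models):
    print(f"{name}: lam = {m['lam']}, val MSE = {m['val_mse']:.4f}")
```

With an MLP the same loop would choose architecture or dropout per target rather than `lam`. The validation split should hold out whole solvents, to match the leave-one-solvent-out setting.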
