# Loop 25 Analysis: Critical Strategic Assessment

**Situation:**
- 25 experiments completed
- Best LB: 0.0956 (exp_004, exp_016)
- Target: 0.01727 (5.5x gap)
- Only 1 submission remaining
- CV-LB gap: 37-54% across scored submissions (53.5% for the best-CV model)

**Key Questions:**
1. Why did GroupKFold (exp_011) fail?
2. What approaches haven't been tried?
3. What's the path to 0.01727?

In [1]:
import pandas as pd
import numpy as np
import json

# Load session state
with open('/home/code/session_state.json', 'r') as f:
    state = json.load(f)

# Analyze experiments
print('=== EXPERIMENT ANALYSIS ===')
for exp in state['experiments']:
    print(f"{exp['id']}: {exp['name'][:50]} | CV: {exp['score']:.4f}")

print('\n=== SUBMISSIONS ===')
for sub in state['submissions']:
    print(f"{sub['candidate_id']}: CV {sub['cv_score']:.4f} -> LB {sub['lb_score']}")
    # Skip unscored submissions (lb_score missing or empty)
    if not sub['lb_score']:
        continue
    try:
        lb = float(sub['lb_score'])
    except (TypeError, ValueError):
        # lb_score is present but not numeric; no gap to report
        continue
    gap = (lb - sub['cv_score']) / sub['cv_score'] * 100
    print(f"  Gap: {gap:.1f}%")

=== EXPERIMENT ANALYSIS ===
exp_000: 001_baseline_ensemble_arrhenius_tta | CV: 0.0814
exp_001: 002_template_compliant_ensemble | CV: 0.0810
exp_002: 003_simple_rf_regularized | CV: 0.0805
exp_003: 004_per_target_heterogeneous | CV: 0.0813
exp_004: 005_no_tta_per_target | CV: 0.0623
exp_005: 006_regularized_ridge_baseline | CV: 0.0896
exp_006: 007_intermediate_regularization | CV: 0.0688
exp_007: 008_gaussian_process | CV: 0.0721
exp_008: 009_diverse_ensemble | CV: 0.0673
exp_009: exp_010: MLP + GBDT Ensemble (Like Top Kernel) | CV: 0.0669
exp_010: exp_011: GroupKFold + Top Kernel Architecture | CV: 0.0841
exp_011: exp_012: Template-Compliant GroupKFold Ensemble | CV: 0.0844
exp_012: exp_013: LOO Ensemble (Fixed Fold Structure) | CV: 0.0827
exp_013: exp_014: Per-Target + Optuna Optimization | CV: 0.0834
exp_014: exp_015: Per-Target + MLP Hybrid with COMBINED Fea | CV: 0.0891
exp_015: exp_016: Hybrid Model with Task-Specific Configura | CV: 0.0830
exp_016: exp_017: Replicate exp_004's EX

In [2]:
# Analyze why GroupKFold failed
print('=== GROUPKFOLD FAILURE ANALYSIS ===')
print()
print('exp_011 submission format:')
df_failed = pd.read_csv('/home/code/experiments/012_template_compliant/submission.csv')
print(f'Task 0 folds: {sorted(df_failed[df_failed.task==0].fold.unique())}')
print(f'Task 1 folds: {sorted(df_failed[df_failed.task==1].fold.unique())}')
print()
print('exp_004 submission format (working):')
df_working = pd.read_csv('/home/code/experiments/005_no_tta_per_target/submission.csv')
print(f'Task 0 folds: {sorted(df_working[df_working.task==0].fold.unique())}')
print(f'Task 1 folds: {sorted(df_working[df_working.task==1].fold.unique())}')
print()
print('CONCLUSION: GroupKFold changes fold structure from 24/13 to 5/5')
print('This causes submission validation to fail!')
print('We CANNOT use GroupKFold for submission - must use LOO.')

=== GROUPKFOLD FAILURE ANALYSIS ===

exp_011 submission format:
Task 0 folds: [0, 1, 2, 3, 4]
Task 1 folds: [0, 1, 2, 3, 4]

exp_004 submission format (working):
Task 0 folds: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
Task 1 folds: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

CONCLUSION: GroupKFold changes fold structure from 24/13 to 5/5
This causes submission validation to fail!
We CANNOT use GroupKFold for submission - must use LOO.
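
With only one submission left, the fold comparison above can be turned into a check that runs before anything is uploaded. The following is a minimal sketch, assuming the submission has `task` and `fold` columns as in the files read above, and assuming the validator expects exactly the fold counts seen in the working exp_004 file (24 for task 0, 13 for task 1). The names `check_fold_structure` and `EXPECTED_FOLDS` are hypothetical.

```python
# Expected number of distinct folds per task, taken from the working
# exp_004 submission (assumption: this is what the validator requires).
EXPECTED_FOLDS = {0: 24, 1: 13}


def check_fold_structure(rows, expected=EXPECTED_FOLDS):
    """Return a list of problems; an empty list means the structure matches.

    `rows` is any iterable of mappings with 'task' and 'fold' keys, e.g.
    csv.DictReader output or df.to_dict('records').
    """
    folds_by_task = {}
    for row in rows:
        folds_by_task.setdefault(int(row["task"]), set()).add(int(row["fold"]))

    problems = []
    for task, n_expected in expected.items():
        found = folds_by_task.get(task, set())
        # Fold ids must be exactly 0 .. n_expected - 1.
        if found != set(range(n_expected)):
            problems.append(
                f"task {task}: expected folds 0..{n_expected - 1}, "
                f"found {len(found)} distinct fold(s)"
            )
    return problems


# A LOO-style 24/13 layout passes; a 5/5 GroupKFold layout does not.
loo_rows = [{"task": 0, "fold": f} for f in range(24)]
loo_rows += [{"task": 1, "fold": f} for f in range(13)]
gkf_rows = [{"task": t, "fold": f} for t in (0, 1) for f in range(5)]

print(check_fold_structure(loo_rows))
print(check_fold_structure(gkf_rows))
```

For a real file, `check_fold_structure(pd.read_csv(path).to_dict('records'))` would apply the same check to a submission on disk.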


In [3]:
# Analyze what approaches have been tried
print('=== APPROACHES TRIED ===')
approaches = {
    'Tree-based (HGB, ETR, RF, XGB, LGB)': ['exp_003', 'exp_004', 'exp_005', 'exp_006', 'exp_008'],
    'Neural Networks (MLP)': ['exp_000', 'exp_001', 'exp_009', 'exp_010', 'exp_012'],
    'Gaussian Process': ['exp_007'],
    'Ensembles': ['exp_000', 'exp_001', 'exp_008', 'exp_009', 'exp_010'],
    'Per-target models': ['exp_003', 'exp_004', 'exp_005', 'exp_013'],
    'Feature engineering (Arrhenius)': ['exp_000', 'exp_001', 'exp_003', 'exp_004'],
    'TTA for mixed solvents': ['exp_000', 'exp_001', 'exp_003'],
    'No TTA': ['exp_004', 'exp_005'],
    'Regularization': ['exp_005', 'exp_006', 'exp_021'],
    'Morgan fingerprints': ['exp_024'],
    'DRFP features': ['exp_018'],
    'Stacking': ['exp_023'],
    'Similarity weighting': ['exp_022'],
}

for approach, exps in approaches.items():
    print(f'{approach}: {len(exps)} experiments')

=== APPROACHES TRIED ===
Tree-based (HGB, ETR, RF, XGB, LGB): 5 experiments
Neural Networks (MLP): 5 experiments
Gaussian Process: 1 experiments
Ensembles: 5 experiments
Per-target models: 4 experiments
Feature engineering (Arrhenius): 4 experiments
TTA for mixed solvents: 3 experiments
No TTA: 2 experiments
Regularization: 3 experiments
Morgan fingerprints: 1 experiments
DRFP features: 1 experiments
Stacking: 1 experiments
Similarity weighting: 1 experiments


In [4]:
# What HASN'T been tried?
print('=== APPROACHES NOT YET TRIED ===')
print()
print('1. Graph Neural Networks (GNNs)')
print('   - exp_020 attempted but may have had issues')
print('   - Literature shows GNNs achieve MSE 0.0039 on this benchmark')
print()
print('2. Pre-trained molecular embeddings')
print('   - ChemBERTa, MolBERT, etc.')
print('   - Could capture chemical knowledge that generalizes')
print()
print('3. Transfer learning from larger datasets')
print('   - Pre-train on USPTO or similar, fine-tune on Catechol')
print()
print('4. Physics-informed neural networks')
print('   - Encode Arrhenius equation directly in architecture')
print()
print('5. Meta-learning / few-shot approaches')
print('   - Learn to adapt to new solvents quickly')

=== APPROACHES NOT YET TRIED ===

1. Graph Neural Networks (GNNs)
   - exp_020 attempted but may have had issues
   - Literature shows GNNs achieve MSE 0.0039 on this benchmark

2. Pre-trained molecular embeddings
   - ChemBERTa, MolBERT, etc.
   - Could capture chemical knowledge that generalizes

3. Transfer learning from larger datasets
   - Pre-train on USPTO or similar, fine-tune on Catechol

4. Physics-informed neural networks
   - Encode Arrhenius equation directly in architecture

5. Meta-learning / few-shot approaches
   - Learn to adapt to new solvents quickly


In [5]:
# Analyze the CV-LB gap
print('=== CV-LB GAP ANALYSIS ===')
print()
submissions = [
    ('exp_004', 0.0623, 0.0956),
    ('exp_006', 0.0688, 0.0991),
    ('exp_016', 0.0623, 0.0956),
    ('exp_021', 0.0901, 0.1231),
]

for name, cv, lb in submissions:
    gap = (lb - cv) / cv * 100
    print(f'{name}: CV {cv:.4f} -> LB {lb:.4f} (gap: {gap:.1f}%)')

print()
print('OBSERVATION: CV-LB gap is ~50% consistently')
print('This suggests test set has fundamentally different solvents')
print('Our models are overfitting to training solvents')
print()
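target_lb, best_cv, assumed_gap = 0.01727, 0.0623, 0.50
needed_cv = target_lb / (1 + assumed_gap)  # assumes LB = CV * (1 + gap)
print(f'To reach target LB {target_lb}:')
print(f'  If gap is {assumed_gap:.0%}, need CV ~{needed_cv:.4f}')
print(f'  Current best CV: {best_cv:.4f}')
print(f'  Need {(best_cv - needed_cv) / best_cv * 100:.1f}% improvement in CV')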

=== CV-LB GAP ANALYSIS ===

exp_004: CV 0.0623 -> LB 0.0956 (gap: 53.5%)
exp_006: CV 0.0688 -> LB 0.0991 (gap: 44.0%)
exp_016: CV 0.0623 -> LB 0.0956 (gap: 53.5%)
exp_021: CV 0.0901 -> LB 0.1231 (gap: 36.6%)

OBSERVATION: CV-LB gap is ~50% consistently
This suggests test set has fundamentally different solvents
Our models are overfitting to training solvents

To reach target LB 0.01727:
  If gap is 50%, need CV ~0.0115
  Current best CV: 0.0623
  Need 81.5% improvement in CV
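
The required-CV estimate above assumes the gap is multiplicative (LB = CV × 1.5). The four scored submissions can also be read as an additive offset, so it may be worth printing both views side by side. This is a rough sketch using only the numbers listed above. Four points are too few to tell the two readings apart, and the variable names are illustrative.

```python
# (name, CV, LB) for the four scored submissions listed above.
submissions = [
    ("exp_004", 0.0623, 0.0956),
    ("exp_006", 0.0688, 0.0991),
    ("exp_016", 0.0623, 0.0956),
    ("exp_021", 0.0901, 0.1231),
]
TARGET_LB = 0.01727

rel_gaps = [(lb - cv) / cv for _, cv, lb in submissions]  # multiplicative view
abs_gaps = [lb - cv for _, cv, lb in submissions]         # additive view

for (name, cv, lb), rel, ab in zip(submissions, rel_gaps, abs_gaps):
    print(f"{name}: relative gap {rel:.1%}, absolute gap {ab:.4f}")

# Required CV if LB = CV * (1 + gap), for the smallest and largest
# relative gaps observed so far.
for rel in (min(rel_gaps), max(rel_gaps)):
    print(f"relative gap {rel:.1%} -> CV needed {TARGET_LB / (1 + rel):.4f}")

# Mean offset if LB = CV + constant instead.
mean_abs = sum(abs_gaps) / len(abs_gaps)
print(f"mean absolute gap: {mean_abs:.4f}")
```

The two readings imply very different CV requirements, so the 50% figure is best treated as one assumption among several rather than a measured constant.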


In [6]:
# Strategic assessment
print('=== STRATEGIC ASSESSMENT ===')
print()
print('CRITICAL CONSTRAINTS:')
print('1. Only 1 submission remaining')
print('2. Must use LOO validation (24/13 folds) for submission')
print('3. CV-LB gap is ~50% - cannot trust CV alone')
print()
print('BEST PATH FORWARD:')
print('1. Focus on approaches that generalize to unseen solvents')
print('2. Use physics-based features that transfer')
print('3. Consider simpler models with less overfitting')
print('4. Ensemble diverse models for robustness')
print()
print('WHAT NOT TO DO:')
print('1. More complex models (will overfit more)')
print('2. More features (Morgan fingerprints hurt)')
print('3. GroupKFold (breaks submission format)')
print('4. TTA for mixed solvents (hurts performance)')

=== STRATEGIC ASSESSMENT ===

CRITICAL CONSTRAINTS:
1. Only 1 submission remaining
2. Must use LOO validation (24/13 folds) for submission
3. CV-LB gap is ~50% - cannot trust CV alone

BEST PATH FORWARD:
1. Focus on approaches that generalize to unseen solvents
2. Use physics-based features that transfer
3. Consider simpler models with less overfitting
4. Ensemble diverse models for robustness

WHAT NOT TO DO:
1. More complex models (will overfit more)
2. More features (Morgan fingerprints hurt)
3. GroupKFold (breaks submission format)
4. TTA for mixed solvents (hurts performance)
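
Constraint 2 requires one fold per held-out group. The sketch below shows how such folds can be generated so that fold ids run from 0 to n_groups - 1. It assumes each fold holds out one solvent (or solvent pair), which the notebook implies but does not state. The function name and the toy solvent labels are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (fold, train_idx, test_idx), one fold per distinct group.

    Groups are ordered by first appearance, so fold ids are stable
    across runs as long as the row order does not change.
    """
    order = list(dict.fromkeys(groups))  # unique values, insertion-ordered
    for fold, held_out in enumerate(order):
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield fold, train_idx, test_idx


# Toy example with made-up solvent labels, not the real data.
solvents = ["water", "water", "ethanol", "thf", "ethanol", "thf"]
for fold, train_idx, test_idx in leave_one_group_out(solvents):
    print(fold, train_idx, test_idx)
```

scikit-learn's `LeaveOneGroupOut` provides the same kind of split. It may number folds in a different order, so fold ids should be checked against the template either way.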


In [7]:
# Check what the best experiments have in common
print('=== BEST EXPERIMENTS ANALYSIS ===')
print()
best_exps = [
    ('exp_004', 0.0623, 0.0956, 'PerTarget HGB+ETR NO TTA'),
    ('exp_016', 0.0623, 0.0956, 'Replicate exp_004'),
    ('exp_006', 0.0688, 0.0991, 'Intermediate regularization'),
]

print('Common elements in best experiments:')
print('1. Per-target architecture (HGB for SM, ETR for Products)')
print('2. Feature weighting: 0.8 ACS_PCA + 0.2 Spange')
print('3. NO TTA for mixed solvents')
print('4. Arrhenius kinetics features')
print()
print('Key insight: exp_004 and exp_016 are IDENTICAL')
print('Both achieve CV 0.0623, LB 0.0956')
print('This is our best approach so far')

=== BEST EXPERIMENTS ANALYSIS ===

Common elements in best experiments:
1. Per-target architecture (HGB for SM, ETR for Products)
2. Feature weighting: 0.8 ACS_PCA + 0.2 Spange
3. NO TTA for mixed solvents
4. Arrhenius kinetics features

Key insight: exp_004 and exp_016 are IDENTICAL
Both achieve CV 0.0623, LB 0.0956
This is our best approach so far
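
For reference, "Arrhenius kinetics features" usually means transforming temperature and time so that a rate law becomes closer to linear: with k = A·exp(-Ea/(RT)), ln k is linear in 1/T, and for first-order kinetics the conversion depends on k·t, so ln t enters additively. The sketch below illustrates that idea only. The exact features used in exp_004 are not shown in this notebook, and the input units (°C and minutes) and feature names are assumptions.

```python
import math


def arrhenius_features(temp_c, time_min):
    """Return Arrhenius-style features for one reaction condition.

    Assumes temperature in degrees Celsius and a strictly positive
    residence time in minutes (both assumptions, not from the notebook).
    """
    if time_min <= 0:
        raise ValueError("time_min must be positive to take its logarithm")
    temp_k = temp_c + 273.15   # Celsius -> Kelvin
    inv_temp = 1.0 / temp_k    # ln k is linear in 1/T
    log_time = math.log(time_min)
    return {
        "inv_temp": inv_temp,
        "log_time": log_time,
        "inv_temp_x_log_time": inv_temp * log_time,  # simple interaction term
    }


print(arrhenius_features(temp_c=175.0, time_min=2.0))
```

Because these features depend only on reaction conditions and not on solvent identity, they are the kind of input that may transfer to unseen solvents.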
