# Loop 48 Analysis: Critical Strategic Assessment

## Key Discovery: The mixall Kernel Overwrites Validation Functions!

The evaluator correctly identified that the 'mixall' kernel uses **GroupKFold (5 splits)** instead of **Leave-One-Out (24 folds, one per solvent)**. It does this by OVERWRITING the utility functions:

```python
# From mixall kernel:
def generate_leave_one_out_splits(X, Y):
    groups = X["SOLVENT NAME"]
    n_splits = min(5, len(groups.unique()))
    gkf = GroupKFold(n_splits=n_splits)
    ...
```

## Critical Question: Is This Allowed?

The competition rules state:
> "the submission must have the same last three cells as in the notebook template, with the only allowed change being the line where the model is defined"

BUT the mixall kernel overwrites the utility functions BEFORE the last three cells. This is a gray area - they're not changing the last three cells, but they're changing the behavior of functions called by those cells.
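
The mechanism is ordinary Python name resolution: a function name is looked up when it is called, so rebinding it in an earlier cell changes what later, untouched cells do. A minimal sketch, using hypothetical stand-ins (`generate_splits` and `template_final_cell` are illustrative names, not the competition's actual utilities):

```python
# Hypothetical stand-ins for the template utilities; names are illustrative only.

def generate_splits(solvents):
    """Original behavior: one fold per solvent (leave-one-solvent-out)."""
    return [[s] for s in solvents]


def template_final_cell(solvents):
    """Stands in for an untouched 'last three cells' block.

    It looks up `generate_splits` by name at call time, so it uses
    whatever definition is bound to that name when it runs.
    """
    return len(generate_splits(solvents))


solvents = [f"solvent_{i}" for i in range(24)]
folds_before = template_final_cell(solvents)  # uses the original definition


# An earlier notebook cell rebinds the same name to a 5-fold grouping.
def generate_splits(solvents, n_splits=5):
    """Override: distribute solvents round-robin over at most 5 folds."""
    n = min(n_splits, len(solvents))
    return [solvents[i::n] for i in range(n)]


folds_after = template_final_cell(solvents)  # same cell, different behavior

print(f"folds before override: {folds_before}, after override: {folds_after}")
```

The final cell is byte-for-byte identical in both calls, which is why the override sits in a gray area rather than being an obvious rule violation.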

## Why This Matters

If the LB evaluation uses the ORIGINAL `generate_leave_one_out_splits` (24 folds), then:
- Our CV (24 folds) matches LB evaluation
- The mixall kernel's CV (5 folds) does NOT match LB evaluation
- Their good LB score comes from their MODEL, not their CV scheme

If the LB evaluation uses GroupKFold (5 folds), then:
- Our CV (24 folds) does NOT match LB evaluation
- The mixall kernel's CV (5 folds) matches LB evaluation
- We should switch to GroupKFold for better CV-LB correlation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
    {'exp': 'exp_035', 'cv': 0.0098, 'lb': 0.0970},
]

df = pd.DataFrame(submissions)
print('Submission history:')
print(df)
print()

# Linear regression of LB score on CV score
target = 0.0347
slope, intercept, r_value, p_value, std_err = stats.linregress(df['cv'], df['lb'])
print(f'CV-LB Relationship: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'Intercept = {intercept:.4f}')
print(f'Target = {target:.4f}')
print()
# Only flag the problem when the fitted intercept really exceeds the target
if intercept > target:
    print(f'CRITICAL: Intercept ({intercept:.4f}) > Target ({target:.4f})')

In [None]:
# What would it take to reach the target?
target = 0.0347
best_lb = 0.0877
best_cv = 0.0083

print('=== PATH TO TARGET ===')
print()
print(f'Current best LB: {best_lb:.4f}')
print(f'Target: {target:.4f}')
print(f'Gap: {(best_lb - target) / target * 100:.1f}%')
print()
print('Option 1: Improve CV (current approach)')
required_cv = (target - intercept) / slope
print(f'  Required CV: {required_cv:.6f}')
if required_cv < 0:
    print('  IMPOSSIBLE: Required CV is negative!')
print()
print('Option 2: Change the CV-LB relationship')
print(f'  Need to reduce the intercept from {intercept:.4f} to < {target:.4f}')
print('  OR change the slope to make CV improvements more impactful')
print()
print('Option 3: Find a fundamentally different approach')
print('  The GNN benchmark achieved MSE 0.0039')
print('  This is 22x better than our best LB')
print('  There IS a path to much better performance')

## Key Insight: The Validation Scheme Hypothesis

The evaluator has been recommending testing GroupKFold for several loops. Let me analyze why this could help:

### Leave-One-Out (24 folds) vs GroupKFold (5 folds)

**Leave-One-Out (24 folds):**
- Each fold holds out ONE solvent (all samples for that solvent)
- 23 solvents in training, 1 in test
- Very pessimistic: model must predict completely unseen solvent
- High variance: some solvents are easy, some are hard

**GroupKFold (5 folds):**
- Each fold holds out ~5 solvents (all samples for those solvents)
- ~19 solvents in training, ~5 in test
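- Less dominated by any single solvent: each fold score averages over several unseen solvents, though each model trains on fewer solvents (~19 instead of 23)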
- Lower variance: averaging over multiple solvents per fold
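
To make the two fold structures concrete, here is a small sketch using scikit-learn's `LeaveOneGroupOut` and `GroupKFold`. The data is a synthetic stand-in (24 groups of 10 samples each), not the competition data, and it assumes the template's leave-one-out behaves like leave-one-group-out over solvents:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Synthetic stand-in: 24 solvent groups with 10 samples each (not competition data).
n_solvents, samples_per_solvent = 24, 10
groups = np.repeat(np.arange(n_solvents), samples_per_solvent)
X = np.zeros((groups.size, 1))  # feature values are irrelevant for splitting

schemes = {
    "leave-one-solvent-out": LeaveOneGroupOut(),
    "GroupKFold(5)": GroupKFold(n_splits=5),
}

summary = {}
for name, splitter in schemes.items():
    held_out, in_train = [], []
    for train_idx, test_idx in splitter.split(X, groups=groups):
        # Count distinct solvents on each side of the split.
        held_out.append(int(np.unique(groups[test_idx]).size))
        in_train.append(int(np.unique(groups[train_idx]).size))
    summary[name] = {
        "n_folds": len(held_out),
        "held_out": held_out,
        "in_train": in_train,
    }
    print(f"{name}: {len(held_out)} folds, "
          f"held-out solvents per fold {min(held_out)}-{max(held_out)}, "
          f"training solvents per fold {min(in_train)}-{max(in_train)}")
```

Both schemes keep every sample of a solvent on one side of the split, so neither leaks a held-out solvent into training; they differ only in how many solvents are held out at once.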

### Why GroupKFold Might Help

1. **Better CV-LB correlation**: If LB uses a similar scheme, our CV would better predict LB
2. **Different model selection**: Models that perform best under GroupKFold may be different
3. **More stable CV**: Averaging over ~5 solvents per fold reduces fold-to-fold variance

### BUT: The Competition Rules

Read strictly, the rules let us change only the model definition line, so we should not overwrite the utility functions the way mixall does. Even if GroupKFold is the better scheme, we can't use it in our submission.

**HOWEVER:** We CAN use GroupKFold for LOCAL model selection, then submit the best model using the original Leave-One-Out scheme.
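
A sketch of what that local comparison could look like, scoring each candidate under both schemes with scikit-learn's `cross_val_score`. Everything here is an assumption for illustration: the data is synthetic, and the two `Ridge` candidates are placeholders for our real model variants:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 24 solvent groups, 10 samples each, 5 features.
groups = np.repeat(np.arange(24), 10)
X = rng.normal(size=(groups.size, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=groups.size)

# Placeholder candidates; in practice these would be the real model variants.
candidates = {
    "ridge_alpha_0.1": Ridge(alpha=0.1),
    "ridge_alpha_100": Ridge(alpha=100.0),
}
schemes = {"logo_24": LeaveOneGroupOut(), "gkf_5": GroupKFold(n_splits=5)}

scores = {}
for model_name, model in candidates.items():
    for scheme_name, cv in schemes.items():
        neg_mse = cross_val_score(
            model, X, y, groups=groups, cv=cv,
            scoring="neg_mean_squared_error",
        )
        scores[(model_name, scheme_name)] = float(-neg_mse.mean())

# Pick the best candidate separately under each validation scheme.
best_by_scheme = {
    scheme_name: min(candidates, key=lambda m: scores[(m, scheme_name)])
    for scheme_name in schemes
}
for key, mse in sorted(scores.items()):
    print(key, f"MSE={mse:.4f}")
print(best_by_scheme)
```

If the two schemes rank our real candidates differently, that disagreement is itself useful information about which models are sensitive to the validation scheme.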

In [None]:
# Analyze the experiment history
print('=== EXPERIMENT HISTORY ANALYSIS ===')
print()
print('The 17 experiments after exp_030 (exp_031 to exp_047) have ALL been worse than exp_030:')
print()

experiments = [
    ('exp_030', 0.008298, 'GP + MLP + LGBM ensemble (BEST)'),
    ('exp_031', 0.008504, 'Higher GP weight'),
    ('exp_032', 0.039713, 'Pure GP'),
    ('exp_033', 0.010286, 'Ridge regression'),
    ('exp_034', 0.010286, 'Kernel Ridge'),
    ('exp_035', 0.009800, 'Lower GP weight'),
    ('exp_036', 0.008680, 'No GP'),
    ('exp_037', 0.026541, 'Similarity weighting'),
    ('exp_038', 0.009948, 'Minimal features'),
    ('exp_039', 0.080438, 'Learned embeddings'),
    ('exp_040', 0.068767, 'GNN (AttentiveFP)'),
    ('exp_041', 0.010288, 'ChemBERTa'),
    ('exp_042', 0.010008, 'Calibration'),
    ('exp_043', 0.008700, 'Non-linear mixture'),
    ('exp_044', 0.008700, 'Hybrid model'),
    ('exp_045', 0.008840, 'Mean reversion'),
    ('exp_046', 0.008600, 'Adaptive weighting'),
    ('exp_047', 0.009393, 'Diverse ensemble'),
]

for exp, cv, desc in experiments:
    diff = (cv - 0.008298) / 0.008298 * 100
    status = '✓ BEST' if exp == 'exp_030' else f'{diff:+.1f}%'
    print(f'{exp}: CV={cv:.6f} ({status}) - {desc}')

print()
print('CONCLUSION: We are in a LOCAL OPTIMUM.')
print('Incremental changes are not working.')

In [None]:
# What approaches haven't been tried?
print('=== UNEXPLORED APPROACHES ===')
print()
print('1. PROPER GNN IMPLEMENTATION')
print('   - exp_040 was a quick test on a single fold')
print('   - Need proper hyperparameter tuning')
print('   - Need full CV evaluation')
print('   - The GNN benchmark achieved MSE 0.0039 - 22x better!')
print()
print('2. TRANSFER LEARNING / PRE-TRAINING')
print('   - Pre-train on mixture data, fine-tune on single solvents')
print('   - Use auxiliary tasks to improve representations')
print('   - Competition rules allow different hyperparameters for different tasks')
print()
print('3. SOLVENT CLUSTERING + SPECIALIZED MODELS')
print('   - Cluster solvents by Spange descriptors')
print('   - Train specialized models for each cluster')
print('   - For outliers (Cyclohexane, HFIP), use nearest neighbor approach')
print()
print('4. ENSEMBLE OF DIVERSE MODELS WITH DIFFERENT FEATURES')
print('   - exp_047 used simpler features (18 vs 2066)')
print('   - It reduced error on Cyclohexane from 0.198 to 0.014!')
print('   - But overall CV was worse')
print('   - What if we ensemble models with different feature sets?')

## Strategic Recommendation

### The Core Problem

The CV-LB relationship has an intercept (0.0525) that is HIGHER than the target (0.0347). This means:
- Extrapolating the fit to CV = 0 still gives LB ≈ 0.0525
- If this linear relationship holds outside the observed CV range (0.0083 to 0.0123), the target is UNREACHABLE by improving CV alone
- We need to CHANGE the relationship, not just improve CV
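
Since the fit rests on only 12 submissions, it is worth checking how stable the intercept is before leaning on it. The sketch below refits the line with `np.polyfit` and adds a pairs bootstrap over the submission history; the bootstrap is my addition as a rough robustness check, not part of the original analysis:

```python
import numpy as np

# (CV, LB) pairs from the submission history.
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])
target = 0.0347

# np.polyfit returns coefficients from highest degree down: [slope, intercept].
slope, intercept = np.polyfit(cv, lb, deg=1)
required_cv = (target - intercept) / slope  # CV needed to hit the target on this line

# Pairs bootstrap: resample submissions with replacement and refit the line.
rng = np.random.default_rng(0)
boot_intercepts = []
for _ in range(2000):
    idx = rng.integers(0, cv.size, size=cv.size)
    if np.unique(cv[idx]).size < 2:
        continue  # a line cannot be fit to a single distinct CV value
    boot_intercepts.append(np.polyfit(cv[idx], lb[idx], deg=1)[1])

ci_low, ci_high = np.percentile(boot_intercepts, [2.5, 97.5])
frac_above = float(np.mean(np.array(boot_intercepts) > target))

print(f"LB = {slope:.2f} * CV + {intercept:.4f}")
print(f"Required CV for target: {required_cv:.5f}")
print(f"Bootstrap 95% interval for intercept: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"Fraction of resamples with intercept > target: {frac_above:.3f}")
```

A negative `required_cv` restates the problem: on this line, no non-negative CV score maps to the target LB.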

### What Could Change the Relationship?

1. **Different model architecture** - GNN could have a different CV-LB relationship
2. **Different features** - Simpler features might generalize better to LB
3. **Ensemble of diverse models** - Combining models with different strengths

### The Diverse Ensemble Insight

Experiment 047 showed that:
- Simpler features (18 vs 2066) reduced Cyclohexane error from 0.198 to 0.014
- But overall CV was 13.2% worse
- This suggests DRFP features help most solvents but hurt outliers

**What if we ensemble:**
1. Model A: Full features (Spange + DRFP) - good for most solvents
2. Model B: Simple features (Spange only) - good for outliers
3. Blend weights set by each test solvent's similarity to the training set?

### Recommended Next Experiment

**Hybrid Feature Ensemble:**
- Train two models: one with full features, one with simple features
- For each test solvent, compute similarity to training solvents
- If similar (in-distribution): use full-feature model
- If dissimilar (out-of-distribution): use simple-feature model
- This could reduce error on outliers without hurting in-distribution solvents
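
The routing rule above could be sketched as follows. This is one possible reading, under stated assumptions: similarity is measured as Euclidean distance to the nearest training solvent in standardized descriptor space, `threshold` is a tunable cutoff we would have to choose, and `predict_full` / `predict_simple` are hypothetical callables standing in for the two trained models:

```python
import numpy as np


def route_predictions(train_desc, test_desc, predict_full, predict_simple, threshold):
    """Pick a model per test solvent from its distance to the training solvents."""
    # Standardize with training statistics so no descriptor dominates the distance.
    mean = train_desc.mean(axis=0)
    std = train_desc.std(axis=0)
    std[std == 0] = 1.0
    train_z = (train_desc - mean) / std
    test_z = (test_desc - mean) / std

    # Distance from each test solvent to its nearest training solvent.
    diffs = test_z[:, None, :] - train_z[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

    # Far from every training solvent -> treat as out-of-distribution.
    is_ood = nearest > threshold
    preds = np.where(is_ood, predict_simple(test_desc), predict_full(test_desc))
    return preds, is_ood


# Toy usage with synthetic descriptors and constant placeholder models.
rng = np.random.default_rng(0)
train_desc = rng.normal(size=(20, 3))
test_desc = np.vstack([
    train_desc[0],              # identical to a training solvent
    np.full(3, 10.0),           # far from every training solvent
])

preds, is_ood = route_predictions(
    train_desc,
    test_desc,
    predict_full=lambda d: np.ones(len(d)),     # placeholder full-feature model
    predict_simple=lambda d: np.zeros(len(d)),  # placeholder simple-feature model
    threshold=2.0,
)
print(preds, is_ood)
```

A hard cutoff keeps the behavior easy to audit per solvent; a soft blend weighted by the same distance would be the natural variant if the cutoff proves too brittle.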

In [None]:
# Summary of key findings
print('=== SUMMARY OF KEY FINDINGS ===')
print()
print('1. CV-LB RELATIONSHIP:')
print('   - LB = 4.31 * CV + 0.0525 (R² = 0.95)')
print('   - Intercept (0.0525) > Target (0.0347)')
print('   - Current approach CANNOT reach target')
print()
print('2. BEST MODEL:')
print('   - exp_030: GP 0.15 + MLP 0.55 + LGBM 0.3')
print('   - CV = 0.008298, LB = 0.0877')
print('   - 17 subsequent experiments all worse')
print()
print('3. KEY INSIGHTS:')
print('   - Simpler features reduce error on outliers (Cyclohexane: 0.198 -> 0.014)')
print('   - But simpler features hurt overall CV')
print('   - Need to combine strengths of both approaches')
print()
print('4. REMAINING SUBMISSIONS: 5')
print('   - Need to be strategic')
print('   - Should test fundamentally different approaches')
print()
print('5. RECOMMENDED APPROACH:')
print('   - Hybrid Feature Ensemble: full features for in-distribution, simple for OOD')
print('   - OR: Proper GNN implementation (benchmark achieved 0.0039)')
print('   - OR: Solvent clustering + specialized models')