# Loop 21 Strategic Analysis

## Key Observations

### Submission History Analysis
| Exp | CV Score | LB Score | CV-LB Gap | Model Type |
|-----|----------|----------|-----------|------------|
| exp_004 | 0.0623 | 0.0956 | +53% | HGB+ETR Per-Target |
| exp_006 | 0.0688 | 0.0991 | +44% | HGB+ETR (regularized) |
| exp_011 | 0.0844 | ERROR | - | GroupKFold Ensemble |
| exp_016 | 0.0623 | 0.0956 | +53% | HGB+ETR Per-Target (replicate) |

### Critical Insights
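1. **CV-LB gap is CONSISTENT at 44-53% (mean ~50%)** - It persists across both scored model variants, which suggests it is data-specific rather than model-specific. Note that exp_016 replicates exp_004, so this rests on two distinct models.
2. **exp_011 (GroupKFold) FAILED** - Submission error, likely due to a fold structure mismatch
3. **Best LB is 0.0956** - About 5.5x the target (0.01727)
4. **exp_021 (regularized) has CV 0.0809** - WORSE than exp_004 (0.0623), so don't submit it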

```python
import pandas as pd

# Scored submissions only (exp_011 errored and has no LB score)
submissions = [
    {'exp': 'exp_004', 'cv': 0.0623, 'lb': 0.0956},
    {'exp': 'exp_006', 'cv': 0.0688, 'lb': 0.0991},
    {'exp': 'exp_016', 'cv': 0.0623, 'lb': 0.0956},
]

df = pd.DataFrame(submissions)
# Relative CV-LB gap: (LB - CV) / CV
df['gap'] = (df['lb'] - df['cv']) / df['cv']

print("Submission Analysis:")
print(df)
print(f"\nAverage CV-LB gap: {df['gap'].mean()*100:.1f}%")
print(f"\nTo reach target LB 0.01727 with 50% gap, need CV: {0.01727/1.5:.4f}")
print(f"To reach target LB 0.01727 with 53% gap, need CV: {0.01727/1.53:.4f}")
```

```
Submission Analysis:
       exp      cv      lb       gap
0  exp_004  0.0623  0.0956  0.534510
1  exp_006  0.0688  0.0991  0.440407
2  exp_016  0.0623  0.0956  0.534510

Average CV-LB gap: 50.3%

To reach target LB 0.01727 with 50% gap, need CV: 0.0115
To reach target LB 0.01727 with 53% gap, need CV: 0.0113
```

## Top Kernel Analysis

### lishellliang/mixall Kernel (8 votes)
**Key Differences:**
1. Uses **GroupKFold (5 splits)** instead of LOO
2. MLP + XGBoost + RF + LightGBM ensemble
3. Optuna hyperparameter optimization
4. Spange descriptors only

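**CRITICAL:** The kernel OVERWRITES the validation split functions so that they use GroupKFold, while keeping the leave-one-out function name!
```python
from sklearn.model_selection import GroupKFold

def generate_leave_one_out_splits(X, Y):
    # Despite the name, this groups rows by solvent and yields 5 folds
    groups = X["SOLVENT NAME"]
    gkf = GroupKFold(n_splits=5)
    for train_idx, test_idx in gkf.split(X, Y, groups):
        yield ...  # yielded values are elided in these notes
```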

### sanidhyavijay24/arrhenius-kinetics-tta Kernel (38 votes, LB 0.09831)
**Key Features:**
1. Arrhenius kinetics features (1/T, ln(t), interaction)
2. TTA for mixed solvents (flip A/B)
3. 7-model bagging
4. HuberLoss instead of MSE
5. Uses LOO validation (standard)

**LB Score: 0.09831** - Similar to, but slightly worse than, our best (0.0956)
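A minimal sketch of two of the ideas listed for this kernel, written from the description above rather than from the kernel's code. The function names, the Celsius input, the choice of `ln(t) / T` as the interaction term, and the `predict(solvent_a, solvent_b, frac_b)` signature are all assumptions made for illustration.

```python
import math

def arrhenius_features(temp_c, time_min):
    """Return [1/T, ln(t), ln(t)/T] for one sample.

    Assumes temperature in degrees Celsius and a strictly positive time.
    """
    inv_t = 1.0 / (temp_c + 273.15)  # 1/T with T in kelvin
    log_time = math.log(time_min)    # ln(t)
    return [inv_t, log_time, inv_t * log_time]

def predict_with_flip_tta(predict, solvent_a, solvent_b, frac_b):
    """Average predictions over both orderings of a binary mixture.

    `predict` is a hypothetical callable returning a list of target values.
    Swapping A and B while using 1 - frac_b describes the same mixture,
    so the two predictions are averaged.
    """
    original = predict(solvent_a, solvent_b, frac_b)
    flipped = predict(solvent_b, solvent_a, 1.0 - frac_b)
    return [(p + q) / 2.0 for p, q in zip(original, flipped)]
```

By construction the averaged prediction is identical for `(A, B, f)` and `(B, A, 1 - f)`, which removes any dependence on the arbitrary solvent ordering.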

## Why exp_011 (GroupKFold) Failed

The submission error most likely occurred because:
1. GroupKFold produces 5 folds instead of 24 (single) / 13 (full)
2. The submission format expects specific fold counts
3. The evaluator could not match predictions to ground truth

**Solution:** We CANNOT use GroupKFold to generate a submission - the submission must keep the LOO fold structure
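A fold-count check run before writing the submission file would catch this kind of mismatch early. The sketch below is pure Python and assumes that the 24 folds of the single task correspond to one held-out solvent each; the helper names and the placeholder solvent labels are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) with one fold per distinct group."""
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield train_idx, test_idx

def check_fold_count(splits, expected):
    """Raise if the number of folds differs from what the evaluator expects."""
    n_folds = len(list(splits))
    if n_folds != expected:
        raise ValueError(f"expected {expected} folds, got {n_folds}")
    return n_folds

# Toy data: 24 placeholder solvent names, 3 rows each.
groups = [f"solvent_{i:02d}" for i in range(24) for _ in range(3)]
n_folds = check_fold_count(leave_one_group_out(groups), expected=24)
```

A 5-fold GroupKFold split passed through `check_fold_count(..., expected=24)` would raise immediately instead of failing at submission time.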

## Strategic Options

### Option A: Accept the 50% Gap and Optimize CV
- Target CV: 0.01727 / 1.5 = 0.0115
- Current best CV: 0.0623
- Gap to close: 0.0623 - 0.0115 = 0.0508 (an ~82% reduction in CV error)
- **Verdict:** Extremely difficult with current approaches

### Option B: Reduce the CV-LB Gap
- If we can reduce gap to 30%, need CV: 0.01727 / 1.3 = 0.0133
- If we can reduce gap to 20%, need CV: 0.01727 / 1.2 = 0.0144
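- **Verdict:** More feasible than Option A, but even at a 20% gap CV must still drop by about 77% from 0.0623, and we have no concrete method for shrinking the gap yet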
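The arithmetic behind Options A and B can be reproduced with a small helper. This sketch assumes the relationship LB = CV x (1 + gap) continues to hold as CV improves, which the three scored submissions do not guarantee; `required_cv` is a name introduced here for illustration.

```python
TARGET_LB = 0.01727

def required_cv(target_lb, relative_gap):
    """CV score needed to hit target_lb if LB = CV * (1 + relative_gap)."""
    return target_lb / (1.0 + relative_gap)

for gap in (0.5, 0.3, 0.2):
    print(f"gap {gap:.0%}: need CV <= {required_cv(TARGET_LB, gap):.4f}")
```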

### Option C: Fundamentally Different Approach
- Pre-trained molecular representations (ChemBERTa, MolBERT)
- Domain adaptation techniques
- Meta-learning for few-shot generalization
- **Verdict:** High risk, high reward

### Option D: Ensemble of Diverse Models
- Combine best models from different experiments
- Use uncertainty weighting
- **Verdict:** Moderate improvement expected
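One possible reading of "uncertainty weighting" is inverse-variance blending, where each model's prediction is weighted by the reciprocal of its estimated variance. This is an assumption about what Option D would look like, not something already implemented in the experiments; the function name and the `eps` guard are illustrative.

```python
def inverse_variance_blend(predictions, variances, eps=1e-12):
    """Blend per-model predictions for one sample, weighting by 1/variance.

    predictions: one predicted value per model
    variances:   one uncertainty estimate per model (e.g. bagging variance)
    """
    weights = [1.0 / (v + eps) for v in variances]  # eps avoids division by zero
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total
```

With equal variances this reduces to a plain average; a model with three times the variance of another receives one third of its weight.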

## Remaining Submissions: 2

### Submission Strategy
1. **DO NOT submit exp_021** - CV 0.0809 is 30% worse than the best CV (0.0623)
2. **Only submit if:**
   - CV < 0.055 (about 12% better than the current best)
   - OR fundamentally different approach with similar CV
3. **Save the last submission for the best candidate**
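The submission rule above can be written as a small gate so that it is applied consistently. The 5% tolerance used to define "similar CV" is an assumption, since the notes do not quantify it, and `should_submit` is a hypothetical helper.

```python
BEST_CV = 0.0623           # exp_004 / exp_016
SUBMIT_THRESHOLD = 0.055   # "significant improvement" cutoff from the rule above

def should_submit(cv, different_approach=False, similar_tol=0.05):
    """Return True if a candidate qualifies for one of the remaining submissions."""
    if cv < SUBMIT_THRESHOLD:
        return True
    # Assumed meaning of "similar CV": within similar_tol (relative) of the best CV
    similar = abs(cv - BEST_CV) / BEST_CV <= similar_tol
    return different_approach and similar
```

Under this rule `should_submit(0.0809)` is rejected, matching the decision on exp_021.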

### What to Try Next
1. **Pre-trained molecular embeddings** - Use ChemBERTa or similar
2. **Solvent similarity features** - Weight predictions by similarity to training solvents
3. **Uncertainty-aware predictions** - Use GP or ensemble variance
4. **Domain adaptation** - Techniques for OOD generalization
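As a starting point for the solvent similarity idea, the sketch below scores how close a held-out solvent is to the training solvents using cosine similarity over descriptor vectors. The choice of cosine similarity and of the maximum as the summary statistic are assumptions; the descriptor vectors would come from whatever featurization is in use (for example the Spange descriptors mentioned earlier).

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def max_train_similarity(test_vec, train_vecs):
    """Highest similarity between a test solvent and any training solvent."""
    return max(cosine_similarity(test_vec, t) for t in train_vecs)
```

A low `max_train_similarity` flags a solvent that lies far from the training set; that value could be used as an extra feature or to down-weight confident predictions on such solvents.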