# Loop 10 Analysis: GroupKFold vs Leave-One-Out

The evaluator identified a CRITICAL insight: the top kernel uses GroupKFold (5-fold) instead of Leave-One-Out.

**Key Questions:**
1. How does GroupKFold change CV estimates?
2. What's the CV-LB correlation with different validation strategies?
3. What's the optimal path forward?

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold
import warnings
warnings.filterwarnings('ignore')

DATA_PATH = '/home/data'
TARGET_LABELS = ["Product 2", "Product 3", "SM"]

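# Load features (X) and yield targets (Y) for one of the two datasets
def load_data(name="full"):
    """Return (X, Y) for the mixed-solvent ("full") or the single-solvent dataset."""
    if name == "full":
        df = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')
        X = df[["Residence Time", "Temperature", "SOLVENT A NAME", "SOLVENT B NAME", "SolventB%"]]
    else:
        # Any other name falls through to the single-solvent file
        df = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
        X = df[["Residence Time", "Temperature", "SOLVENT NAME"]]
    Y = df[TARGET_LABELS]
    return X, Y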

X_single, Y_single = load_data("single_solvent")
X_full, Y_full = load_data("full")

print(f"Single solvent: {len(X_single)} samples, {X_single['SOLVENT NAME'].nunique()} solvents")
print(f"Full data: {len(X_full)} samples, {len(X_full.groupby(['SOLVENT A NAME', 'SOLVENT B NAME']))} ramps")

Single solvent: 656 samples, 24 solvents
Full data: 1227 samples, 13 ramps


In [2]:
# Compare Leave-One-Out vs GroupKFold (5-fold)

# Leave-One-Out: 24 folds for single, 13 folds for full
print("=== Leave-One-Out ===")
print(f"Single solvent: {X_single['SOLVENT NAME'].nunique()} folds")
print(f"Full data: {len(X_full.groupby(['SOLVENT A NAME', 'SOLVENT B NAME']))} folds")

# GroupKFold (5-fold)
print("\n=== GroupKFold (5-fold) ===")
groups_single = X_single["SOLVENT NAME"]
gkf_single = GroupKFold(n_splits=5)
for fold_idx, (train_idx, test_idx) in enumerate(gkf_single.split(X_single, Y_single, groups_single)):
    train_solvents = groups_single.iloc[train_idx].unique()
    test_solvents = groups_single.iloc[test_idx].unique()
    print(f"Fold {fold_idx}: Train {len(train_solvents)} solvents, Test {len(test_solvents)} solvents")
    print(f"  Test solvents: {list(test_solvents)}")

print("\n=== Key Insight ===")
print("LOO: Each fold has 1 solvent in test (4.2% of data)")
print("GroupKFold: Each fold has ~5 solvents in test (20% of data)")
print("GroupKFold gives more realistic CV estimates because test set has multiple unseen solvents")

=== Leave-One-Out ===
Single solvent: 24 folds
Full data: 13 folds

=== GroupKFold (5-fold) ===
Fold 0: Train 19 solvents, Test 5 solvents
  Test solvents: ['IPA [Propan-2-ol]', 'Acetonitrile', 'Diethyl Ether [Ether]', 'THF [Tetrahydrofuran]', 'Methyl Propionate']
Fold 1: Train 20 solvents, Test 4 solvents
  Test solvents: ['2-Methyltetrahydrofuran [2-MeTHF]', 'Cyclohexane', 'Decanol', 'Dihydrolevoglucosenone (Cyrene)']
Fold 2: Train 19 solvents, Test 5 solvents
  Test solvents: ['Methanol', 'Ethylene Glycol [1,2-Ethanediol]', 'Ethanol', 'Ethyl Acetate', 'Ethyl Lactate']
Fold 3: Train 19 solvents, Test 5 solvents
  Test solvents: ['1,1,1,3,3,3-Hexafluoropropan-2-ol', 'Water.2,2,2-Trifluoroethanol', 'DMA [N,N-Dimethylacetamide]', 'MTBE [tert-Butylmethylether]', 'Dimethyl Carbonate']
Fold 4: Train 19 solvents, Test 5 solvents
  Test solvents: ['Water.Acetonitrile', 'Acetonitrile.Acetic Acid', '2,2,2-Trifluoroethanol', 'Butanone [MEK]', 'tert-Butanol [2-Methylpropan-2-ol]']

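=== Key Insight ===
LOO: Each fold has 1 solvent in test (4.2% of data)
GroupKFold: Each fold has ~5 solvents in test (20% of data)
GroupKFold gives more realistic CV estimates because test set has multiple unseen solvents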
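The fold-structure difference can be checked without the real data. The following is a minimal sketch on synthetic groups: the `solvent_i` labels, the per-solvent sample counts, the helper `summarize`, and the use of scikit-learn's `LeaveOneGroupOut` as the leave-one-solvent-out splitter are all assumptions for illustration, not the notebook's actual LOO implementation.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Synthetic stand-in: 24 hypothetical solvents with unequal sample counts.
rng = np.random.default_rng(0)
counts = rng.integers(20, 35, size=24)
groups = np.repeat([f"solvent_{i}" for i in range(24)], counts)
X = np.zeros((len(groups), 1))  # feature values are irrelevant for splitting


def summarize(splitter, X, groups):
    """Return (n_folds, mean test groups per fold, mean test fraction per fold)."""
    n_groups, fractions = [], []
    for _, test_idx in splitter.split(X, groups=groups):
        n_groups.append(len(np.unique(groups[test_idx])))
        fractions.append(len(test_idx) / len(groups))
    return len(n_groups), float(np.mean(n_groups)), float(np.mean(fractions))


logo_summary = summarize(LeaveOneGroupOut(), X, groups)
gkf_summary = summarize(GroupKFold(n_splits=5), X, groups)
print("LeaveOneGroupOut:", logo_summary)
print("GroupKFold(5):   ", gkf_summary)
```

Both splitters keep every solvent entirely on one side of the split; they differ only in how many solvents are held out at once.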

In [3]:
# Analyze CV-LB gap with our submissions
print("=== CV-LB Gap Analysis ===")
print("\nSubmission 1 (exp_004): Per-target HGB+ETR NO TTA")
print(f"  CV (LOO): 0.0623")
print(f"  LB: 0.0956")
print(f"  Gap: +53% (0.0333 absolute)")

print("\nSubmission 2 (exp_006): Intermediate regularization")
print(f"  CV (LOO): 0.0688")
print(f"  LB: 0.0991")
print(f"  Gap: +44% (0.0303 absolute)")

print("\n=== Interpretation ===")
print("Both submissions have ~50% CV-LB gap")
print("This suggests LOO CV is OVERLY OPTIMISTIC")
print("GroupKFold may give more realistic (higher) CV but better CV-LB correlation")
print("\nTarget: 0.01727")
print("Best LB: 0.0956")
print("Gap to target: 5.5x")

=== CV-LB Gap Analysis ===

Submission 1 (exp_004): Per-target HGB+ETR NO TTA
  CV (LOO): 0.0623
  LB: 0.0956
  Gap: +53% (0.0333 absolute)

Submission 2 (exp_006): Intermediate regularization
  CV (LOO): 0.0688
  LB: 0.0991
  Gap: +44% (0.0303 absolute)

=== Interpretation ===
Both submissions have ~50% CV-LB gap
This suggests LOO CV is OVERLY OPTIMISTIC
GroupKFold may give more realistic (higher) CV but better CV-LB correlation

Target: 0.01727
Best LB: 0.0956
Gap to target: 5.5x
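The gap figures above are hard-coded in the print statements. As a small sketch, they can be derived from the scores directly; the helper name `cv_lb_gap` is hypothetical, and the scores are copied from the cell above. It also reports the gap relative to LB, which is the quantity that answers "by how much does LOO underestimate the leaderboard score".

```python
def cv_lb_gap(cv, lb):
    """Return (absolute gap, gap relative to CV, gap relative to LB)."""
    absolute = lb - cv
    return absolute, absolute / cv, absolute / lb


# (CV, LB) pairs copied from the gap analysis above.
submissions = {
    "exp_004": (0.0623, 0.0956),
    "exp_006": (0.0688, 0.0991),
}

for name, (cv, lb) in submissions.items():
    absolute, rel_cv, rel_lb = cv_lb_gap(cv, lb)
    print(f"{name}: gap {absolute:.4f} | LB is {rel_cv:.0%} above CV | CV is {rel_lb:.0%} below LB")

# Ratio of the best LB score to the target score.
gap_to_target = 0.0956 / 0.01727
print(f"Gap to target: {gap_to_target:.1f}x")
```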


In [4]:
# What would GroupKFold CV look like?
# Let's estimate based on the pattern

print("=== Hypothesis: GroupKFold CV Estimates ===")
print("\nIf LOO CV is 0.0623 and LB is 0.0956:")
print("  LB is ~53% higher than LOO CV (LOO underestimates LB by ~35%)")
print("  GroupKFold (5-fold) should give CV closer to LB")
print("  Expected GroupKFold CV: ~0.08-0.10")

print("\n=== Why GroupKFold Matters ===")
print("1. More realistic CV estimates")
print("2. Better CV-LB correlation")
print("3. Faster training (5 folds vs 24/13 folds)")
print("4. More diverse validation sets (20% vs 4%)")

print("\n=== Strategic Implication ===")
print("If we adopt GroupKFold:")
print("  - CV will be HIGHER (worse) but more realistic")
print("  - We can trust CV improvements to translate to LB")
print("  - We can iterate faster (5 folds vs 24/13)")
print("  - We can use CV to guide model selection")

=== Hypothesis: GroupKFold CV Estimates ===

If LOO CV is 0.0623 and LB is 0.0956:
  LB is ~53% higher than LOO CV (LOO underestimates LB by ~35%)
  GroupKFold (5-fold) should give CV closer to LB
  Expected GroupKFold CV: ~0.08-0.10

=== Why GroupKFold Matters ===
1. More realistic CV estimates
2. Better CV-LB correlation
3. Faster training (5 folds vs 24/13 folds)
4. More diverse validation sets (20% vs 4%)

=== Strategic Implication ===
If we adopt GroupKFold:
  - CV will be HIGHER (worse) but more realistic
  - We can trust CV improvements to translate to LB
  - We can iterate faster (5 folds vs 24/13)
  - We can use CV to guide model selection
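The expected GroupKFold CV of ~0.08-0.10 is a hypothesis, not a measurement. One way to test it is to score the same model under both splitters. The sketch below does that on synthetic data; the `Ridge` model, the mean-absolute-error metric, the generated features, and the helper `grouped_cv_mae` are all assumptions for illustration, so the printed numbers say nothing about the catechol data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_groups, per_group = 24, 25
groups = np.repeat(np.arange(n_groups), per_group)

# Each hypothetical solvent gets its own descriptor vector, plus per-sample noise.
group_desc = rng.normal(size=(n_groups, 3))
X = group_desc[groups] + 0.1 * rng.normal(size=(len(groups), 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 0.05 * rng.normal(size=len(groups))


def grouped_cv_mae(model, X, y, groups, splitter):
    """Mean absolute error averaged over the folds produced by a group-aware splitter."""
    scores = cross_val_score(
        model, X, y, groups=groups, cv=splitter, scoring="neg_mean_absolute_error"
    )
    return -scores.mean()


loo_mae = grouped_cv_mae(Ridge(alpha=1.0), X, y, groups, LeaveOneGroupOut())
gkf_mae = grouped_cv_mae(Ridge(alpha=1.0), X, y, groups, GroupKFold(n_splits=5))
print(f"Leave-one-group-out MAE: {loo_mae:.4f}")
print(f"GroupKFold (5-fold) MAE: {gkf_mae:.4f}")
```

Running the real pipeline through the same two splitters would show whether the GroupKFold estimate actually lands closer to the leaderboard score.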


In [5]:
# Analyze what the top kernel does differently
print("=== Top Kernel (lishellliang) Analysis ===")
print("\n1. Validation: GroupKFold (5-fold) instead of LOO")
print("2. Model: MLP + XGBoost + RF + LightGBM ensemble")
print("3. Weights: [0.4, 0.2, 0.2, 0.2] for MLP, XGB, RF, LGBM")
print("4. Features: Spange descriptors only")
print("5. MLP: [128, 64, 32] hidden dims, BatchNorm+ReLU+Dropout(0.1)")
print("6. GBDT: n_estimators=300, max_depth=6/15")
print("7. MLP training: 100 epochs, lr=1e-3, MSELoss (no Sigmoid)")

print("\n=== Our exp_010 vs Top Kernel ===")
print("Same: MLP + XGBoost + RF + LightGBM ensemble")
print("Same: Spange descriptors only")
print("Different: We use LOO, they use GroupKFold")
print("Different: We use Sigmoid output, they don't")
print("Different: We use 200 epochs, they use 100")
print("Different: Our weights [0.35, 0.25, 0.25, 0.15] vs their [0.4, 0.2, 0.2, 0.2]")

print("\n=== Key Insight ===")
print("The MOST IMPORTANT difference is GroupKFold vs LOO")
print("This affects how we evaluate models and select hyperparameters")

=== Top Kernel (lishellliang) Analysis ===

1. Validation: GroupKFold (5-fold) instead of LOO
2. Model: MLP + XGBoost + RF + LightGBM ensemble
3. Weights: [0.4, 0.2, 0.2, 0.2] for MLP, XGB, RF, LGBM
4. Features: Spange descriptors only
5. MLP: [128, 64, 32] hidden dims, BatchNorm+ReLU+Dropout(0.1)
6. GBDT: n_estimators=300, max_depth=6/15
7. MLP training: 100 epochs, lr=1e-3, MSELoss (no Sigmoid)

=== Our exp_010 vs Top Kernel ===
Same: MLP + XGBoost + RF + LightGBM ensemble
Same: Spange descriptors only
Different: We use LOO, they use GroupKFold
Different: We use Sigmoid output, they don't
Different: We use 200 epochs, they use 100
Different: Our weights [0.35, 0.25, 0.25, 0.15] vs their [0.4, 0.2, 0.2, 0.2]

=== Key Insight ===
The MOST IMPORTANT difference is GroupKFold vs LOO
This affects how we evaluate models and select hyperparameters
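The two weight vectors listed above differ only in how the four model outputs are averaged. As a minimal sketch, assuming each model produces an array of shape `(n_samples, n_targets)`, the blend is a weighted mean over the model axis. The helper `blend` and the dummy prediction values are hypothetical; they are not taken from either pipeline.

```python
import numpy as np


def blend(predictions, weights):
    """Weighted average over models; predictions has shape (n_models, n_samples, n_targets)."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if predictions.shape[0] != weights.shape[0]:
        raise ValueError("one weight per model is required")
    # np.average divides by the sum of the weights, so they need not sum to exactly 1.
    return np.average(predictions, axis=0, weights=weights)


# Dummy predictions for MLP, XGB, RF, LGBM: 2 samples x 3 targets each.
preds = np.stack([np.full((2, 3), v) for v in (0.10, 0.20, 0.30, 0.50)])

top_kernel_blend = blend(preds, [0.4, 0.2, 0.2, 0.2])
our_blend = blend(preds, [0.35, 0.25, 0.25, 0.15])
print(top_kernel_blend[0, 0], our_blend[0, 0])
```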


In [6]:
# What should we try next?
print("=== Next Steps ===")
print("\nPriority 1: Adopt GroupKFold (5-fold) validation")
print("  - This is the SINGLE MOST IMPORTANT change")
print("  - Will give more realistic CV estimates")
print("  - Will allow us to trust CV improvements")

print("\nPriority 2: Match top kernel architecture exactly")
print("  - MLP without Sigmoid output")
print("  - 100 epochs, lr=1e-3")
print("  - Weights [0.4, 0.2, 0.2, 0.2]")
print("  - n_estimators=300")

print("\nPriority 3: Submit to verify CV-LB correlation")
print("  - We have 3 submissions remaining")
print("  - Submit GroupKFold model to verify if CV-LB gap is smaller")

print("\n=== CRITICAL: Template Compliance ===")
print("The template uses LOO validation in the last 3 cells")
print("We CANNOT change the validation in the last 3 cells")
print("BUT we can use GroupKFold for model selection/hyperparameter tuning")
print("Then use the best model with LOO for final submission")

print("\n=== Alternative Strategy ===")
print("Since we can't change validation in submission:")
print("1. Use GroupKFold internally for model selection")
print("2. Use LOO for final submission (as required by template)")
print("3. Focus on models that generalize well to unseen solvents")

=== Next Steps ===

Priority 1: Adopt GroupKFold (5-fold) validation
  - This is the SINGLE MOST IMPORTANT change
  - Will give more realistic CV estimates
  - Will allow us to trust CV improvements

Priority 2: Match top kernel architecture exactly
  - MLP without Sigmoid output
  - 100 epochs, lr=1e-3
  - Weights [0.4, 0.2, 0.2, 0.2]
  - n_estimators=300

Priority 3: Submit to verify CV-LB correlation
  - We have 3 submissions remaining
  - Submit GroupKFold model to verify if CV-LB gap is smaller

=== CRITICAL: Template Compliance ===
The template uses LOO validation in the last 3 cells
We CANNOT change the validation in the last 3 cells
BUT we can use GroupKFold for model selection/hyperparameter tuning
Then use the best model with LOO for final submission

=== Alternative Strategy ===
Since we can't change validation in submission:
1. Use GroupKFold internally for model selection
2. Use LOO for final submission (as required by template)
3. Focus on models that generalize well to unseen solvents
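The two-stage strategy above (GroupKFold for model selection, LOO for the final run) can be sketched as follows. This assumes scikit-learn's `GridSearchCV` and `LeaveOneGroupOut` as stand-ins; the synthetic data, the `RandomForestRegressor` model, the `max_depth` grid, and the MAE metric are illustrative choices, not the template's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import (
    GridSearchCV,
    GroupKFold,
    LeaveOneGroupOut,
    cross_val_predict,
)

# Synthetic stand-in: 12 hypothetical solvents, 15 samples each.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(12), 15)
X = rng.normal(size=(len(groups), 4))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=len(groups))

# Stage 1: pick hyperparameters with 5-fold GroupKFold.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid={"max_depth": [3, 6]},
    cv=GroupKFold(n_splits=5),
    scoring="neg_mean_absolute_error",
)
search.fit(X, y, groups=groups)

# Stage 2: run the selected configuration through leave-one-group-out,
# mirroring the validation scheme the submission template requires.
final_model = RandomForestRegressor(
    n_estimators=30, random_state=0, **search.best_params_
)
loo_pred = cross_val_predict(final_model, X, y, groups=groups, cv=LeaveOneGroupOut())
loo_mae = mean_absolute_error(y, loo_pred)

print("Selected params:", search.best_params_)
print(f"Leave-one-group-out MAE: {loo_mae:.4f}")
```

Keeping selection and final evaluation in separate stages leaves the template's last three cells untouched while still letting GroupKFold drive the choice of model.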