# Loop 41 Analysis: GroupKFold(5) Hypothesis Disproven

**Key Result**: GroupKFold(5) CV (0.009237) is only 1.13x the Leave-One-Out CV (0.008199), i.e. about 13% higher.

This is far short of the 3-5x increase we expected if the CV procedure explained the gap. The CV-LB gap is therefore NOT caused by the CV procedure; it is STRUCTURAL.

**Implications**:
1. The "mixall" kernel's claim of good CV-LB agreement may be misleading
2. The CV-LB gap must come from something other than the CV procedure (model variance, a hidden test set with a different distribution, etc.)
3. We need a fundamentally different approach

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
]

df = pd.DataFrame(submissions)
print("Submission History:")
print(df.to_string(index=False))

In [None]:
# Analyze CV-LB relationship
from scipy import stats

cv_values = df['cv'].values
lb_values = df['lb'].values

slope, intercept, r_value, p_value, std_err = stats.linregress(cv_values, lb_values)

TARGET_LB = 0.0347  # leaderboard score we are trying to reach

print("\nCV-LB Relationship:")
print(f"  LB = {slope:.2f} × CV + {intercept:.4f}")
print(f"  R² = {r_value**2:.4f}")
print(f"  Intercept = {intercept:.4f}")
print(f"  Target = {TARGET_LB}")
print(f"  Gap: Intercept is {intercept/TARGET_LB:.2f}x the target")

# Solve LB = slope * CV + intercept for the CV that would give LB = target
required_cv = (TARGET_LB - intercept) / slope
print("\nTo reach target with current relationship:")
print(f"  Required CV = {required_cv:.6f}")
if required_cv < 0:
    print("  IMPOSSIBLE - would require negative CV!")
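
Because the conclusion above rests on the fitted intercept, it is worth checking how sensitive that intercept is to any single submission. The sketch below is a leave-one-submission-out (jackknife) refit in plain Python. The helper `fit_line` and the choice of a jackknife are illustrative assumptions, not part of the original analysis.

```python
# Jackknife sensitivity check for the CV-LB intercept (illustrative sketch).
# (cv, lb) pairs copied from the submission history above.
history = [
    (0.0111, 0.0982), (0.0123, 0.1065), (0.0105, 0.0972), (0.0104, 0.0969),
    (0.0097, 0.0946), (0.0093, 0.0932), (0.0092, 0.0936), (0.0090, 0.0913),
    (0.0087, 0.0893), (0.0085, 0.0887), (0.0083, 0.0877),
]
TARGET_LB = 0.0347


def fit_line(points):
    """Ordinary least squares fit of lb = slope * cv + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


full_slope, full_intercept = fit_line(history)

# Refit with each submission left out in turn.
jackknife_intercepts = [
    fit_line(history[:i] + history[i + 1:])[1] for i in range(len(history))
]

print(f"Full fit: LB = {full_slope:.2f} * CV + {full_intercept:.4f}")
print(f"Jackknife intercept range: "
      f"{min(jackknife_intercepts):.4f} to {max(jackknife_intercepts):.4f}")
print(f"Every refit stays above target: "
      f"{min(jackknife_intercepts) > TARGET_LB}")
```

If the smallest jackknife intercept were to fall below the target, the "impossible" verdict would hinge on a single submission and should be treated with more caution.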

In [None]:
# GroupKFold(5) experiment results
loo_cv = 0.008199  # Leave-One-Out CV
gkf_cv = 0.009237  # GroupKFold(5) CV
ratio = gkf_cv / loo_cv

print("\n=== GroupKFold(5) Experiment Results ===")
print(f"Leave-One-Out CV: {loo_cv:.6f}")
print(f"GroupKFold(5) CV: {gkf_cv:.6f}")
print(f"Ratio: {ratio:.2f}x")
print("\nExpected ratio if CV-LB gap was due to CV procedure: 3-5x")
print(f"Actual ratio: {ratio:.2f}x")
print("\nCONCLUSION: The CV-LB gap is NOT due to the CV procedure!")
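
For reference, the comparison behind these two numbers can be sketched with scikit-learn's `LeaveOneGroupOut` and `GroupKFold` splitters. Everything below is an assumption made for illustration: the data is synthetic, the 13 groups are hypothetical, `Ridge` stands in for the real ensemble, and these notes do not say whether "Leave-One-Out" means leaving out one group or one sample (the sketch assumes one group).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 13 hypothetical groups with 8 samples each.
n_groups, per_group = 13, 8
groups = np.repeat(np.arange(n_groups), per_group)
X = rng.normal(size=(n_groups * per_group, 5))
# Each group gets its own offset, so unseen groups are harder to predict.
group_offset = rng.normal(scale=0.5, size=n_groups)[groups]
y = X @ rng.normal(size=5) + group_offset + rng.normal(scale=0.1, size=len(groups))

model = Ridge(alpha=1.0)  # placeholder for the real ensemble


def mean_mse(cv):
    """Mean held-out MSE across the folds produced by the splitter `cv`."""
    scores = cross_val_score(
        model, X, y, groups=groups, cv=cv, scoring="neg_mean_squared_error"
    )
    return -scores.mean()


logo_mse = mean_mse(LeaveOneGroupOut())
gkf_mse = mean_mse(GroupKFold(n_splits=5))

print(f"Leave-one-group-out MSE: {logo_mse:.4f}")
print(f"GroupKFold(5) MSE:       {gkf_mse:.4f}")
print(f"Ratio: {gkf_mse / logo_mse:.2f}x")
```

A ratio close to 1, as reported above, indicates that holding out several groups at once is not much harder than holding out one.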

In [None]:
# What does this mean for our strategy?
print("\n=== Strategic Implications ===")
print("\n1. The CV-LB gap is STRUCTURAL, not procedural")
print("   - Changing CV procedure doesn't help")
print("   - The gap persists across splitting schemes, so its cause lies elsewhere")

print("\n2. Possible causes of the structural gap:")
print("   a) Model variance between runs (different random seeds on Kaggle)")
print("   b) Hidden test data with different distribution")
print("   c) Systematic overfitting to training data distribution")
print("   d) Kaggle's evaluation using different metric weighting")

print("\n3. What to try next:")
print("   a) AGGRESSIVE REGULARIZATION - reduce overfitting")
print("   b) SIMPLER MODELS - less variance, better generalization")
print("   c) ENSEMBLE DIVERSITY - reduce variance through diversity")
print("   d) DOMAIN ADAPTATION - address distribution shift")

In [None]:
# Calculate what LB we might get with aggressive regularization
print("\n=== Aggressive Regularization Hypothesis ===")
print("\nIf the CV-LB gap is due to overfitting:")
print("  - Stronger regularization should INCREASE CV (worse locally)")
print("  - But DECREASE LB (better generalization)")
print("  - The LB/CV ratio should decrease")

print("\nCurrent best:")
print(f"  CV: 0.0083, LB: 0.0877")
print(f"  Ratio: {0.0877/0.0083:.1f}x")

print("\nIf regularization works:")
print("  CV might increase to 0.012-0.015")
print("  But LB might decrease to 0.06-0.07")
print(f"  New ratio: {0.065/0.013:.1f}x (better!)")

print("\nTarget: 0.0347")
print("Gap to close: 2.53x (0.0877 → 0.0347)")
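
It helps to put numbers on what this hypothesis requires. Using the fitted line recorded in the summary of these notes (LB = 4.27 × CV + 0.0527), the sketch below predicts the LB that the hypothesized CV range would give if nothing about the relationship changed. The variable names are illustrative.

```python
# Fitted CV-LB line as recorded in the summary of these notes.
SLOPE, INTERCEPT = 4.27, 0.0527

hypothesized_cv = [0.012, 0.015]   # CV range expected after regularization
hoped_lb = [0.06, 0.07]            # LB range hoped for

for cv in hypothesized_cv:
    on_line_lb = SLOPE * cv + INTERCEPT
    print(f"CV {cv:.3f} -> LB {on_line_lb:.4f} if the current line still holds")

worst_hoped = max(hoped_lb)
best_on_line = SLOPE * min(hypothesized_cv) + INTERCEPT
print(f"Hoped-for LB ({worst_hoped:.2f}) is {best_on_line - worst_hoped:.4f} "
      f"below the line's best case ({best_on_line:.4f})")
```

In other words, regularization only helps if it moves submissions off the current line (a lower intercept or a flatter slope). Moving along the line toward higher CV would make LB worse.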

In [None]:
# What regularization changes to try?
print("\n=== Aggressive Regularization Experiment Design ===")
print("\nCurrent model (GP 0.15 + MLP 0.55 + LGBM 0.3):")
print("  MLP: Dropout 0.2, weight decay 1e-4, hidden [128, 128]")
print("  LGBM: max_depth=6, min_child_samples=10, reg_alpha=0.1")
print("  GP: Matern kernel, length_scale=1.0")

print("\nAggressive regularization:")
print("  MLP: Dropout 0.5, weight decay 1e-3, hidden [32, 16]")
print("  LGBM: max_depth=3, min_child_samples=20, reg_alpha=1.0")
print("  GP: Larger length scale (2.0), more noise")

print("\nExpected outcome:")
print("  CV will get WORSE (maybe 0.012-0.015)")
print("  But LB might IMPROVE if gap is due to overfitting")
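
The two settings can be kept side by side as plain dictionaries so the diff is explicit and easy to log. This is only a sketch: the key names are hypothetical and would need to be mapped onto whatever the training code actually accepts, and the GP noise level is left unset because the notes only say "more noise".

```python
# Baseline vs. aggressive regularization, values taken from these notes.
baseline = {
    "mlp": {"dropout": 0.2, "weight_decay": 1e-4, "hidden": [128, 128]},
    # reg_alpha=0.1 is the starting value given in the implementation list.
    "lgbm": {"max_depth": 6, "min_child_samples": 10, "reg_alpha": 0.1},
    "gp": {"kernel": "Matern", "length_scale": 1.0},
}
aggressive = {
    "mlp": {"dropout": 0.5, "weight_decay": 1e-3, "hidden": [32, 16]},
    "lgbm": {"max_depth": 3, "min_child_samples": 20, "reg_alpha": 1.0},
    # Assumes the kernel stays Matern; "more noise" is not quantified, so unset.
    "gp": {"kernel": "Matern", "length_scale": 2.0},
}


def config_diff(old, new):
    """Return {(model, param): (old_value, new_value)} for changed settings."""
    changes = {}
    for model_name, new_params in new.items():
        old_params = old.get(model_name, {})
        for param, value in new_params.items():
            if old_params.get(param) != value:
                changes[(model_name, param)] = (old_params.get(param), value)
    return changes


changes = config_diff(baseline, aggressive)
for (model_name, param), (old_value, new_value) in changes.items():
    print(f"{model_name}.{param}: {old_value} -> {new_value}")
```

Logging this diff alongside each submission makes it easier to attribute any shift in the CV-LB relationship to a specific change.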

In [None]:
# Alternative: Try the exact "mixall" ensemble
print("\n=== Alternative: Exact 'mixall' Ensemble ===")
print("\nThe 'mixall' kernel uses:")
print("  - MLP + XGBoost + RF + LightGBM")
print("  - Optuna-optimized weights")
print("  - Spange features only (no DRFP, no ACS PCA)")

print("\nWe tested XGBoost in exp_039 but it made CV worse.")
print("However, we used different weights and features.")

print("\nTo fully test the 'mixall' approach:")
print("  1. Use Spange features only (18 features)")
print("  2. Use MLP + XGB + RF + LGBM ensemble")
print("  3. Use Optuna to optimize weights")
print("  4. Use GroupKFold(5) CV")
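
Step 3 can be prototyped before wiring in Optuna. The sketch below searches blend weights for four models on synthetic out-of-fold predictions using plain random search over the simplex. This is a deliberately simple stand-in for Optuna, and the data, noise levels, and helper names are all made up for illustration.

```python
import random

random.seed(0)
MODELS = ["mlp", "xgb", "rf", "lgbm"]

# Synthetic targets and out-of-fold predictions (truth plus model-specific noise).
y_true = [random.uniform(0.0, 1.0) for _ in range(200)]
noise_scale = {"mlp": 0.10, "xgb": 0.12, "rf": 0.15, "lgbm": 0.11}
oof = {m: [y + random.gauss(0.0, noise_scale[m]) for y in y_true] for m in MODELS}


def blend_mse(weights):
    """MSE of the weighted average of the out-of-fold predictions."""
    total = 0.0
    for i, y in enumerate(y_true):
        pred = sum(w * oof[m][i] for w, m in zip(weights, MODELS))
        total += (pred - y) ** 2
    return total / len(y_true)


def random_simplex_point(k):
    """Draw non-negative weights that sum to 1 (normalized exponentials)."""
    draws = [random.expovariate(1.0) for _ in range(k)]
    s = sum(draws)
    return [d / s for d in draws]


# Start from the single-model solutions so the blend is never worse than them.
candidates = [[1.0 if j == i else 0.0 for j in range(len(MODELS))]
              for i in range(len(MODELS))]
candidates += [random_simplex_point(len(MODELS)) for _ in range(2000)]

best_weights = min(candidates, key=blend_mse)
best_single = min(blend_mse(c) for c in candidates[:len(MODELS)])

print("Best weights:", {m: round(w, 3) for m, w in zip(MODELS, best_weights)})
print(f"Blend MSE {blend_mse(best_weights):.5f} vs best single {best_single:.5f}")
```

Because the weights are tuned on the same out-of-fold predictions used to score them, the resulting CV is optimistically biased. Tuning the weights within each training fold would limit that.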

In [None]:
# Summary of what we know
print("\n" + "="*60)
print("SUMMARY: What We Know After 41 Experiments")
print("="*60)

print("\n1. BEST RESULTS:")
print(f"   Best CV: 0.008199 (exp_038: GP 0.15 + MLP 0.55 + LGBM 0.3)")
print(f"   Best LB: 0.0877 (exp_030)")
print(f"   Target: 0.0347")
print(f"   Gap: 2.53x")

print("\n2. CV-LB RELATIONSHIP:")
print(f"   LB = 4.27 × CV + 0.0527 (R² = 0.967)")
print(f"   Intercept (0.0527) > Target (0.0347)")
print(f"   Current approach CANNOT reach target!")

print("\n3. GROUPKFOLD(5) HYPOTHESIS: DISPROVEN")
print(f"   GroupKFold CV: 0.009237 (only 1.13x higher than LOO)")
print(f"   The gap is STRUCTURAL, not procedural")

print("\n4. WHAT WORKS:")
print("   - GP + MLP + LGBM ensemble")
print("   - Spange + DRFP + ACS PCA features")
print("   - Arrhenius kinetics features")
print("   - TTA for mixtures")

print("\n5. WHAT DOESN'T WORK:")
print("   - k-NN (222% worse)")
print("   - Deep residual MLP (5x worse)")
print("   - Higher GP weight (10.61% worse)")
print("   - Similarity weighting (169% worse)")
print("   - Feature selection alone (16.83% worse)")
print("   - XGBoost addition (6.51% worse)")

print("\n6. NEXT STEPS:")
print("   PRIORITY 1: Aggressive regularization")
print("   PRIORITY 2: Simpler models (reduce variance)")
print("   PRIORITY 3: Submit to verify new CV-LB relationship")

In [None]:
# Final recommendation
print("\n" + "="*60)
print("RECOMMENDATION FOR NEXT EXPERIMENT")
print("="*60)

print("\n**AGGRESSIVE REGULARIZATION EXPERIMENT**")
print("\nRationale:")
print("  The CV-LB gap is structural (not procedural).")
print("  One candidate cause is overfitting to the training distribution.")
print("  If so, stronger regularization should reduce the gap.")

print("\nImplementation:")
print("  1. Keep GP + MLP + LGBM ensemble")
print("  2. Increase MLP dropout: 0.2 → 0.5")
print("  3. Increase MLP weight decay: 1e-4 → 1e-3")
print("  4. Reduce MLP hidden: [128, 128] → [32, 16]")
print("  5. Reduce LGBM max_depth: 6 → 3")
print("  6. Increase LGBM min_child_samples: 10 → 20")
print("  7. Increase LGBM reg_alpha: 0.1 → 1.0")

print("\nExpected outcome:")
print("  CV will get WORSE (maybe 0.012-0.015)")
print("  But LB might IMPROVE if gap is due to overfitting")
print("  If LB improves, this confirms the overfitting hypothesis")

print("\n**THE TARGET IS REACHABLE, but only if the CV-LB line itself shifts.**")
print("We need a regularization level that lowers the intercept below the 0.0347 target.")
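
To make "if LB improves" concrete once the next submission is scored, the new (CV, LB) point can be compared against the current line. A minimal sketch, assuming the fitted coefficients recorded in the summary and a hypothetical tolerance of 0.003 for normal scatter around the line:

```python
SLOPE, INTERCEPT = 4.27, 0.0527   # fitted CV-LB line from the summary above
TOLERANCE = 0.003                 # hypothetical allowance for scatter around the line


def classify_submission(cv, lb, slope=SLOPE, intercept=INTERCEPT, tol=TOLERANCE):
    """Compare a new (cv, lb) point with the LB the current line predicts."""
    predicted_lb = slope * cv + intercept
    residual = lb - predicted_lb
    if residual < -tol:
        verdict = "below line: consistent with the overfitting hypothesis"
    elif residual > tol:
        verdict = "above line: worse than the current relationship predicts"
    else:
        verdict = "on line: the CV-LB relationship is unchanged"
    return predicted_lb, residual, verdict


# Example with the midpoints of the hoped-for ranges (CV 0.013, LB 0.065).
predicted, residual, verdict = classify_submission(0.013, 0.065)
print(f"Predicted LB {predicted:.4f}, residual {residual:+.4f} -> {verdict}")
```

A single point below the line is suggestive rather than conclusive, so a second regularized submission would help confirm that the relationship has actually shifted.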