# Loop 8 Analysis: Strategic Assessment

## Current Situation
- Best CV: 0.0623 (exp_004, Per-Target HGB+ETR NO TTA)
- Best LB: 0.0956 (exp_004; LB is 53.5% worse than CV)
- Target: 0.01727 (best CV is 3.6x the target)
- Submissions: 1/5 used, 4 remaining

## Key Questions
1. Why is there a 53% CV-LB gap?
2. What approaches haven't been tried?
3. What do top kernels do differently?

In [1]:
import pandas as pd

# Experiment results recorded so far (LB exists only for submitted runs)
experiments = [
    {'name': 'exp_000', 'model': 'Ensemble (MLP+XGB+LGB+RF)', 'cv': 0.081393},
    {'name': 'exp_001', 'model': 'Ensemble + Poly', 'cv': 0.081044},
    {'name': 'exp_002', 'model': 'RF Regularized', 'cv': 0.08053},
    {'name': 'exp_003', 'model': 'PerTarget (HGB+ETR)', 'cv': 0.08126},
    {'name': 'exp_004', 'model': 'PerTarget NO TTA', 'cv': 0.062265, 'lb': 0.0956},
    {'name': 'exp_005', 'model': 'Ridge (alpha=10)', 'cv': 0.08964},
    {'name': 'exp_006', 'model': 'PerTarget depth=5/7', 'cv': 0.068848},
    {'name': 'exp_007', 'model': 'GP (Matern)', 'cv': 0.072118},
]

df = pd.DataFrame(experiments)
print("Experiment Summary:")
print(df.to_string(index=False))
print(f"\nBest CV: {df['cv'].min():.4f} ({df.loc[df['cv'].idxmin(), 'name']})")

Experiment Summary:
   name                     model       cv     lb
exp_000 Ensemble (MLP+XGB+LGB+RF) 0.081393    NaN
exp_001           Ensemble + Poly 0.081044    NaN
exp_002            RF Regularized 0.080530    NaN
exp_003       PerTarget (HGB+ETR) 0.081260    NaN
exp_004          PerTarget NO TTA 0.062265 0.0956
exp_005          Ridge (alpha=10) 0.089640    NaN
exp_006       PerTarget depth=5/7 0.068848    NaN
exp_007               GP (Matern) 0.072118    NaN

Best CV: 0.0623 (exp_004)
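Since the summary table holds both scores for the one submitted run, the CV-LB gap can be derived from the table instead of retyping rounded values. This is a minimal sketch: the `scores` frame and the `gap_pct` column are illustrative names, and only three rows are copied from the summary above.

```python
import pandas as pd

# Scores copied from the experiment summary; only exp_004 has an LB score.
scores = pd.DataFrame(
    {
        "name": ["exp_004", "exp_006", "exp_007"],
        "cv": [0.062265, 0.068848, 0.072118],
        "lb": [0.0956, float("nan"), float("nan")],
    }
)

# Relative gap: how much worse the leaderboard score is than the CV score.
scores["gap_pct"] = (scores["lb"] - scores["cv"]) / scores["cv"] * 100

# Keep only runs that were actually submitted.
submitted = scores.dropna(subset=["lb"])
print(submitted.to_string(index=False))
```

Each new submission then adds a row to `submitted`, which makes it easy to see whether the gap shrinks as regularization increases.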


In [2]:
# Analyze CV-LB gap
print("=" * 60)
print("CV-LB GAP ANALYSIS")
print("=" * 60)

best_cv = 0.0623
best_lb = 0.0956
gap = (best_lb - best_cv) / best_cv * 100

print(f"Best CV: {best_cv:.4f}")
print(f"Best LB: {best_lb:.4f}")
print(f"CV-LB Gap: {gap:.1f}%")
print(f"\nThis means our CV is OVERLY OPTIMISTIC by {gap:.1f}%")
print("\nPossible reasons:")
print("1. Leave-one-solvent-out CV doesn't capture true OOD difficulty")
print("2. Test set has more chemically unique solvents")
print("3. Model is overfitting to training solvents")

CV-LB GAP ANALYSIS
Best CV: 0.0623
Best LB: 0.0956
CV-LB Gap: 53.5%

This means our CV is OVERLY OPTIMISTIC by 53.5%

Possible reasons:
1. Leave-one-solvent-out CV doesn't capture true OOD difficulty
2. Test set has more chemically unique solvents
3. Model is overfitting to training solvents
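Reason 1 can be probed by scoring the same model under two grouped splitting schemes and checking which estimate sits closer to the LB. The sketch below is only an illustration on synthetic data: the 12 "solvents", the feature construction, the `ExtraTreesRegressor` settings, and the use of MAE as the metric are all assumptions, not the competition setup.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 12 hypothetical solvents with 20 rows each.
n_groups, rows_per_group = 12, 20
groups = np.repeat(np.arange(n_groups), rows_per_group)

# Each group gets its own descriptor offset, so groups differ systematically.
group_offset = rng.normal(size=(n_groups, 4))
X = group_offset[groups] + 0.3 * rng.normal(size=(groups.size, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=groups.size)

model = ExtraTreesRegressor(n_estimators=100, random_state=0)


def grouped_mae(splitter):
    """Mean absolute error averaged over the folds of a group-aware splitter."""
    scores = cross_val_score(
        model, X, y, groups=groups, cv=splitter,
        scoring="neg_mean_absolute_error",
    )
    return -scores.mean()


mae_logo = grouped_mae(LeaveOneGroupOut())       # one solvent held out per fold
mae_gkf = grouped_mae(GroupKFold(n_splits=4))    # three solvents held out per fold

print(f"Leave-one-group-out MAE: {mae_logo:.4f}")
print(f"GroupKFold (4-fold) MAE: {mae_gkf:.4f}")
```

Holding out several groups at once leaves fewer related solvents in training, so it tends to give a more pessimistic estimate. Whether that estimate tracks the LB better here would have to be checked against real submissions.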


In [3]:
# What approaches haven't been tried?
print("=" * 60)
print("APPROACHES NOT YET TRIED")
print("=" * 60)

approaches_tried = [
    "Ensemble (MLP+XGB+LGB+RF)",
    "Per-target models (HGB+ETR)",
    "TTA (hurt performance)",
    "Strong regularization (Ridge)",
    "Intermediate regularization (depth=5/7)",
    "Gaussian Process (Matern kernel)",
    "Combined features (DRFP-PCA + Spange + ACS_PCA)",
]

approaches_not_tried = [
    "Ensemble of DIVERSE model families (GP + Tree + MLP)",
    "Optuna hyperparameter optimization",
    "Different ensemble weights (learned vs fixed)",
    "Neural network with BatchNorm + Dropout + LR scheduler",
    "Stacking (meta-learner on base model predictions)",
    "Blending (weighted average of diverse models)",
]

print("\nTried:")
for a in approaches_tried:
    print(f"  ✓ {a}")

print("\nNOT Tried:")
for a in approaches_not_tried:
    print(f"  ✗ {a}")

APPROACHES NOT YET TRIED

Tried:
  ✓ Ensemble (MLP+XGB+LGB+RF)
  ✓ Per-target models (HGB+ETR)
  ✓ TTA (hurt performance)
  ✓ Strong regularization (Ridge)
  ✓ Intermediate regularization (depth=5/7)
  ✓ Gaussian Process (Matern kernel)
  ✓ Combined features (DRFP-PCA + Spange + ACS_PCA)

NOT Tried:
  ✗ Ensemble of DIVERSE model families (GP + Tree + MLP)
  ✗ Optuna hyperparameter optimization
  ✗ Different ensemble weights (learned vs fixed)
  ✗ Neural network with BatchNorm + Dropout + LR scheduler
  ✗ Stacking (meta-learner on base model predictions)
  ✗ Blending (weighted average of diverse models)


In [4]:
# Key insight from top kernels
print("=" * 60)
print("INSIGHTS FROM TOP KERNELS")
print("=" * 60)

print("\n1. lishellliang's kernel (good CV/LB):")
print("   - Ensemble: MLP + XGBoost + RF + LightGBM")
print("   - Weighted averaging with learned weights")
print("   - GroupKFold (5-fold) instead of Leave-One-Out")
print("   - Optuna for hyperparameter tuning")
print("   - Uses Spange descriptors")

print("\n2. omarafik's kernel (System Malfunction V1):")
print("   - Simple MLP with BatchNorm + Dropout")
print("   - ReduceLROnPlateau scheduler")
print("   - Uses Spange descriptors")
print("   - 300 epochs training")

print("\n3. Key differences from our approach:")
print("   - We use per-target models (HGB for SM, ETR for Products)")
print("   - They use single ensemble for all targets")
print("   - They use learned ensemble weights")
print("   - They use more sophisticated MLP training")

INSIGHTS FROM TOP KERNELS

1. lishellliang's kernel (good CV/LB):
   - Ensemble: MLP + XGBoost + RF + LightGBM
   - Weighted averaging with learned weights
   - GroupKFold (5-fold) instead of Leave-One-Out
   - Optuna for hyperparameter tuning
   - Uses Spange descriptors

2. omarafik's kernel (System Malfunction V1):
   - Simple MLP with BatchNorm + Dropout
   - ReduceLROnPlateau scheduler
   - Uses Spange descriptors
   - 300 epochs training

3. Key differences from our approach:
   - We use per-target models (HGB for SM, ETR for Products)
   - They use single ensemble for all targets
   - They use learned ensemble weights
   - They use more sophisticated MLP training


In [5]:
# Strategic recommendation
print("=" * 60)
print("STRATEGIC RECOMMENDATION")
print("=" * 60)

print("\n1. SUBMIT exp_006 (intermediate regularization, CV 0.0688)")
print("   - Tests hypothesis that regularization reduces CV-LB gap")
print("   - If LB improves: continue with regularization")
print("   - If LB doesn't improve: pivot to ensemble diversity")

print("\n2. TRY: Ensemble of diverse model families")
print("   - Combine GP + ETR + MLP predictions")
print("   - Weighted average: 0.3*GP + 0.4*ETR + 0.3*MLP")
print("   - May reduce variance and improve generalization")

print("\n3. TRY: Stacking with meta-learner")
print("   - Train base models (GP, ETR, MLP) on folds")
print("   - Train meta-learner (Ridge) on base predictions")
print("   - May capture complementary information")

print("\n4. TRY: Optuna hyperparameter optimization")
print("   - Systematic search for optimal hyperparameters")
print("   - May find better configurations than manual tuning")

STRATEGIC RECOMMENDATION

1. SUBMIT exp_006 (intermediate regularization, CV 0.0688)
   - Tests hypothesis that regularization reduces CV-LB gap
   - If LB improves: continue with regularization
   - If LB doesn't improve: pivot to ensemble diversity

2. TRY: Ensemble of diverse model families
   - Combine GP + ETR + MLP predictions
   - Weighted average: 0.3*GP + 0.4*ETR + 0.3*MLP
   - May reduce variance and improve generalization

3. TRY: Stacking with meta-learner
   - Train base models (GP, ETR, MLP) on folds
   - Train meta-learner (Ridge) on base predictions
   - May capture complementary information

4. TRY: Optuna hyperparameter optimization
   - Systematic search for optimal hyperparameters
   - May find better configurations than manual tuning
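Recommendations 2 and 3 both start from out-of-fold predictions of the base models, so they can share one pipeline. The sketch below runs on synthetic data with assumed hyperparameters (kernel choice, tree count, MLP size, `alpha=1.0`, 5 grouped folds). It builds the fixed 0.3/0.4/0.3 blend and a Ridge meta-learner from the same `oof` matrix.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 10 hypothetical solvents with 15 rows each.
groups = np.repeat(np.arange(10), 15)
X = rng.normal(size=(groups.size, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=groups.size)

# Order matters: it must match the fixed blend weights below (GP, ETR, MLP).
base_models = {
    "gp": GaussianProcessRegressor(
        kernel=Matern(nu=2.5) + WhiteKernel(), normalize_y=True, random_state=0
    ),
    "etr": ExtraTreesRegressor(n_estimators=100, random_state=0),
    "mlp": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
}

# Out-of-fold predictions: every row is predicted by models that never saw its group.
cv = GroupKFold(n_splits=5)
oof = np.column_stack(
    [cross_val_predict(m, X, y, groups=groups, cv=cv) for m in base_models.values()]
)

# Recommendation 2: fixed weighted average.
fixed_blend = oof @ np.array([0.3, 0.4, 0.3])

# Recommendation 3: Ridge meta-learner trained on the base predictions.
meta = Ridge(alpha=1.0).fit(oof, y)
stacked = meta.predict(oof)

print("Fixed blend MAE:", np.mean(np.abs(fixed_blend - y)))
print("Stacked MAE (in-sample for the meta-learner):", np.mean(np.abs(stacked - y)))
print("Meta-learner coefficients:", dict(zip(base_models, np.round(meta.coef_, 3))))
```

The stacked score is in-sample for the meta-learner, so it flatters stacking. An honest comparison would evaluate the meta-learner on an outer grouped split as well.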


In [6]:
# Calculate expected improvement needed
print("=" * 60)
print("GAP TO TARGET ANALYSIS")
print("=" * 60)

target = 0.01727
best_cv = 0.0623
best_lb = 0.0956

print(f"Target: {target:.5f}")
print(f"Best CV: {best_cv:.5f} ({(best_cv/target - 1)*100:.1f}% above target)")
print(f"Best LB: {best_lb:.5f} ({(best_lb/target - 1)*100:.1f}% above target)")

print("\nTo reach target:")
print(f"  - Need to reduce CV by {(1 - target/best_cv)*100:.1f}%")
print(f"  - Need to reduce LB by {(1 - target/best_lb)*100:.1f}%")

print("\nThis is a MASSIVE gap. Possible explanations:")
print("1. Target is based on a different evaluation (e.g., a different metric)")
print("2. Target is based on a private test set with a different distribution")
print("3. There's domain knowledge or an approach we're missing")
print("4. The winning solution uses a fundamentally different approach")

GAP TO TARGET ANALYSIS
Target: 0.01727
Best CV: 0.06230 (260.7% above target)
Best LB: 0.09560 (453.6% above target)

To reach target:
  - Need to reduce CV by 72.3%
  - Need to reduce LB by 81.9%

This is a MASSIVE gap. Possible explanations:
1. Target is based on a different evaluation (e.g., a different metric)
2. Target is based on a private test set with a different distribution
3. There's domain knowledge or an approach we're missing
4. The winning solution uses a fundamentally different approach


In [7]:
# Priority actions
print("=" * 60)
print("PRIORITY ACTIONS FOR NEXT LOOP")
print("=" * 60)

print("\n1. SUBMIT exp_006 to verify CV-LB gap hypothesis")
print("   - CV 0.0688 vs exp_004 CV 0.0623")
print("   - If LB improves despite the worse CV, regularization is narrowing the gap")

print("\n2. TRY: Diverse ensemble (GP + ETR + MLP)")
print("   - Combine fundamentally different model families")
print("   - May capture different aspects of the data")

print("\n3. TRY: Stacking with meta-learner")
print("   - Use base model predictions as features")
print("   - Train simple meta-learner (Ridge/Linear)")

print("\n4. RESEARCH: What do top LB solutions do?")
print("   - Check if there's domain knowledge we're missing")
print("   - Look for physics-informed approaches")

PRIORITY ACTIONS FOR NEXT LOOP

1. SUBMIT exp_006 to verify CV-LB gap hypothesis
   - CV 0.0688 vs exp_004 CV 0.0623
   - If LB improves despite the worse CV, regularization is narrowing the gap

2. TRY: Diverse ensemble (GP + ETR + MLP)
   - Combine fundamentally different model families
   - May capture different aspects of the data

3. TRY: Stacking with meta-learner
   - Use base model predictions as features
   - Train simple meta-learner (Ridge/Linear)

4. RESEARCH: What do top LB solutions do?
   - Check if there's domain knowledge we're missing
   - Look for physics-informed approaches
