# Loop 8 LB Feedback Analysis

## Submission Results
- **exp_004** (best CV): CV 0.0623 → LB 0.0956 (gap: +53%)
- **exp_006** (intermediate regularization): CV 0.0689 → LB 0.0991 (gap: +44%)

## Key Observations
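1. Both submissions score far worse on LB than on CV (relative gaps of 44% and 53%)
2. The more regularized model (exp_006) is WORSE on both CV and LB; its absolute gap is only slightly smaller (0.0302 vs 0.0333)
3. A gap this large is consistent with the test set containing solvents unlike those seen in training, though two submissions cannot confirm it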

## Questions to Answer
1. What's the pattern in CV-LB gap?
2. What approaches from top kernels should we try?
3. How can we reduce the gap?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# CV-LB Gap Analysis
submissions = [
    {'name': 'exp_004 (best CV)', 'cv': 0.0623, 'lb': 0.0956},
    {'name': 'exp_006 (intermediate reg)', 'cv': 0.0689, 'lb': 0.0991},
]

df = pd.DataFrame(submissions)
df['gap'] = df['lb'] - df['cv']
df['gap_pct'] = (df['lb'] - df['cv']) / df['cv'] * 100

print("=== CV-LB Gap Analysis ===")
print(df.to_string(index=False))
print(f"\nAverage gap: {df['gap'].mean():.4f} ({df['gap_pct'].mean():.1f}%)")
target = 0.01727  # target LB score
print(f"\nTarget: {target}")
print(f"Best LB: {df['lb'].min():.4f}")
print(f"Gap to target: {(df['lb'].min() - target) / target * 100:.1f}%")

=== CV-LB Gap Analysis ===
                      name     cv     lb    gap   gap_pct
         exp_004 (best CV) 0.0623 0.0956 0.0333 53.451043
exp_006 (intermediate reg) 0.0689 0.0991 0.0302 43.831640

Average gap: 0.0318 (48.6%)

Target: 0.01727
Best LB: 0.0956
Gap to target: 453.6%


In [2]:
# Key insight: more regularization (exp_006) gave a WORSE LB score.
# The absolute CV-LB gap barely moved (0.0333 → 0.0302), so the problem is
# probably NOT overfitting in the traditional sense.
# The test set likely contains chemically distinct solvents that call for
# BETTER features, not simpler models.

print("=== Critical Insight ===")
print("exp_004 (less regularized): CV 0.0623 → LB 0.0956")
print("exp_006 (more regularized): CV 0.0689 → LB 0.0991")
print("\nMore regularization made LB WORSE!")
print("This suggests:")
print("1. The problem is NOT traditional overfitting")
print("2. We need BETTER features that generalize to new solvents")
print("3. Simpler models lose signal and barely narrow the gap")

=== Critical Insight ===
exp_004 (less regularized): CV 0.0623 → LB 0.0956
exp_006 (more regularized): CV 0.0689 → LB 0.0991

More regularization made LB WORSE!
This suggests:
1. The problem is NOT traditional overfitting
2. We need BETTER features that generalize to new solvents
3. Simpler models lose signal and barely narrow the gap


In [3]:
# What the top kernels do differently:
# 1. lishellliang kernel: MLP + XGBoost + RF + LightGBM ensemble with learned weights
# 2. Uses GroupKFold (5-fold) instead of Leave-One-Out
# 3. Uses Optuna for hyperparameter tuning
# 4. Uses Spange descriptors

print("=== Top Kernel Approaches ===")
print("\n1. lishellliang kernel (good CV/LB):")
print("   - Ensemble: MLP + XGBoost + RF + LightGBM")
print("   - Weighted averaging with learned weights")
print("   - GroupKFold (5-fold) for validation")
print("   - Optuna hyperparameter tuning")
print("   - Spange descriptors")
print("\n2. Key differences from our approach:")
print("   - We use per-target models, they use single ensemble for all targets")
print("   - We use Leave-One-Out, they use GroupKFold")
print("   - We haven't tried ensemble of diverse model families")

=== Top Kernel Approaches ===

1. lishellliang kernel (good CV/LB):
   - Ensemble: MLP + XGBoost + RF + LightGBM
   - Weighted averaging with learned weights
   - GroupKFold (5-fold) for validation
   - Optuna hyperparameter tuning
   - Spange descriptors

2. Key differences from our approach:
   - We use per-target models, they use single ensemble for all targets
   - We use Leave-One-Out, they use GroupKFold
   - We haven't tried ensemble of diverse model families
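
The validation difference above can be made concrete with a small sketch. It assumes that our "Leave-One-Out" means holding out one solvent at a time (scikit-learn's `LeaveOneGroupOut`), which these notes do not state explicitly. The data is synthetic and the solvent counts are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Hypothetical data: 12 solvents with 5 measurements each (not the competition data).
rng = np.random.default_rng(0)
n_solvents, n_per_solvent = 12, 5
groups = np.repeat(np.arange(n_solvents), n_per_solvent)
X = rng.normal(size=(groups.size, 4))
y = rng.normal(size=groups.size)

def summarize(splitter):
    """Return (number of folds, held-out solvents per fold) for a group-aware splitter."""
    held_out = [np.unique(groups[test_idx]).size
                for _, test_idx in splitter.split(X, y, groups)]
    return len(held_out), held_out

logo_folds, logo_sizes = summarize(LeaveOneGroupOut())
gkf_folds, gkf_sizes = summarize(GroupKFold(n_splits=5))

print(f"LeaveOneGroupOut: {logo_folds} folds, solvents held out per fold: {logo_sizes}")
print(f"GroupKFold(5):    {gkf_folds} folds, solvents held out per fold: {gkf_sizes}")
```

Both splitters keep every row of a solvent on the same side of the split. The difference is how many solvents are unseen at once: GroupKFold trains on fewer solvents per fold, which may resemble a test set with several new solvents more closely.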


In [4]:
# Strategy for next experiments:
# 1. Try ensemble of diverse models (MLP + XGB + RF + LGB)
# 2. Use learned weights for ensemble
# 3. Focus on feature engineering that generalizes

print("=== Recommended Strategy ===")
print("\n1. ENSEMBLE OF DIVERSE MODELS (HIGH PRIORITY)")
print("   - Combine MLP + XGBoost + RandomForest + LightGBM")
print("   - Use weighted averaging (learn weights via Optuna or grid search)")
print("   - This is what top kernels do!")
print("\n2. FEATURE ENGINEERING")
print("   - Focus on features that capture chemical similarity")
print("   - Spange descriptors work well")
print("   - Consider adding interaction features")
print("\n3. VALIDATION")
print("   - GroupKFold may give more realistic CV estimates")
print("   - But our Leave-One-Out is more conservative")
print("\n4. DO NOT TRY")
print("   - More regularization (made LB worse)")
print("   - Simpler models (lose signal, barely narrow the gap)")

=== Recommended Strategy ===

1. ENSEMBLE OF DIVERSE MODELS (HIGH PRIORITY)
   - Combine MLP + XGBoost + RandomForest + LightGBM
   - Use weighted averaging (learn weights via Optuna or grid search)
   - This is what top kernels do!

2. FEATURE ENGINEERING
   - Focus on features that capture chemical similarity
   - Spange descriptors work well
   - Consider adding interaction features

3. VALIDATION
   - GroupKFold may give more realistic CV estimates
   - But our Leave-One-Out is more conservative

4. DO NOT TRY
   - More regularization (made LB worse)
   - Simpler models (lose signal, barely narrow the gap)
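
A minimal sketch of the grid-search option for learning ensemble weights, using out-of-fold predictions. The helper names (`simplex_grid`, `fit_blend_weights`), the MAE objective, and the synthetic predictions are all assumptions for illustration; the real objective should match the competition metric.

```python
import itertools
import numpy as np

def simplex_grid(n_models, steps=10):
    """Yield non-negative weight vectors on a 1/steps grid that sum to 1."""
    for combo in itertools.product(range(steps + 1), repeat=n_models - 1):
        if sum(combo) <= steps:
            yield np.array(combo + (steps - sum(combo),), dtype=float) / steps

def fit_blend_weights(oof_preds, y_true, steps=10):
    """Pick the grid weights that minimize MAE of the blended out-of-fold predictions.

    oof_preds has shape (n_samples, n_models).
    """
    best_w, best_mae = None, np.inf
    for w in simplex_grid(oof_preds.shape[1], steps):
        mae = np.mean(np.abs(oof_preds @ w - y_true))
        if mae < best_mae:
            best_w, best_mae = w, mae
    return best_w, best_mae

# Synthetic stand-ins for out-of-fold predictions from three model families.
rng = np.random.default_rng(0)
y_true = rng.uniform(0, 1, size=200)
oof_preds = np.column_stack([
    y_true + rng.normal(0, 0.05, size=200),
    y_true + rng.normal(0, 0.08, size=200),
    y_true + rng.normal(0, 0.12, size=200),
])

weights, blend_mae = fit_blend_weights(oof_preds, y_true)
single_maes = np.mean(np.abs(oof_preds - y_true[:, None]), axis=0)
print("weights:", weights)
print("blend MAE:", round(blend_mae, 4), "best single MAE:", round(single_maes.min(), 4))
```

Because the grid includes the one-hot weight vectors, the blend can never score worse than the best single model on the data used to fit the weights. Weights fitted this way are still tuned to the CV folds, so they inherit whatever bias the CV scheme has.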


In [5]:
# Experiments we have:
experiments = [
    {'name': '001_baseline_ensemble', 'cv': 0.0814, 'model': 'MLP+XGB+LGB+RF'},
    {'name': '002_template_compliant', 'cv': 0.0810, 'model': 'MLP+XGB+LGB+RF'},
    {'name': '003_simple_rf', 'cv': 0.0805, 'model': 'RandomForest'},
    {'name': '004_per_target', 'cv': 0.0813, 'model': 'PerTarget (HGB+ETR)'},
    {'name': '005_no_tta', 'cv': 0.0623, 'model': 'PerTarget (HGB+ETR) NO TTA'},
    {'name': '006_ridge', 'cv': 0.0896, 'model': 'Ridge'},
    {'name': '007_intermediate_reg', 'cv': 0.0689, 'model': 'PerTarget depth=5/7'},
    {'name': '008_gaussian_process', 'cv': 0.0721, 'model': 'GP (Matern)'},
]

df_exp = pd.DataFrame(experiments)
df_exp = df_exp.sort_values('cv')
print("=== All Experiments (sorted by CV) ===")
print(df_exp.to_string(index=False))
best = df_exp.loc[df_exp['cv'].idxmin()]
print(f"\nBest CV: {best['cv']:.4f} ({best['name']})")

=== All Experiments (sorted by CV) ===
                  name     cv                      model
            005_no_tta 0.0623 PerTarget (HGB+ETR) NO TTA
  007_intermediate_reg 0.0689        PerTarget depth=5/7
  008_gaussian_process 0.0721                GP (Matern)
         003_simple_rf 0.0805               RandomForest
002_template_compliant 0.0810             MLP+XGB+LGB+RF
        004_per_target 0.0813        PerTarget (HGB+ETR)
 001_baseline_ensemble 0.0814             MLP+XGB+LGB+RF
             006_ridge 0.0896                      Ridge

Best CV: 0.0623 (005_no_tta)


In [6]:
# What we haven't tried:
print("=== APPROACHES NOT YET TRIED ===")
print("\n1. ENSEMBLE OF DIVERSE MODEL FAMILIES")
print("   - Combine GP + ETR + HGB + MLP")
print("   - Each model family captures different patterns")
print("   - Weighted averaging can reduce variance")
print("\n2. NEURAL NETWORK WITH BETTER ARCHITECTURE")
print("   - Deeper MLP with residual connections")
print("   - Attention mechanism for solvent features")
print("\n3. STACKING/BLENDING")
print("   - Use predictions from multiple models as features")
print("   - Train meta-learner on stacked predictions")
print("\n4. FEATURE SELECTION/ENGINEERING")
print("   - PCA on combined features")
print("   - Feature importance analysis")
print("   - Interaction features between process and solvent")

=== APPROACHES NOT YET TRIED ===

1. ENSEMBLE OF DIVERSE MODEL FAMILIES
   - Combine GP + ETR + HGB + MLP
   - Each model family captures different patterns
   - Weighted averaging can reduce variance

2. NEURAL NETWORK WITH BETTER ARCHITECTURE
   - Deeper MLP with residual connections
   - Attention mechanism for solvent features

3. STACKING/BLENDING
   - Use predictions from multiple models as features
   - Train meta-learner on stacked predictions

4. FEATURE SELECTION/ENGINEERING
   - PCA on combined features
   - Feature importance analysis
   - Interaction features between process and solvent
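
One way the process-solvent interaction features could be built is as all pairwise products of the two feature blocks. This is only a sketch: `add_interactions` is a hypothetical helper, and the column meanings in the example (two process variables, three solvent descriptors) are made up.

```python
import numpy as np

def add_interactions(process_feats, solvent_feats):
    """Append every product process_i * solvent_j to the raw feature columns."""
    # Per-row outer product of shape (n, p, s), flattened to (n, p * s).
    cross = np.einsum("np,ns->nps", process_feats, solvent_feats)
    cross = cross.reshape(process_feats.shape[0], -1)
    return np.hstack([process_feats, solvent_feats, cross])

# Hypothetical example: 4 rows, 2 process variables, 3 solvent descriptors.
rng = np.random.default_rng(0)
process = rng.uniform(size=(4, 2))
solvent = rng.uniform(size=(4, 3))

features = add_interactions(process, solvent)
print(features.shape)  # 2 raw + 3 raw + 2 * 3 interaction columns
```

The column count grows as p * s, so with many descriptors it may be worth pairing this with the feature importance analysis listed above.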


## Conclusions

### Key Insight
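More regularization made LB WORSE (0.0956 → 0.0991) while narrowing the absolute CV-LB gap only slightly (0.0333 → 0.0302). This suggests:
1. Traditional overfitting is NOT the main cause of the gap
2. We need BETTER features, not simpler models
3. The test set likely contains solvents that are chemically unlike the training solvents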

### Recommended Next Steps
1. **Ensemble of diverse models** - Combine GP + ETR + HGB + MLP with learned weights
2. **Feature engineering** - Focus on features that capture chemical similarity
3. **Stacking** - Use predictions from multiple models as features

### What NOT to Try
- More regularization (made LB worse)
- Simpler models (lose signal, barely narrow the gap)
- GP alone (CV 0.0721, worse than the best tree-based model at 0.0623)