# Loop 7 LB Feedback Analysis

## Submission Result
- **exp_006**: Weighted Ensemble (0.6 CatBoost + 0.2 XGB + 0.2 LGB)
- **CV**: 0.8171
- **LB**: 0.8010
- **Gap**: +0.0161 (1.61 percentage points; 1.97% relative to CV)

In [1]:
import pandas as pd
import numpy as np

# All submissions with CV and LB scores
submissions = [
    {'exp': 'exp_000', 'name': 'XGBoost Baseline', 'cv': 0.8067, 'lb': 0.7971, 'pred_rate': None},
    {'exp': 'exp_003', 'name': 'CatBoost Optuna', 'cv': 0.8195, 'lb': 0.8045, 'pred_rate': 0.517},
    {'exp': 'exp_004', 'name': 'Threshold Tuning', 'cv': 0.8193, 'lb': 0.8041, 'pred_rate': 0.538},
    {'exp': 'exp_006', 'name': 'Weighted Ensemble', 'cv': 0.8171, 'lb': 0.8010, 'pred_rate': 0.509},
]

df = pd.DataFrame(submissions)
df['gap'] = df['cv'] - df['lb']
df['gap_pct'] = (df['gap'] / df['cv']) * 100
print(df.to_string(index=False))

    exp              name     cv     lb  pred_rate    gap  gap_pct
exp_000  XGBoost Baseline 0.8067 0.7971        NaN 0.0096 1.190033
exp_003   CatBoost Optuna 0.8195 0.8045      0.517 0.0150 1.830384
exp_004  Threshold Tuning 0.8193 0.8041      0.538 0.0152 1.855242
exp_006 Weighted Ensemble 0.8171 0.8010      0.509 0.0161 1.970383


In [2]:
# Analysis: Did better calibration help?
print("=== CALIBRATION HYPOTHESIS TEST ===")
print("\nHypothesis: Better prediction rate calibration (closer to training 50.4%) would improve LB")
print("\nResults:")
print("  exp_003: pred_rate=51.7%, LB=0.8045 (BEST)")
print("  exp_004: pred_rate=53.8%, LB=0.8041 (0.0004 lower at nearly the same CV)")
print("  exp_006: pred_rate=50.9%, LB=0.8010 (WORST - better calibration didn't help!)")
print("\nConclusion: CALIBRATION HYPOTHESIS REJECTED")
print("  - exp_006 had best calibration (50.9% vs 50.4% training)")
print("  - But it had WORST LB score of recent submissions")
print("  - CV is more important than prediction rate calibration")

=== CALIBRATION HYPOTHESIS TEST ===

Hypothesis: Better prediction rate calibration (closer to training 50.4%) would improve LB

Results:
  exp_003: pred_rate=51.7%, LB=0.8045 (BEST)
  exp_004: pred_rate=53.8%, LB=0.8041 (0.0004 lower at nearly the same CV)
  exp_006: pred_rate=50.9%, LB=0.8010 (WORST - better calibration didn't help!)

Conclusion: CALIBRATION HYPOTHESIS REJECTED
  - exp_006 had best calibration (50.9% vs 50.4% training)
  - But it had WORST LB score of recent submissions
  - CV is more important than prediction rate calibration
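
The comparison above can be made explicit by ranking the three recent submissions on each criterion. This is a minimal sketch using only the numbers already listed; the column names `rate_error`, `rate_rank`, `lb_rank`, and `cv_rank` are introduced here for illustration. With three points, and with CV differing between the submissions, the result is suggestive rather than conclusive.

```python
import pandas as pd

TRAIN_POSITIVE_RATE = 0.504  # training-set positive rate quoted in the hypothesis

recent = pd.DataFrame({
    "exp": ["exp_003", "exp_004", "exp_006"],
    "pred_rate": [0.517, 0.538, 0.509],
    "cv": [0.8195, 0.8193, 0.8171],
    "lb": [0.8045, 0.8041, 0.8010],
})

# Absolute distance between predicted positive rate and the training rate
recent["rate_error"] = (recent["pred_rate"] - TRAIN_POSITIVE_RATE).abs()

# Rank 1 = best: smallest rate error, highest CV, highest LB
recent["rate_rank"] = recent["rate_error"].rank(method="min").astype(int)
recent["cv_rank"] = recent["cv"].rank(ascending=False, method="min").astype(int)
recent["lb_rank"] = recent["lb"].rank(ascending=False, method="min").astype(int)

print(recent.to_string(index=False))
```

If the LB ranking lines up with the CV ranking and not with the rate-error ranking, that is consistent with the conclusion that CV is the better guide here.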


In [3]:
# CV-LB gap analysis
print("=== CV-LB GAP ANALYSIS ===")
print("\nGap by submission:")
for _, row in df.iterrows():
    print(f"  {row['exp']}: CV={row['cv']:.4f}, LB={row['lb']:.4f}, Gap={row['gap']:.4f} ({row['gap_pct']:.2f}%)")

print(f"\nAverage gap: {df['gap'].mean():.4f} ({df['gap_pct'].mean():.2f}%)")
print(f"Gap range: {df['gap'].min():.4f} - {df['gap'].max():.4f}")

# Linear regression to predict LB from CV
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(df['cv'], df['lb'])
print(f"\nLinear model: LB = {slope:.3f} * CV + {intercept:.3f}")
print(f"R-squared: {r_value**2:.4f}")
print(f"\nPredicted LB for CV=0.82: {slope * 0.82 + intercept:.4f}")

=== CV-LB GAP ANALYSIS ===

Gap by submission:
  exp_000: CV=0.8067, LB=0.7971, Gap=0.0096 (1.19%)
  exp_003: CV=0.8195, LB=0.8045, Gap=0.0150 (1.83%)
  exp_004: CV=0.8193, LB=0.8041, Gap=0.0152 (1.86%)
  exp_006: CV=0.8171, LB=0.8010, Gap=0.0161 (1.97%)

Average gap: 0.0140 (1.71%)
Gap range: 0.0096 - 0.0161

Linear model: LB = 0.541 * CV + 0.360
R-squared: 0.9162

Predicted LB for CV=0.82: 0.8040
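
The fitted line can also be inverted to ask what CV it implies for a given LB target. The sketch below refits the same four points with `np.polyfit` (an ordinary least-squares line, used here in place of `stats.linregress`); the helper `cv_needed_for` and the leave-one-out check are additions for illustration. A fit on four submissions is fragile, so treat the numbers as rough guidance.

```python
import numpy as np

cv = np.array([0.8067, 0.8195, 0.8193, 0.8171])
lb = np.array([0.7971, 0.8045, 0.8041, 0.8010])

# Degree-1 least-squares fit; polyfit returns [slope, intercept]
slope, intercept = np.polyfit(cv, lb, deg=1)

def cv_needed_for(target_lb):
    """Invert LB = slope * CV + intercept to get the CV the fit implies."""
    return (target_lb - intercept) / slope

required_cv = cv_needed_for(0.805)
print(f"CV implied by the fit for LB=0.805: {required_cv:.4f}")

# Leave-one-out refits show how far the slope moves when one submission is dropped
loo_slopes = [
    np.polyfit(np.delete(cv, i), np.delete(lb, i), deg=1)[0]
    for i in range(len(cv))
]
print("Leave-one-out slopes:", np.round(loo_slopes, 3))
```

A wide spread in the leave-one-out slopes would mean the implied CV target should not be read too literally.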


In [4]:
# Key insight: What's the best path forward?
print("=== STRATEGIC INSIGHTS ===")
print("\n1. CV-LB CORRELATION:")
print("   - Higher CV generally means higher LB (R²=0.92, 4 submissions)")
print("   - Gap is 0.015-0.016 for the last three submissions (0.010 for the baseline)")
print("   - Calibration doesn't matter as much as CV")

print("\n2. BEST SUBMISSION:")
print("   - exp_003 (CatBoost Optuna) remains best: LB=0.8045")
print("   - CV=0.8195 was highest submitted")
print("   - Prediction rate 51.7% was fine")

print("\n3. WHAT DIDN'T WORK:")
print("   - Threshold tuning (exp_004): LB 0.0004 lower at nearly the same CV")
print("   - Weighted ensemble (exp_006): lower CV = lower LB")
print("   - Stacking (exp_005): not submitted, but CV was lower")

print("\n4. PATH FORWARD:")
print("   - Need to INCREASE CV to improve LB")
print("   - exp_003's CV=0.8195 is our best")
print("   - To beat LB=0.8045, need CV > 0.8195")
print("   - Target: CV >= 0.822 for LB >= 0.805 (the fit gives only 0.8040 at CV=0.82)")

=== STRATEGIC INSIGHTS ===

1. CV-LB CORRELATION:
   - Higher CV generally means higher LB (R²=0.92, 4 submissions)
   - Gap is 0.015-0.016 for the last three submissions (0.010 for the baseline)
   - Calibration doesn't matter as much as CV

2. BEST SUBMISSION:
   - exp_003 (CatBoost Optuna) remains best: LB=0.8045
   - CV=0.8195 was highest submitted
   - Prediction rate 51.7% was fine

3. WHAT DIDN'T WORK:
   - Threshold tuning (exp_004): LB 0.0004 lower at nearly the same CV
   - Weighted ensemble (exp_006): lower CV = lower LB
   - Stacking (exp_005): not submitted, but CV was lower

4. PATH FORWARD:
   - Need to INCREASE CV to improve LB
   - exp_003's CV=0.8195 is our best
   - To beat LB=0.8045, need CV > 0.8195
   - Target: CV >= 0.822 for LB >= 0.805 (the fit gives only 0.8040 at CV=0.82)


In [5]:
# What approaches haven't been tried?
print("=== UNEXPLORED APPROACHES ===")
print("\n1. FEATURE SELECTION:")
print("   - 22 features have importance < 1.0")
print("   - Removing them could reduce overfitting")
print("   - NOT YET TRIED")

print("\n2. STRONGER REGULARIZATION:")
print("   - Current best: depth=8, l2=3.52")
print("   - Could try: depth=6, l2=5.0+")
print("   - Partially explored in loop5 analysis")

print("\n3. DIFFERENT ENSEMBLE STRATEGIES:")
print("   - Weighted ensemble: TRIED (didn't help)")
print("   - Stacking: TRIED (didn't help)")
print("   - Blending with different seeds: NOT TRIED")

print("\n4. NEURAL NETWORK:")
print("   - TabNet or simple MLP")
print("   - Could add diversity for ensemble")
print("   - NOT TRIED")

print("\n5. TARGET ENCODING:")
print("   - Currently using label encoding")
print("   - Target encoding could capture more signal")
print("   - NOT TRIED")

=== UNEXPLORED APPROACHES ===

1. FEATURE SELECTION:
   - 22 features have importance < 1.0
   - Removing them could reduce overfitting
   - NOT YET TRIED

2. STRONGER REGULARIZATION:
   - Current best: depth=8, l2=3.52
   - Could try: depth=6, l2=5.0+
   - Partially explored in loop5 analysis

3. DIFFERENT ENSEMBLE STRATEGIES:
   - Weighted ensemble: TRIED (didn't help)
   - Stacking: TRIED (didn't help)
   - Blending with different seeds: NOT TRIED

4. NEURAL NETWORK:
   - TabNet or simple MLP
   - Could add diversity for ensemble
   - NOT TRIED

5. TARGET ENCODING:
   - Currently using label encoding
   - Target encoding could capture more signal
   - NOT TRIED
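
For the target-encoding idea, the main risk is leaking the label into the feature, so the encoding is usually computed out of fold. The following is a minimal sketch under assumptions, not the pipeline used in these experiments: the function `oof_target_encode`, its `smoothing` parameter, and the toy column names are all hypothetical, and the input is assumed to be a plain string column with a 0/1 target.

```python
import numpy as np
import pandas as pd

def oof_target_encode(cat, target, n_folds=5, smoothing=10.0, seed=0):
    """Out-of-fold mean target encoding, smoothed toward the training-fold mean."""
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(cat)) % n_folds  # random, near-equal folds
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    for fold in range(n_folds):
        val_mask = fold_ids == fold
        train_cat, train_y = cat[~val_mask], target[~val_mask]
        prior = train_y.mean()
        agg = train_y.groupby(train_cat).agg(["sum", "count"])
        smoothed = (agg["sum"] + smoothing * prior) / (agg["count"] + smoothing)
        # Categories unseen in the training folds fall back to the prior
        encoded[val_mask] = cat[val_mask].map(smoothed).fillna(prior).to_numpy()
    return encoded

# Toy data with a hypothetical categorical column and binary target
toy = pd.DataFrame({
    "category": ["A", "A", "B", "B", "A", "B", "C", "A", "B", "A"],
    "y": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
})
toy["category_te"] = oof_target_encode(toy["category"], toy["y"], n_folds=5)
print(toy)
```

Each row is encoded using only rows from other folds, so the encoded value never sees its own label. For test data, the encoding would be fit on the full training set instead.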


In [6]:
# Priority ranking
print("=== PRIORITY RANKING ===")
print("\nBased on analysis, priority order:")
print("\n1. FEATURE SELECTION + CatBoost (HIGH PRIORITY)")
print("   - Remove 22 low-importance features")
print("   - Could reduce overfitting and improve CV")
print("   - Quick to implement")

print("\n2. REGULARIZED CatBoost (MEDIUM PRIORITY)")
print("   - Try depth=6, l2=5.0")
print("   - May reduce CV-LB gap")
print("   - Quick to implement")

print("\n3. TARGET ENCODING (MEDIUM PRIORITY)")
print("   - Replace label encoding with target encoding")
print("   - Could capture more signal")
print("   - Moderate effort")

print("\n4. MULTI-SEED ENSEMBLE (LOW PRIORITY)")
print("   - Train CatBoost with different seeds")
print("   - Average predictions")
print("   - May reduce variance but not bias")

print("\n5. NEURAL NETWORK (LOW PRIORITY)")
print("   - TabNet or MLP for diversity")
print("   - Higher effort, uncertain payoff")

=== PRIORITY RANKING ===

Based on analysis, priority order:

1. FEATURE SELECTION + CatBoost (HIGH PRIORITY)
   - Remove 22 low-importance features
   - Could reduce overfitting and improve CV
   - Quick to implement

2. REGULARIZED CatBoost (MEDIUM PRIORITY)
   - Try depth=6, l2=5.0
   - May reduce CV-LB gap
   - Quick to implement

3. TARGET ENCODING (MEDIUM PRIORITY)
   - Replace label encoding with target encoding
   - Could capture more signal
   - Moderate effort

4. MULTI-SEED ENSEMBLE (LOW PRIORITY)
   - Train CatBoost with different seeds
   - Average predictions
   - May reduce variance but not bias

5. NEURAL NETWORK (LOW PRIORITY)
   - TabNet or MLP for diversity
   - Higher effort, uncertain payoff
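
The top-priority item, dropping features with importance below 1.0, reduces to a simple filter once importances are available. This sketch assumes the importances are held in a pandas Series indexed by feature name; the helper `select_features` and the feature names are made up for illustration, and the real values would come from the fitted model.

```python
import pandas as pd

def select_features(importances, threshold=1.0):
    """Return feature names with importance >= threshold, most important first."""
    kept = importances[importances >= threshold]
    return kept.sort_values(ascending=False).index.tolist()

# Hypothetical importances; in practice these would come from the fitted model,
# e.g. CatBoost's get_feature_importance() paired with the feature names
importances = pd.Series({"f_a": 12.5, "f_b": 0.4, "f_c": 3.1, "f_d": 0.9})

kept = select_features(importances, threshold=1.0)
dropped = [name for name in importances.index if name not in kept]
print("Kept:", kept)
print("Dropped:", dropped)
```

Retraining on the kept columns and comparing CV against exp_003's 0.8195 would show whether the pruning actually helps.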
