# Loop 7 Analysis: Calibration vs CV Trade-off

## Key Question:
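Should we submit exp_006 (weighted ensemble) to test whether better calibration compensates for its lower CV? Here "calibration" means how closely the predicted positive rate matches the training rate.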

## Evaluator's Recommendation:
- Submit exp_006 to test the calibration hypothesis
- Its prediction rate (50.9%) is the closest to the training rate (50.36%)
- Even if it doesn't beat exp_003, the result shows whether prediction rate or CV is the better guide to LB

In [1]:
import pandas as pd
import numpy as np

# All experiments with CV, LB, and prediction rates
experiments = [
    {'exp': 'exp_000', 'cv': 0.80674, 'lb': 0.7971, 'pred_rate': None, 'model': 'XGBoost baseline'},
    {'exp': 'exp_003', 'cv': 0.81951, 'lb': 0.8045, 'pred_rate': 0.517, 'model': 'CatBoost Optuna'},
    {'exp': 'exp_004', 'cv': 0.81928, 'lb': 0.8041, 'pred_rate': 0.538, 'model': 'CatBoost threshold'},
    {'exp': 'exp_006', 'cv': 0.81709, 'lb': None, 'pred_rate': 0.509, 'model': 'Weighted ensemble'},
]

df = pd.DataFrame(experiments)
df['gap'] = df['cv'] - df['lb']
df['gap_pct'] = (df['gap'] / df['cv']) * 100
print("Experiment Summary:")
print(df.to_string(index=False))
print(f"\nTraining transported rate: 50.36%")

Experiment Summary:
    exp      cv     lb  pred_rate              model     gap  gap_pct
exp_000 0.80674 0.7971        NaN   XGBoost baseline 0.00964 1.194933
exp_003 0.81951 0.8045      0.517    CatBoost Optuna 0.01501 1.831582
exp_004 0.81928 0.8041      0.538 CatBoost threshold 0.01518 1.852846
exp_006 0.81709    NaN      0.509  Weighted ensemble     NaN      NaN

Training transported rate: 50.36%


In [2]:
# Analyze prediction rate vs LB performance
print("\n=== PREDICTION RATE vs LB PERFORMANCE ===")
print(f"Training rate: 50.36%")
print()
submitted = df[df['lb'].notna()].copy()
# Absolute distance from the training rate; stays NaN where pred_rate is missing
submitted['rate_diff'] = (submitted['pred_rate'] - 0.5036).abs()
print(submitted[['exp', 'pred_rate', 'lb', 'cv', 'gap']].to_string(index=False))

# Pattern analysis
print("\n=== PATTERN ANALYSIS ===")
print("exp_003: pred_rate=51.7% (diff=1.34%) -> LB=0.8045 (BEST)")
print("exp_004: pred_rate=53.8% (diff=3.44%) -> LB=0.8041 (worse)")
print("exp_006: pred_rate=50.9% (diff=0.54%) -> LB=??? (closest to training!)")
print("\nHypothesis: Closer to training rate = better LB")


=== PREDICTION RATE vs LB PERFORMANCE ===
Training rate: 50.36%

    exp  pred_rate     lb      cv     gap
exp_000        NaN 0.7971 0.80674 0.00964
exp_003      0.517 0.8045 0.81951 0.01501
exp_004      0.538 0.8041 0.81928 0.01518

=== PATTERN ANALYSIS ===
exp_003: pred_rate=51.7% (diff=1.34%) -> LB=0.8045 (BEST)
exp_004: pred_rate=53.8% (diff=3.44%) -> LB=0.8041 (worse)
exp_006: pred_rate=50.9% (diff=0.54%) -> LB=??? (closest to training!)

Hypothesis: Closer to training rate = better LB
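
The hand-typed differences in the pattern analysis can be recomputed from the table values, which also makes explicit how many data points back the hypothesis. A minimal sketch that reuses the numbers from the experiment list; `TRAIN_RATE`, `runs`, and `rate_diffs` are names introduced here for illustration only.

```python
# Training transported rate quoted in this notebook (50.36%).
TRAIN_RATE = 0.5036

# (pred_rate, lb) per experiment, copied from the summary table.
# exp_000 is omitted because its prediction rate was not recorded.
runs = {
    "exp_003": (0.517, 0.8045),
    "exp_004": (0.538, 0.8041),
    "exp_006": (0.509, None),  # not submitted yet
}

# Absolute distance from the training rate, in percentage points.
rate_diffs = {
    name: round(abs(rate - TRAIN_RATE) * 100, 2)
    for name, (rate, _) in runs.items()
}

for name, diff in sorted(rate_diffs.items(), key=lambda kv: kv[1]):
    lb = runs[name][1]
    lb_text = f"{lb:.4f}" if lb is not None else "n/a"
    print(f"{name}: diff={diff:.2f} pp, LB={lb_text}")

# Only experiments with both a prediction rate and an LB score
# can support or contradict the hypothesis.
n_evidence = sum(lb is not None for _, lb in runs.values())
print(f"Data points with both pred_rate and LB: {n_evidence}")
```

Only two submitted experiments have a recorded prediction rate, so the pattern is suggestive at best. Submitting exp_006 would add a third point.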


In [3]:
# Predict LB for exp_006 using different methods
print("\n=== LB PREDICTION FOR exp_006 ===")

# Method 1: Using average CV-LB gap
mean_gap = df[df['lb'].notna()]['gap'].mean()
pred_lb_gap = 0.81709 - mean_gap
print(f"Method 1 (avg gap {mean_gap:.4f}): LB = {pred_lb_gap:.4f}")

# Method 2: Using recent gap (exp_004)
recent_gap = 0.81928 - 0.8041
pred_lb_recent = 0.81709 - recent_gap
print(f"Method 2 (recent gap {recent_gap:.4f}): LB = {pred_lb_recent:.4f}")

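# Method 3: Calibration-adjusted (hypothesis: better calibration reduces gap)
# exp_003 gap was 0.0150 (1.83% of CV), exp_004 gap was 0.0152 (1.85% of CV).
# The 0.0002 difference is small, so "worse calibration = bigger gap" is only weakly supported.
# exp_006 has the best calibration, so its gap might be smaller.
calibration_adjusted_gap = 0.012  # Assumed, not measured; smaller than both CatBoost gaps, so this is the optimistic case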
pred_lb_calibration = 0.81709 - calibration_adjusted_gap
print(f"Method 3 (calibration-adjusted gap {calibration_adjusted_gap:.4f}): LB = {pred_lb_calibration:.4f}")

print(f"\nBest LB so far (exp_003): 0.8045")
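# Take min/max explicitly instead of assuming which method gives the extremes
preds = [pred_lb_gap, pred_lb_recent, pred_lb_calibration]
print(f"\nRange of predictions: {min(preds):.4f} - {max(preds):.4f}")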


=== LB PREDICTION FOR exp_006 ===
Method 1 (avg gap 0.0133): LB = 0.8038
Method 2 (recent gap 0.0152): LB = 0.8019
Method 3 (calibration-adjusted gap 0.0120): LB = 0.8051

Best LB so far (exp_003): 0.8045

Range of predictions: 0.8019 - 0.8051
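
How much weight the 0.0004 LB difference between exp_003 and exp_004 can bear depends on the size of the public test set, which this notebook does not state. A rough sketch, assuming the LB metric is plain accuracy over independent rows; the `n_public` values are placeholders, not the competition's actual size, and `accuracy_std_error` is a helper introduced here.

```python
import math


def accuracy_std_error(acc: float, n_rows: int) -> float:
    """Binomial standard error of an accuracy measured on n_rows samples."""
    return math.sqrt(acc * (1.0 - acc) / n_rows)


best_lb = 0.8045                  # exp_003
lb_difference = 0.8045 - 0.8041   # exp_003 vs exp_004

# Hypothetical public test set sizes; substitute the real one.
for n_public in (1_000, 5_000, 20_000):
    se = accuracy_std_error(best_lb, n_public)
    print(f"n={n_public:>6}: SE={se:.4f}, difference/SE={lb_difference / se:.2f}")
```

This is the error of a single score. Both submissions are scored on the same rows, so a paired comparison could be tighter, but the sketch gives a sense of scale for reading the predicted range of 0.8019 to 0.8051.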


In [4]:
# Decision analysis
print("\n=== DECISION ANALYSIS ===")
print("\nArguments FOR submitting exp_006:")
print("1. Prediction rate (50.9%) is closest to training (50.4%)")
print("2. We have evidence that prediction rate matters (exp_004 failure)")
print("3. We have 7 submissions remaining - plenty of room to test")
print("4. Even if it fails, we learn about calibration vs CV trade-off")
print("5. Evaluator recommends it")

print("\nArguments AGAINST submitting exp_006:")
print("1. CV (0.81709) is lower than exp_003 (0.81951) by 0.24%")
print("2. If CV-LB gap is constant, this predicts worse LB")
print("3. Could waste a submission")

print("\nMy assessment:")
print("- 40% chance exp_006 beats exp_003 (calibration hypothesis)")
print("- 60% chance exp_006 underperforms (CV dominates)")
print("- Either way, we learn something valuable")
print("- With 7 submissions remaining, this is a good use of quota")


=== DECISION ANALYSIS ===

Arguments FOR submitting exp_006:
1. Prediction rate (50.9%) is closest to training (50.4%)
2. We have evidence that prediction rate matters (exp_004 failure)
3. We have 7 submissions remaining - plenty of room to test
4. Even if it fails, we learn about calibration vs CV trade-off
5. Evaluator recommends it

Arguments AGAINST submitting exp_006:
1. CV (0.81709) is lower than exp_003 (0.81951) by 0.24%
2. If CV-LB gap is constant, this predicts worse LB
3. Could waste a submission

My assessment:
- 40% chance exp_006 beats exp_003 (calibration hypothesis)
- 60% chance exp_006 underperforms (CV dominates)
- Either way, we learn something valuable
- With 7 submissions remaining, this is a good use of quota


In [5]:
# What else should we try?
print("\n=== UNEXPLORED APPROACHES ===")
print("\n1. FEATURE SELECTION (high priority)")
print("   - 22 features have importance < 1.0")
print("   - Could reduce overfitting")
print("   - Not yet tried")

print("\n2. REGULARIZED CATBOOST")
print("   - depth=6 (vs 8), l2_leaf_reg=5.0 (vs 3.52)")
print("   - subsample=0.8 for randomness")
print("   - Partially tried in loop 5 analysis but not as full experiment")

print("\n3. DIFFERENT ENSEMBLE WEIGHTS")
print("   - Current: 0.6*CatBoost + 0.2*XGB + 0.2*LGB")
print("   - Could try: 0.7*CatBoost + 0.15*XGB + 0.15*LGB")
print("   - Or: 0.5*CatBoost + 0.25*XGB + 0.25*LGB")

print("\n4. NEURAL NETWORK")
print("   - Not tried at all")
print("   - Could add diversity for ensembling")
print("   - TabNet or simple MLP")

print("\n5. PSEUDO-LABELING")
print("   - Use confident predictions on test set")
print("   - Retrain with pseudo-labels")
print("   - Risky but could help")


=== UNEXPLORED APPROACHES ===

1. FEATURE SELECTION (high priority)
   - 22 features have importance < 1.0
   - Could reduce overfitting
   - Not yet tried

2. REGULARIZED CATBOOST
   - depth=6 (vs 8), l2_leaf_reg=5.0 (vs 3.52)
   - subsample=0.8 for randomness
   - Partially tried in loop 5 analysis but not as full experiment

3. DIFFERENT ENSEMBLE WEIGHTS
   - Current: 0.6*CatBoost + 0.2*XGB + 0.2*LGB
   - Could try: 0.7*CatBoost + 0.15*XGB + 0.15*LGB
   - Or: 0.5*CatBoost + 0.25*XGB + 0.25*LGB

4. NEURAL NETWORK
   - Not tried at all
   - Could add diversity for ensembling
   - TabNet or simple MLP

5. PSEUDO-LABELING
   - Use confident predictions on test set
   - Retrain with pseudo-labels
   - Risky but could help
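
The ensemble-weight variants in item 3 can be compared on predicted-positive rate before spending a submission. A minimal sketch using synthetic probabilities in place of the real model outputs; the arrays `p_cat`, `p_xgb`, `p_lgb`, the `blend` helper, and the 0.5 threshold are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for each model's test-set probabilities.
# Replace with the real CatBoost / XGBoost / LightGBM predictions.
n_rows = 1_000
p_cat = rng.uniform(0.0, 1.0, n_rows)
p_xgb = np.clip(p_cat + rng.normal(0.0, 0.05, n_rows), 0.0, 1.0)
p_lgb = np.clip(p_cat + rng.normal(0.0, 0.05, n_rows), 0.0, 1.0)


def blend(weights, probs):
    """Weighted average of per-model probabilities; weights must sum to 1."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("ensemble weights must sum to 1")
    return np.average(np.vstack(probs), axis=0, weights=weights)


# Weight sets listed in the notebook (CatBoost, XGB, LGB).
candidates = {
    "current": (0.6, 0.2, 0.2),
    "cat_heavy": (0.7, 0.15, 0.15),
    "balanced": (0.5, 0.25, 0.25),
}

pred_rates = {}
for name, w in candidates.items():
    blended = blend(w, [p_cat, p_xgb, p_lgb])
    # Share of rows predicted positive at the assumed 0.5 threshold.
    pred_rates[name] = float((blended >= 0.5).mean())
    print(f"{name}: weights={w}, pred_rate={pred_rates[name]:.3f}")
```

With real predictions, each candidate's `pred_rate` can be compared against the 50.36% training rate alongside its CV score.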


In [6]:
# Strategy recommendation
print("\n=== STRATEGY RECOMMENDATION ===")
print("\n1. SUBMIT exp_006 to test calibration hypothesis")
print("   - Quick feedback on whether calibration > CV")
print("   - Uses 1 of 7 remaining submissions")

print("\n2. NEXT EXPERIMENT: Feature Selection + CatBoost")
print("   - Remove 22 low-importance features")
print("   - Retrain CatBoost with best params")
print("   - Check if reduced feature set improves generalization")

print("\n3. IF exp_006 beats exp_003:")
print("   - Calibration matters more than CV")
print("   - Focus on approaches that improve calibration")
print("   - Try different ensemble weights")

print("\n4. IF exp_006 doesn't beat exp_003:")
print("   - CV is more important than calibration")
print("   - Focus on feature selection + regularization")
print("   - Try to improve CV while maintaining reasonable calibration")

print("\n5. LONG-TERM: Build diverse models for final ensemble")
print("   - CatBoost (best single model)")
print("   - Feature-selected CatBoost")
print("   - Regularized CatBoost")
print("   - Maybe Neural Network for diversity")


=== STRATEGY RECOMMENDATION ===

1. SUBMIT exp_006 to test calibration hypothesis
   - Quick feedback on whether calibration > CV
   - Uses 1 of 7 remaining submissions

2. NEXT EXPERIMENT: Feature Selection + CatBoost
   - Remove 22 low-importance features
   - Retrain CatBoost with best params
   - Check if reduced feature set improves generalization

3. IF exp_006 beats exp_003:
   - Calibration matters more than CV
   - Focus on approaches that improve calibration
   - Try different ensemble weights

4. IF exp_006 doesn't beat exp_003:
   - CV is more important than calibration
   - Focus on feature selection + regularization
   - Try to improve CV while maintaining reasonable calibration

5. LONG-TERM: Build diverse models for final ensemble
   - CatBoost (best single model)
   - Feature-selected CatBoost
   - Regularized CatBoost
   - Maybe Neural Network for diversity
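
Step 2 hinges on dropping the features whose importance falls below 1.0. A sketch of that filter with made-up importances; the feature names and values are hypothetical, and in practice the scores would come from the fitted model (for CatBoost, for example, via `get_feature_importance()`).

```python
import pandas as pd

# Hypothetical importances; real ones would come from the fitted model.
importances = pd.Series(
    {"feat_a": 24.1, "feat_b": 11.3, "feat_c": 0.7, "feat_d": 0.2, "feat_e": 3.9}
)

IMPORTANCE_THRESHOLD = 1.0  # cutoff used in this analysis

# Split feature names into those below and at/above the cutoff.
low = importances[importances < IMPORTANCE_THRESHOLD].index.tolist()
keep = importances[importances >= IMPORTANCE_THRESHOLD].index.tolist()

print(f"Dropping {len(low)} features: {low}")
print(f"Keeping {len(keep)} features: {keep}")

# X_train / X_test are the existing feature frames (not defined here):
# X_train_sel, X_test_sel = X_train[keep], X_test[keep]
```

Comparing CV and the CV-LB gap with and without the `low` features would show whether the reduced set actually generalizes better.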
