# Loop 5 LB Feedback Analysis

Analyzing the CV-LB gap from exp_004 submission:
- CV: 0.8193
- LB: 0.8041
- Gap: +0.0152 (1.52 percentage points; 1.86% relative to CV)

This is slightly WORSE than exp_003 (LB 0.8045) despite a nearly identical CV!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.80674, 'lb': 0.79705, 'model': 'XGBoost Baseline'},
    {'exp': 'exp_003', 'cv': 0.81951, 'lb': 0.80453, 'model': 'CatBoost Tuned (threshold=0.5)'},
    {'exp': 'exp_004', 'cv': 0.81928, 'lb': 0.80406, 'model': 'CatBoost Native Cat (threshold=0.47)'}
]

df = pd.DataFrame(submissions)
df['gap'] = df['cv'] - df['lb']
df['gap_pct'] = df['gap'] / df['cv'] * 100
print(df.to_string())

       exp       cv       lb                                 model      gap   gap_pct
0  exp_000  0.80674  0.79705                      XGBoost Baseline  0.00969  1.201130
1  exp_003  0.81951  0.80453        CatBoost Tuned (threshold=0.5)  0.01498  1.827922
2  exp_004  0.81928  0.80406  CatBoost Native Cat (threshold=0.47)  0.01522  1.857729


In [2]:
# Key insight: Threshold tuning HURT LB performance!
print("=== THRESHOLD TUNING ANALYSIS ===")
print(f"exp_003 (threshold=0.5): LB = 0.80453")
print(f"exp_004 (threshold=0.47): LB = 0.80406")
print(f"Threshold tuning HURT LB by: {0.80406 - 0.80453:.5f} (-0.047 percentage points)")
print()
print("This is consistent with the evaluator's concern:")
print("- Threshold 0.47 shifts predicted rate from 50.4% to 53.8%")
print("- This distribution shift doesn't generalize to test set")
print("- Threshold tuning on OOF is a form of overfitting")

=== THRESHOLD TUNING ANALYSIS ===
exp_003 (threshold=0.5): LB = 0.80453
exp_004 (threshold=0.47): LB = 0.80406
Threshold tuning HURT LB by: -0.00047 (-0.047 percentage points)

This is consistent with the evaluator's concern:
- Threshold 0.47 shifts predicted rate from 50.4% to 53.8%
- This distribution shift doesn't generalize to test set
- Threshold tuning on OOF is a form of overfitting
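
One way to estimate how much of a tuned threshold's gain is optimism is to pick the threshold on one half of the OOF predictions and score it on the other half. The sketch below does this on synthetic data; `split_half_gain`, the threshold grid, and the simulated `oof` probabilities are all illustrative assumptions, not the notebook's actual predictions, and accuracy is assumed as the metric.

```python
import random

def accuracy(y_true, proba, threshold):
    """Fraction of rows where (proba >= threshold) matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, proba))
    return hits / len(y_true)

def best_threshold(y_true, proba, grid):
    """Grid point with the highest accuracy (first one wins on ties)."""
    return max(grid, key=lambda t: accuracy(y_true, proba, t))

def split_half_gain(y_true, proba, grid, n_repeats=50, seed=0):
    """Average held-out accuracy gain of a tuned threshold over 0.5."""
    rng = random.Random(seed)
    idx = list(range(len(y_true)))
    gains = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        half = len(idx) // 2
        tune, hold = idx[:half], idx[half:]
        # Tune on one half only, then score on the untouched half.
        t = best_threshold([y_true[i] for i in tune],
                           [proba[i] for i in tune], grid)
        y_hold = [y_true[i] for i in hold]
        p_hold = [proba[i] for i in hold]
        gains.append(accuracy(y_hold, p_hold, t) - accuracy(y_hold, p_hold, 0.5))
    return sum(gains) / len(gains)

# Synthetic stand-in for OOF predictions (NOT the competition data).
rng = random.Random(42)
y = [rng.randint(0, 1) for _ in range(2000)]
oof = [min(1.0, max(0.0, 0.5 + (0.2 if label else -0.2) + rng.gauss(0, 0.25)))
       for label in y]
grid = [round(0.40 + 0.01 * k, 2) for k in range(21)]  # 0.40 .. 0.60

in_sample_gain = accuracy(y, oof, best_threshold(y, oof, grid)) - accuracy(y, oof, 0.5)
held_out_gain = split_half_gain(y, oof, grid)
print(f"in-sample gain: {in_sample_gain:+.4f}, held-out gain: {held_out_gain:+.4f}")
```

If the held-out gain is near zero or negative while the in-sample gain is positive, the tuned threshold is probably fitting noise in the OOF predictions rather than a real shift in the decision boundary.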


In [3]:
# CV-LB gap is increasing
print("=== CV-LB GAP TREND ===")
for sub in submissions:
    gap = sub['cv'] - sub['lb']
    print(f"{sub['exp']}: CV={sub['cv']:.5f}, LB={sub['lb']:.5f}, Gap={gap:.5f} ({gap/sub['cv']*100:.2f}%)")

print()
print("Gap is increasing: 0.97 -> 1.50 -> 1.52 percentage points (1.20% -> 1.83% -> 1.86% relative to CV)")
print("We're overfitting to CV more with each experiment")

=== CV-LB GAP TREND ===
exp_000: CV=0.80674, LB=0.79705, Gap=0.00969 (1.20%)
exp_003: CV=0.81951, LB=0.80453, Gap=0.01498 (1.83%)
exp_004: CV=0.81928, LB=0.80406, Gap=0.01522 (1.86%)

Gap is increasing: 0.97 -> 1.50 -> 1.52 percentage points (1.20% -> 1.83% -> 1.86% relative to CV)
We're overfitting to CV more with each experiment
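
To judge how much weight these differences can bear, it helps to compare them with the sampling noise of the leaderboard score itself. The sketch below assumes the metric is accuracy and uses a HYPOTHETICAL public test size `n_public`; the real number of scored rows is not stated in this notebook and should be substituted.

```python
import math

def accuracy_std_error(acc, n_rows):
    """Binomial standard error of an accuracy measured on n_rows samples."""
    return math.sqrt(acc * (1.0 - acc) / n_rows)

# HYPOTHETICAL public-LB size; replace with the real number of scored rows.
n_public = 2000

lb_scores = {"exp_000": 0.79705, "exp_003": 0.80453, "exp_004": 0.80406}
for exp, lb in lb_scores.items():
    se = accuracy_std_error(lb, n_public)
    print(f"{exp}: LB={lb:.5f} +/- {se:.5f} (1 s.e., n={n_public})")

# How many rows does the exp_003 -> exp_004 difference correspond to?
diff = lb_scores["exp_004"] - lb_scores["exp_003"]
print(f"LB difference {diff:+.5f} ~ {diff * n_public:+.1f} rows at n={n_public}")
```

The binomial standard error treats each score as an independent sample. Two submissions scored on the same rows are correlated, so the noise on their difference is smaller than this suggests; the row count on the last line is the more direct sanity check.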


In [4]:
# What's the best path forward?
print("=== STRATEGIC ANALYSIS ===")
print()
print("Best LB so far: 0.80453 (exp_003)")
print("Top LB on competition: ~0.8066")
print("Gap to close: 0.0021 (0.21 percentage points)")
print()
print("What we've learned:")
print("1. Threshold tuning hurts LB (overfits to CV)")
print("2. Native categorical handling doesn't help")
print("3. CV-LB gap is ~1.5 percentage points and increasing")
print()
print("What we haven't tried:")
print("1. Feature selection (remove low-importance features)")
print("2. Stacking with meta-learner")
print("3. Higher regularization")
print("4. Different CV scheme (e.g., more folds)")
print("5. Pseudo-labeling")

=== STRATEGIC ANALYSIS ===

Best LB so far: 0.80453 (exp_003)
Top LB on competition: ~0.8066
Gap to close: 0.0021 (0.21 percentage points)

What we've learned:
1. Threshold tuning hurts LB (overfits to CV)
2. Native categorical handling doesn't help
3. CV-LB gap is ~1.5 percentage points and increasing

What we haven't tried:
1. Feature selection (remove low-importance features)
2. Stacking with meta-learner
3. Higher regularization
4. Different CV scheme (e.g., more folds)
5. Pseudo-labeling


In [5]:
# Calculate what CV we need to beat 0.8066 LB
print("=== TARGET ANALYSIS ===")
target_lb = 0.8066

# Using different gap estimates
gaps = [0.0097, 0.0150, 0.0152]  # Historical gaps
for gap in gaps:
    needed_cv = target_lb + gap
    print(f"With gap {gap:.4f}: Need CV = {needed_cv:.5f}")

print()
print("Our best CV: 0.81951 (exp_003)")
print("With a 0.015 gap, predicted LB: 0.8045")
print("We're already close to our ceiling with current approach")
print()
print("To beat 0.8066, we need either:")
print("1. CV > 0.8216 (unlikely with current features)")
print("2. Reduce CV-LB gap (better generalization)")
print("3. Find new signal (new features or data)")

=== TARGET ANALYSIS ===
With gap 0.0097: Need CV = 0.81630
With gap 0.0150: Need CV = 0.82160
With gap 0.0152: Need CV = 0.82180

Our best CV: 0.81951 (exp_003)
With a 0.015 gap, predicted LB: 0.8045
We're already close to our ceiling with current approach

To beat 0.8066, we need either:
1. CV > 0.8216 (unlikely with current features)
2. Reduce CV-LB gap (better generalization)
3. Find new signal (new features or data)
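
The constant-gap estimate above implicitly assumes LB moves one-for-one with CV. An alternative sketch is to fit a least-squares line LB = slope * CV + intercept through the three submissions and invert it for the target. The helper `fit_line` is illustrative, and with only three points (two of them nearly coincident in CV) the fit is little more than a two-point line, so the result is a rough sanity check rather than a forecast.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# CV and LB scores from the submission history above.
cv = [0.80674, 0.81951, 0.81928]
lb = [0.79705, 0.80453, 0.80406]

slope, intercept = fit_line(cv, lb)
target_lb = 0.8066
needed_cv = (target_lb - intercept) / slope  # invert the fitted line

print(f"Fitted LB = {slope:.3f} * CV + {intercept:.3f}")
print(f"CV needed for LB {target_lb}: {needed_cv:.5f}")
```

A fitted slope below 1 would mean each point of CV improvement has so far translated into less than a point of LB improvement, in which case the required CV is higher than the constant-gap estimates suggest.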


In [6]:
# The target of 0.9642 is IMPOSSIBLE
print("=== REALITY CHECK ===")
print(f"Target score: 0.9642")
print(f"Top LB score: ~0.8066")
print(f"Gap: {0.9642 - 0.8066:.4f} (15.76 percentage points!)")
print()
print("The target of 0.9642 is IMPOSSIBLE for this competition.")
print("The best we can realistically achieve is ~0.81 LB.")
print()
print("Our best LB (0.80453) is already in top 10% territory.")
print("Focus should be on incremental improvements, not chasing an impossible target.")

=== REALITY CHECK ===
Target score: 0.9642
Top LB score: ~0.8066
Gap: 0.1576 (15.76 percentage points!)

The target of 0.9642 is IMPOSSIBLE for this competition.
The best we can realistically achieve is ~0.81 LB.

Our best LB (0.80453) is already in top 10% territory.
Focus should be on incremental improvements, not chasing an impossible target.
