# Loop 3 Analysis: CatBoost vs Ensemble Performance

## Key Finding from Evaluator
The evaluator correctly identified that CatBoost alone (CV 0.81836) outperforms the simple averaging ensemble (CV 0.81353) by 0.00483, or about 0.48 percentage points of accuracy.

## Analysis Goals
1. Verify CatBoost superiority
2. Explore weighted ensemble options
3. Analyze threshold tuning potential
4. Plan next steps

In [1]:
# Analyze the experiment results
import pandas as pd
import numpy as np

# Results from exp_002 (3-model ensemble)
results = {
    'Model': ['XGBoost', 'LightGBM', 'CatBoost', 'Simple Avg Ensemble'],
    'CV_Score': [0.80927, 0.80743, 0.81836, 0.81353],
    'CV_Std': [0.00656, 0.00612, 0.00431, np.nan]
}

df = pd.DataFrame(results)
df['Rank'] = df['CV_Score'].rank(ascending=False)
print("Model Performance Comparison:")
print(df.to_string(index=False))
print(f"\nCatBoost vs Ensemble: {0.81836 - 0.81353:.5f} = +0.48% improvement")
print(f"CatBoost vs XGBoost: {0.81836 - 0.80927:.5f} = +0.91% improvement")

Model Performance Comparison:
              Model  CV_Score  CV_Std  Rank
            XGBoost   0.80927 0.00656   3.0
           LightGBM   0.80743 0.00612   4.0
           CatBoost   0.81836 0.00431   1.0
Simple Avg Ensemble   0.81353     NaN   2.0

CatBoost vs Ensemble: 0.00483 = +0.48% improvement
CatBoost vs XGBoost: 0.00909 = +0.91% improvement
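
A rough way to check goal 1 is to express each gap in units of the fold-to-fold standard deviation reported in the table. This is only a sketch, not a significance test: a paired comparison on per-fold scores would be more reliable, but those are not listed here. The helper name `gap_in_stds` is hypothetical.

```python
import math

# CV means and fold standard deviations copied from the table above
cv_mean = {"XGBoost": 0.80927, "LightGBM": 0.80743, "CatBoost": 0.81836}
cv_std = {"XGBoost": 0.00656, "LightGBM": 0.00612, "CatBoost": 0.00431}

def gap_in_stds(best, other):
    """Gap between two CV means divided by the pooled fold std (hypothetical helper)."""
    pooled = math.sqrt((cv_std[best] ** 2 + cv_std[other] ** 2) / 2)
    return (cv_mean[best] - cv_mean[other]) / pooled

ratios = {name: gap_in_stds("CatBoost", name) for name in ("XGBoost", "LightGBM")}
for name, ratio in ratios.items():
    print(f"CatBoost vs {name}: gap = {ratio:.2f} pooled fold stds")
```

A ratio well above 1 suggests the lead is larger than typical fold noise; a ratio near 1 would call for the per-fold comparison before drawing conclusions.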


In [2]:
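# Why is simple averaging worse than CatBoost?
# When one model is clearly better, equal weighting pulls the blend toward the weaker models.

# Simulate different weighting schemes
def simulate_weighted_ensemble(weights, scores):
    """Approximate an ensemble score as the weighted average of model scores."""
    # Rough approximation only: a weighted average of scores can never exceed
    # the best single score, whereas a real blend of OOF predictions can
    # (the actual simple average scored 0.81353, above the 0.81169 this formula
    # gives for equal weights).
    return sum(w * s for w, s in zip(weights, scores))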

xgb_score = 0.80927
lgb_score = 0.80743
cat_score = 0.81836

print("Simulated Weighted Ensemble Scores (approximation):")
print("="*60)

# Different weighting schemes; weights are ordered (XGBoost, LightGBM, CatBoost)
schemes = [
    ('Equal (1/3, 1/3, 1/3)', [1/3, 1/3, 1/3]),
    ('CatBoost heavy (0.2, 0.2, 0.6)', [0.2, 0.2, 0.6]),
    ('CatBoost heavier (0.15, 0.15, 0.7)', [0.15, 0.15, 0.7]),
    ('CatBoost only (0, 0, 1)', [0, 0, 1]),
    ('Drop LightGBM, XGB+Cat only (0.3, 0, 0.7)', [0.3, 0, 0.7]),
]

for name, weights in schemes:
    score = simulate_weighted_ensemble(weights, [xgb_score, lgb_score, cat_score])
    print(f"{name}: {score:.5f}")

Simulated Weighted Ensemble Scores (approximation):
Equal (1/3, 1/3, 1/3): 0.81169
CatBoost heavy (0.2, 0.2, 0.6): 0.81436
CatBoost heavier (0.15, 0.15, 0.7): 0.81536
CatBoost only (0, 0, 1): 0.81836
Drop LightGBM, XGB+Cat only (0.3, 0, 0.7): 0.81563
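
The simulation above averages scores, not predictions, so it cannot show whether a weighted blend actually helps. A minimal sketch of the real procedure, blending out-of-fold (OOF) probabilities and scoring the result, might look like the following. The OOF arrays here are synthetic stand-ins generated for illustration (the real ones are not part of this notebook), so the printed numbers are not experiment results; `fake_oof`, `blend`, and `accuracy` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): in a real run these would be the saved
# out-of-fold probabilities of each model and the training labels.
n = 2000
y = rng.integers(0, 2, size=n)
shared = y + rng.normal(0, 0.9, size=n)  # noisy signal shared by all models

def fake_oof(noise):
    """Hypothetical helper: probabilities from the shared signal plus model noise."""
    raw = shared + rng.normal(0, noise, size=n)
    return 1 / (1 + np.exp(-2 * (raw - 0.5)))

oof = {"xgb": fake_oof(0.6), "lgb": fake_oof(0.7), "cat": fake_oof(0.4)}

def accuracy(probs, labels, threshold=0.5):
    return float(np.mean((probs >= threshold) == labels))

def blend(weights):
    # Weights follow the dict order: (xgb, lgb, cat)
    return sum(w * oof[name] for w, name in zip(weights, oof))

# All weight triples in steps of 0.1 that sum to 1
grid = [(a / 10, b / 10, (10 - a - b) / 10) for a in range(11) for b in range(11 - a)]
scores = {w: accuracy(blend(w), y) for w in grid}
best_weights = max(scores, key=scores.get)

print(f"CatBoost only: {scores[(0.0, 0.0, 1.0)]:.4f}")
print(f"Best blend {best_weights}: {scores[best_weights]:.4f}")
```

Because the grid includes the CatBoost-only corner, the best blend can never score below CatBoost alone on these OOF predictions. Selecting weights on the same data used to score them is optimistic, so any gain should be confirmed on held-out folds or the LB.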


In [3]:
# CV-LB Gap Analysis
print("CV-LB Gap Analysis:")
print("="*60)

# From exp_000 submission
cv_exp000 = 0.80674
lb_exp000 = 0.79705
gap = cv_exp000 - lb_exp000

print(f"exp_000: CV={cv_exp000:.5f}, LB={lb_exp000:.5f}, Gap={gap:.5f} ({gap/cv_exp000*100:.1f}% overestimate)")

# Predicted LB scores (assumes the single exp_000 gap carries over to the other models)
print(f"\nPredicted LB scores (assuming {gap:.5f} gap):")
print(f"  CatBoost (CV=0.81836): Predicted LB ≈ {0.81836 - gap:.5f}")
print(f"  Ensemble (CV=0.81353): Predicted LB ≈ {0.81353 - gap:.5f}")
print(f"  XGBoost (CV=0.80927): Predicted LB ≈ {0.80927 - gap:.5f}")

print(f"\nTop LB scores in competition: ~0.8066")
print(f"Our CatBoost predicted LB: {0.81836 - gap:.5f}")
print(f"Difference from top: {(0.81836 - gap) - 0.8066:.5f}")

CV-LB Gap Analysis:
exp_000: CV=0.80674, LB=0.79705, Gap=0.00969 (1.2% overestimate)

Predicted LB scores (assuming 0.00969 gap):
  CatBoost (CV=0.81836): Predicted LB ≈ 0.80867
  Ensemble (CV=0.81353): Predicted LB ≈ 0.80384
  XGBoost (CV=0.80927): Predicted LB ≈ 0.79958

Top LB scores in competition: ~0.8066
Our CatBoost predicted LB: 0.80867
Difference from top: 0.00207


In [4]:
# Target Score Reality Check
print("Target Score Reality Check:")
print("="*60)
print(f"Target score: 0.9642 (96.42% accuracy)")
print(f"Top LB scores: ~0.8066 (80.66% accuracy)")
print(f"Our best CV: 0.81836 (81.84% accuracy)")
print(f"\nThe target of 0.9642 is UNREALISTIC for this competition.")
print(f"Top solutions achieve ~80.7% accuracy.")
print(f"Our CatBoost CV of 0.81836 is strong, but it is a CV score: compare the predicted LB (~0.8087) with the top LB, not the CV itself.")
print(f"\nFocus should be on:")
print(f"1. Maximizing CV score (currently 0.81836)")
print(f"2. Ensuring good CV-LB correlation")
print(f"3. Submitting best candidate to verify LB performance")

Target Score Reality Check:
Target score: 0.9642 (96.42% accuracy)
Top LB scores: ~0.8066 (80.66% accuracy)
Our best CV: 0.81836 (81.84% accuracy)

The target of 0.9642 is UNREALISTIC for this competition.
Top solutions achieve ~80.7% accuracy.
Our CatBoost CV of 0.81836 is strong, but it is a CV score: compare the predicted LB (~0.8087) with the top LB, not the CV itself.

Focus should be on:
1. Maximizing CV score (currently 0.81836)
2. Ensuring good CV-LB correlation
3. Submitting best candidate to verify LB performance


In [5]:
# Next Steps Analysis
print("Strategic Analysis - Next Steps:")
print("="*60)

print("\n1. IMMEDIATE WIN: Submit CatBoost-only predictions")
print("   - CatBoost CV: 0.81836 (best individual model)")
print("   - Predicted LB: ~0.8087")
print("   - No additional work needed - just use CatBoost predictions")

print("\n2. POTENTIAL IMPROVEMENTS:")
print("   a) Tune CatBoost hyperparameters (current: depth=6, lr=0.05)")
print("      - Try Optuna optimization for CatBoost specifically")
print("      - Potential gain: 0.2-0.5%")
print("   b) Weighted ensemble (0.7 CatBoost + 0.15 XGB + 0.15 LGB)")
print("      - May not beat CatBoost alone given performance gap")
print("   c) Threshold tuning")
print("      - Current: 0.5 default")
print("      - Optimize to match training distribution (~50.4%)")
print("   d) Additional feature engineering")
print("      - Name-based features (surname clustering)")
print("      - More interaction terms")

print("\n3. WHAT NOT TO TRY:")
print("   - Simple averaging ensemble (already proven worse)")
print("   - LightGBM focus (underperforms both XGB and CatBoost)")
print("   - Neural networks (unlikely to beat GBMs on this tabular data)")

Strategic Analysis - Next Steps:

1. IMMEDIATE WIN: Submit CatBoost-only predictions
   - CatBoost CV: 0.81836 (best individual model)
   - Predicted LB: ~0.8087
   - No additional work needed - just use CatBoost predictions

2. POTENTIAL IMPROVEMENTS:
   a) Tune CatBoost hyperparameters (current: depth=6, lr=0.05)
      - Try Optuna optimization for CatBoost specifically
      - Potential gain: 0.2-0.5%
   b) Weighted ensemble (0.7 CatBoost + 0.15 XGB + 0.15 LGB)
      - May not beat CatBoost alone given performance gap
   c) Threshold tuning
      - Current: 0.5 default
      - Optimize to match training distribution (~50.4%)
   d) Additional feature engineering
      - Name-based features (surname clustering)
      - More interaction terms

3. WHAT NOT TO TRY:
   - Simple averaging ensemble (already proven worse)
   - LightGBM focus (underperforms both XGB and CatBoost)
   - Neural networks (unlikely to beat GBMs on this tabular data)
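
Item 2c (threshold tuning) can be sketched in two ways: pick the threshold at which the predicted positive rate matches the training positive rate (~50.4% per the notes above), or pick the threshold that maximizes OOF accuracy. The probabilities and labels below are synthetic stand-ins, since the real OOF predictions are not part of this notebook, so the resulting thresholds are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins (assumption): replace with real OOF probabilities and labels.
n = 5000
y = (rng.random(n) < 0.504).astype(int)
probs = np.clip(0.45 * y + 0.25 + rng.normal(0, 0.2, size=n), 0, 1)

target_rate = 0.504  # training positive rate quoted in the notes above

# Option 1: threshold at which the predicted positive rate matches target_rate
rate_threshold = float(np.quantile(probs, 1 - target_rate))

# Option 2: threshold that maximizes OOF accuracy over a coarse grid
candidates = np.linspace(0.3, 0.7, 41)
accs = [np.mean((probs >= t) == y) for t in candidates]
acc_threshold = float(candidates[int(np.argmax(accs))])

predicted_rate = float(np.mean(probs >= rate_threshold))
print(f"rate-matching threshold: {rate_threshold:.3f} (predicted positive rate {predicted_rate:.3f})")
print(f"accuracy-maximizing threshold: {acc_threshold:.3f}")
```

The two options can disagree. Rate matching only fixes the share of positive predictions, while the accuracy search optimizes the metric directly but can overfit to the OOF set, so either choice is worth checking against the LB.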


In [6]:
# Summary of key findings
print("\n" + "="*60)
print("SUMMARY OF KEY FINDINGS")
print("="*60)

findings = [
    "1. CatBoost (0.81836) is the BEST model, beating ensemble (0.81353) by 0.48%",
    "2. Simple averaging hurts when one model dominates",
    "3. CatBoost has lowest variance (std=0.00431) - most stable",
    "4. CV-LB gap is ~0.0097 absolute (CV overestimates LB by ~1.2% relative, from exp_000 only)",
    "5. Target of 0.9642 is unrealistic - top LB is ~0.8066",
    "6. Our CatBoost CV of 0.81836 is competitive with top solutions",
]

for f in findings:
    print(f)

print("\nRECOMMENDED ACTION:")
print("Create CatBoost-only submission and submit to verify LB performance")


SUMMARY OF KEY FINDINGS
1. CatBoost (0.81836) is the BEST model, beating ensemble (0.81353) by 0.48%
2. Simple averaging hurts when one model dominates
3. CatBoost has lowest variance (std=0.00431) - most stable
4. CV-LB gap is ~0.0097 absolute (CV overestimates LB by ~1.2% relative, from exp_000 only)
5. Target of 0.9642 is unrealistic - top LB is ~0.8066
6. Our CatBoost CV of 0.81836 is competitive with top solutions

RECOMMENDED ACTION:
Create CatBoost-only submission and submit to verify LB performance
