# Loop 8 Analysis: Understanding CV Variance and Next Steps

## Key Observations from Evaluator:
1. CV variance is high (a spread of ~0.003, i.e. ~0.3 percentage points, with the same params): exp_003 (0.81951) vs exp_007 (0.81617)
2. Feature selection HURT CV (-0.00265)
3. We're in a local optimum - last 4 experiments failed to beat exp_003
4. CV-LB gap is increasing (1.2% → 2.0%)

## Questions to Answer:
1. Is the CV variance due to random seed or something else?
2. What approaches haven't been tried?
3. What's the realistic path to beating exp_003?

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')

# Load experiment history
experiments = [
    {'id': 'exp_000', 'cv': 0.80674, 'lb': 0.7971, 'model': 'XGBoost baseline'},
    {'id': 'exp_001', 'cv': 0.80927, 'lb': None, 'model': 'XGBoost + features'},
    {'id': 'exp_002', 'cv': 0.81353, 'lb': None, 'model': '3-model ensemble'},
    {'id': 'exp_003', 'cv': 0.81951, 'lb': 0.8045, 'model': 'CatBoost Optuna'},
    {'id': 'exp_004', 'cv': 0.81928, 'lb': 0.8041, 'model': 'CatBoost + threshold'},
    {'id': 'exp_005', 'cv': 0.81744, 'lb': None, 'model': 'Stacking'},
    {'id': 'exp_006', 'cv': 0.81709, 'lb': 0.8010, 'model': 'Weighted ensemble'},
    {'id': 'exp_007', 'cv': 0.81617, 'lb': None, 'model': 'Feature selection'},
]

df = pd.DataFrame(experiments)
print("Experiment History:")
print(df.to_string(index=False))

Experiment History:
     id      cv     lb                model
exp_000 0.80674 0.7971     XGBoost baseline
exp_001 0.80927    NaN   XGBoost + features
exp_002 0.81353    NaN     3-model ensemble
exp_003 0.81951 0.8045      CatBoost Optuna
exp_004 0.81928 0.8041 CatBoost + threshold
exp_005 0.81744    NaN             Stacking
exp_006 0.81709 0.8010    Weighted ensemble
exp_007 0.81617    NaN    Feature selection


In [2]:
# Analyze CV trajectory
best = df.loc[df['cv'].idxmax()]   # row with the highest CV
latest = df.iloc[-1]               # most recent experiment
print("\n=== CV Trajectory Analysis ===")
print(f"Best CV: {best['cv']:.5f} ({best['id']})")
print(f"Latest CV: {latest['cv']:.5f} ({latest['id']})")
print(f"CV range: {df['cv'].min():.5f} - {df['cv'].max():.5f}")
# Std across different experiments, not across seeds of one configuration
print(f"CV std: {df['cv'].std():.5f}")

# Experiments run after the best one
post_best = df[df.index > best.name]
print(f"\nPost-{best['id']} experiments ({len(post_best)} total):")
print(f"  All failed to beat {best['id']}'s CV of {best['cv']:.5f}")
print(f"  CV range: {post_best['cv'].min():.5f} - {post_best['cv'].max():.5f}")


=== CV Trajectory Analysis ===
Best CV: 0.81951 (exp_003)
Latest CV: 0.81617 (exp_007)
CV range: 0.80674 - 0.81951
CV std: 0.00468

Post-exp_003 experiments (4 total):
  All failed to beat exp_003's CV of 0.81951
  CV range: 0.81617 - 0.81928
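
One way to approach question 1 (is the CV variance due to the random seed?) is to rerun a single configuration under several seeds and compare the spread of those scores with the gap between experiments. The sketch below assumes a hypothetical `run_cv(seed)` callable that returns one CV accuracy per seed. The stand-in `fake_run_cv` only draws noisy numbers around a made-up score; it is not a real model, and its output says nothing about the actual experiments.

```python
import random
import statistics

def seed_spread(run_cv, seeds):
    """Run the same configuration once per seed and summarise the CV scores."""
    scores = [run_cv(seed) for seed in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample std across seeds
    }

def exceeds_noise(delta, std, k=2.0):
    """Treat a CV difference as real only if it is larger than k seed-stds."""
    return abs(delta) > k * std

# Hypothetical stand-in for a real training run: made-up score plus seed noise.
def fake_run_cv(seed):
    rng = random.Random(seed)
    return 0.818 + rng.gauss(0.0, 0.0015)

summary = seed_spread(fake_run_cv, seeds=[42, 123, 456, 789, 1000])
delta = 0.81951 - 0.81617  # exp_003 vs exp_007 from the experiment history
print(f"mean={summary['mean']:.5f} std={summary['std']:.5f}")
print("gap exceeds seed noise:", exceeds_noise(delta, summary["std"]))
```

Replacing `fake_run_cv` with a function that reruns the real pipeline would show whether the exp_003 vs exp_007 gap is larger than what seeds alone produce. The factor `k=2.0` is an arbitrary choice, not a tuned value.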


In [3]:
# CV-LB gap analysis for submitted experiments
submitted = df[df['lb'].notna()]
print("\n=== CV-LB Gap Analysis ===")
for _, row in submitted.iterrows():
    gap = row['cv'] - row['lb']
    gap_pct = gap / row['cv'] * 100
    print(f"{row['id']}: CV={row['cv']:.5f}, LB={row['lb']:.5f}, Gap={gap:.5f} ({gap_pct:.2f}%)")

# Linear regression to predict LB from CV
# (only 4 submitted points, so treat the fit as a rough guide)
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(submitted['cv'], submitted['lb'])
print(f"\nLinear model: LB = {slope:.3f} * CV + {intercept:.3f}")
print(f"R² = {r_value**2:.3f}")
print(f"\nTo beat LB 0.8045, need CV > {(0.8045 - intercept) / slope:.5f}")


=== CV-LB Gap Analysis ===
exp_000: CV=0.80674, LB=0.79710, Gap=0.00964 (1.19%)
exp_003: CV=0.81951, LB=0.80450, Gap=0.01501 (1.83%)
exp_004: CV=0.81928, LB=0.80410, Gap=0.01518 (1.85%)
exp_006: CV=0.81709, LB=0.80100, Gap=0.01609 (1.97%)

Linear model: LB = 0.543 * CV + 0.359
R² = 0.917

To beat LB 0.8045, need CV > 0.82086
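
Because the fit rests on only four submissions, it may be worth checking how much the required CV moves when one point is left out. The sketch below refits the line with `np.polyfit` (swapped in for `stats.linregress`; both do ordinary least squares for a straight line) on the same four CV/LB pairs. The helper name `cv_needed` is illustrative.

```python
import numpy as np

# Submitted experiments from the experiment history: (id, CV, LB).
points = [
    ("exp_000", 0.80674, 0.7971),
    ("exp_003", 0.81951, 0.8045),
    ("exp_004", 0.81928, 0.8041),
    ("exp_006", 0.81709, 0.8010),
]
cv = np.array([p[1] for p in points])
lb = np.array([p[2] for p in points])

def cv_needed(cv_scores, lb_scores, target_lb=0.8045):
    """Fit LB = slope * CV + intercept and invert it for the target LB."""
    slope, intercept = np.polyfit(cv_scores, lb_scores, 1)
    return (target_lb - intercept) / slope

full_threshold = cv_needed(cv, lb)
print(f"all points: need CV > {full_threshold:.5f}")

# Refit with each point left out to see how much the threshold moves.
for i, (exp_id, _, _) in enumerate(points):
    keep = np.arange(len(points)) != i
    print(f"without {exp_id}: need CV > {cv_needed(cv[keep], lb[keep]):.5f}")
```

If the leave-one-out thresholds differ widely, the single-number CV target is better read as a rough range than as a hard cutoff.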


In [4]:
# What approaches haven't been tried?
print("\n=== Untried Approaches ===")
untried = [
    "1. Multi-seed CatBoost ensemble (average 5 seeds)",
    "2. Target encoding for categorical features",
    "3. Stronger regularization (depth=6, l2=5.0, subsample=0.8)",
    "4. Neural network (for diversity in ensemble)",
    "5. KNN imputation (mentioned in top solutions)",
    "6. Additional interaction features (Age × Spending, Deck × HomePlanet)",
    "7. Pseudo-labeling (use confident test predictions)",
]
for approach in untried:
    print(f"  {approach}")


=== Untried Approaches ===
  1. Multi-seed CatBoost ensemble (average 5 seeds)
  2. Target encoding for categorical features
  3. Stronger regularization (depth=6, l2=5.0, subsample=0.8)
  4. Neural network (for diversity in ensemble)
  5. KNN imputation (mentioned in top solutions)
  6. Additional interaction features (Age × Spending, Deck × HomePlanet)
  7. Pseudo-labeling (use confident test predictions)
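
As an illustration of item 7, pseudo-labeling usually starts by picking the test rows the current model is most sure about. This is a minimal sketch, assuming positive-class probabilities are already available; the function name `select_pseudo_labels`, the 0.95 cutoff, and the sample probabilities are all made up for the example.

```python
import numpy as np

def select_pseudo_labels(test_probs, confidence=0.95):
    """Return indices and hard labels for test rows predicted with high confidence."""
    test_probs = np.asarray(test_probs, dtype=float)
    # Confident in either direction: very likely positive or very likely negative.
    confident = (test_probs >= confidence) | (test_probs <= 1.0 - confidence)
    idx = np.flatnonzero(confident)
    labels = (test_probs[idx] >= 0.5).astype(int)
    return idx, labels

# Made-up probabilities for six test rows.
probs = [0.99, 0.60, 0.03, 0.50, 0.97, 0.20]
idx, labels = select_pseudo_labels(probs)
print(idx, labels)  # rows 0, 2, 4 with labels 1, 0, 1
```

The selected rows would then be appended to the training folds only, never to the validation folds, so that CV stays an honest estimate.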


In [5]:
# Evaluator's recommendation: Multi-seed ensemble
print("\n=== Evaluator's Top Priority: Multi-Seed CatBoost ===")
print("""
Rationale:
1. CV variance (~0.3%) is a significant problem
2. Averaging multiple seeds will give more stable estimates
3. Multi-seed ensembling reduces variance without changing bias
4. Quick to implement - just train same model with 5 different seeds

Implementation:
- Train CatBoost with seeds [42, 123, 456, 789, 1000]
- Average predictions across all seeds
- This should reduce variance and potentially improve both CV and LB
""")


=== Evaluator's Top Priority: Multi-Seed CatBoost ===

Rationale:
1. CV variance (~0.3%) is a significant problem
2. Averaging multiple seeds will give more stable estimates
3. Multi-seed ensembling reduces variance without changing bias
4. Quick to implement - just train same model with 5 different seeds

Implementation:
- Train CatBoost with seeds [42, 123, 456, 789, 1000]
- Average predictions across all seeds
- This should reduce variance and potentially improve both CV and LB
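
The implementation steps above can be sketched as follows. To keep the example self-contained it does not import CatBoost: `fit_predict` is a hypothetical callable that trains one model for a given seed and returns positive-class probabilities, and `fake_fit_predict` is a noisy stand-in with made-up values. The commented lines show roughly what a CatBoost version might look like, assuming `params`, `X_train`, `y_train`, and `X_test` exist.

```python
import numpy as np

def multi_seed_predict(fit_predict, seeds, threshold=0.5):
    """Average positive-class probabilities over seeds, then threshold once."""
    probs = np.mean([fit_predict(seed) for seed in seeds], axis=0)
    return probs, (probs >= threshold).astype(int)

# Hypothetical stand-in for a real model. With CatBoost this would roughly be:
#   model = CatBoostClassifier(**params, random_seed=seed, verbose=0)
#   model.fit(X_train, y_train)
#   return model.predict_proba(X_test)[:, 1]
def fake_fit_predict(seed):
    rng = np.random.default_rng(seed)
    base = np.array([0.10, 0.45, 0.55, 0.90])  # made-up probabilities
    return np.clip(base + rng.normal(0.0, 0.05, size=base.shape), 0.0, 1.0)

seeds = [42, 123, 456, 789, 1000]
avg_probs, labels = multi_seed_predict(fake_fit_predict, seeds)
print(avg_probs.round(3), labels)
```

Averaging probabilities and thresholding once, rather than majority-voting hard labels, keeps the information from rows that sit near 0.5, which is where seed-to-seed disagreement tends to concentrate.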



In [6]:
# Alternative: Target encoding
print("\n=== Alternative: Target Encoding ===")
print("""
Rationale:
1. Currently using label encoding for categoricals
2. Target encoding captures category-target relationships
3. Top solutions mention target encoding as effective
4. Must use proper CV to avoid leakage

Categorical features to encode:
- HomePlanet (3 values)
- Destination (3 values)
- Deck (8 values)
- Side (2 values)
- CryoSleep (2 values)
- VIP (2 values)
- Interaction features (CryoSleep_HomePlanet, etc.)
""")


=== Alternative: Target Encoding ===

Rationale:
1. Currently using label encoding for categoricals
2. Target encoding captures category-target relationships
3. Top solutions mention target encoding as effective
4. Must use proper CV to avoid leakage

Categorical features to encode:
- HomePlanet (3 values)
- Destination (3 values)
- Deck (8 values)
- Side (2 values)
- CryoSleep (2 values)
- VIP (2 values)
- Interaction features (CryoSleep_HomePlanet, etc.)
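
Point 4 (proper CV to avoid leakage) is the part that is easiest to get wrong, so here is a minimal sketch of out-of-fold target encoding with smoothing. It assumes a binary target column, named `Transported` here as a guess at the competition's label; the helper name, the smoothing value, and the toy rows are illustrative, not taken from the experiments.

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=42):
    """Encode `col` with smoothed target means computed on the other folds only."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(df)), n_splits)
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for val_idx in folds:
        train = df.drop(df.index[val_idx])  # rows outside the current fold
        prior = train[target].mean()
        stats = train.groupby(col)[target].agg(["sum", "count"])
        # Shrink small categories toward the training folds' overall mean.
        smooth = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the training folds fall back to the prior.
        values = df[col].iloc[val_idx].map(smooth).fillna(prior)
        encoded.iloc[val_idx] = values.to_numpy()
    return encoded

# Tiny made-up frame; only the column name HomePlanet comes from the list above.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth", "Europa", "Mars"] * 5,
    "Transported": [0, 1, 1, 0, 1, 0] * 5,
})
toy["HomePlanet_te"] = oof_target_encode(toy, "HomePlanet", "Transported")
print(toy.groupby("HomePlanet")["HomePlanet_te"].mean())
```

Each row's encoding is computed without its own target value, which is what prevents leakage. Test rows would instead be encoded with statistics from the full training set.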



In [7]:
# The target score is unrealistic
print("\n=== Reality Check: Target Score ===")
print(f"Target: 0.9642")
print(f"Top LB: ~0.8066")
print(f"Our best LB: 0.8045")
print(f"Gap to top: {0.8066 - 0.8045:.4f} (0.26%)")
print(f"Gap to target: {0.9642 - 0.8045:.4f} (19.8%)")
print("\nThe target of 0.9642 is IMPOSSIBLE.")
print("The realistic goal is to improve toward 0.8066 (top LB).")
print("Our best LB of 0.8045 is already in the top ~7%.")


=== Reality Check: Target Score ===
Target: 0.9642
Top LB: ~0.8066
Our best LB: 0.8045
Gap to top: 0.0021 (0.26%)
Gap to target: 0.1597 (19.8%)

The target of 0.9642 is IMPOSSIBLE.
The realistic goal is to improve toward 0.8066 (top LB).
Our best LB of 0.8045 is already in the top ~7%.


In [8]:
# Summary and recommendation
print("\n" + "="*60)
print("SUMMARY AND RECOMMENDATION")
print("="*60)
print("""
1. We're in a local optimum - 4 experiments failed to beat exp_003
2. CV variance is high (~0.3%) - makes progress detection difficult
3. Feature selection HURT performance - low-importance features still contribute
4. CV-LB gap is increasing - we're overfitting to CV

RECOMMENDED APPROACH: Multi-Seed CatBoost Ensemble
- Train CatBoost with 5 different seeds
- Average predictions
- Expected benefits:
  a) More stable CV estimate
  b) Reduced variance in predictions
  c) Potentially better LB due to reduced overfitting

ALTERNATIVE: Target Encoding
- If multi-seed doesn't help, try target encoding
- This is a fundamentally different approach to feature encoding
- Could capture more signal from categorical features

SUBMISSION STRATEGY:
- 6 submissions remaining
- Only submit if CV > 0.8195 (exp_003's CV)
- Or if multi-seed ensemble shows stable improvement
""")


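============================================================
SUMMARY AND RECOMMENDATION
============================================================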

1. We're in a local optimum - 4 experiments failed to beat exp_003
2. CV variance is high (~0.3%) - makes progress detection difficult
3. Feature selection HURT performance - low-importance features still contribute
4. CV-LB gap is increasing - we're overfitting to CV

RECOMMENDED APPROACH: Multi-Seed CatBoost Ensemble
- Train CatBoost with 5 different seeds
- Average predictions
- Expected benefits:
  a) More stable CV estimate
  b) Reduced variance in predictions
  c) Potentially better LB due to reduced overfitting

ALTERNATIVE: Target Encoding
- If multi-seed doesn't help, try target encoding
- This is a fundamentally different approach to feature encoding
- Could capture more signal from categorical features

SUBMISSION STRATEGY:
- 6 submissions remaining
- Only submit if CV > 0.8195 (exp_003's CV)
- Or if multi-seed ensemble shows stable improvement
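
The submission strategy leaves "stable improvement" undefined. One possible reading, sketched below as an assumption rather than the evaluator's rule, is to submit only when the multi-seed mean CV beats exp_003's CV by more than `k` seed standard deviations. The per-seed scores in the example are made up.

```python
def should_submit(seed_scores, best_cv=0.81951, k=1.0):
    """Submit only if the multi-seed mean beats best_cv by more than k seed-stds."""
    n = len(seed_scores)
    mean = sum(seed_scores) / n
    var = sum((s - mean) ** 2 for s in seed_scores) / (n - 1)  # sample variance
    std = var ** 0.5
    return mean - k * std > best_cv, mean, std

# Made-up per-seed CV scores for a candidate experiment.
ok, mean, std = should_submit([0.8212, 0.8220, 0.8208, 0.8215, 0.8218])
print(f"submit={ok} mean={mean:.5f} std={std:.5f}")
```

A larger `k` spends the remaining submissions more conservatively, and a smaller `k` accepts more risk of submitting noise.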

