# Loop 4 Analysis: Post-Optuna Tuning Assessment

## Goals:
1. Analyze the CV-LB gap and what it means for our predictions
2. Explore threshold tuning potential
3. Assess feature selection opportunities
4. Evaluate stacking/pseudo-labeling potential

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_recall_curve, f1_score
from sklearn.model_selection import StratifiedKFold
import warnings
warnings.filterwarnings('ignore')

# Load data
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')
print(f"Train: {train.shape}, Test: {test.shape}")
print(f"Target distribution: {train['Transported'].mean():.4f} transported")

Train: (8693, 14), Test: (4277, 13)
Target distribution: 0.5036 transported


In [2]:
# Analyze CV-LB gap
print("=" * 60)
print("CV-LB GAP ANALYSIS")
print("=" * 60)

# From session state
submissions = [
    {'exp': 'exp_000', 'cv': 0.80674, 'lb': 0.79705, 'model': 'XGBoost Baseline'}
]

for sub in submissions:
    gap = sub['cv'] - sub['lb']
    gap_pct = (gap / sub['cv']) * 100
    print(f"{sub['model']}:")
    print(f"  CV: {sub['cv']:.5f}, LB: {sub['lb']:.5f}")
    print(f"  Gap: {gap:+.5f} ({gap_pct:.2f}% overestimate)")
    print()

# Predict LB for current best
cv_best = 0.81951
predicted_lb = cv_best - 0.00969  # Using the observed absolute gap from exp_000
print(f"Current best CV: {cv_best:.5f}")
print(f"Predicted LB (using 0.00969 absolute gap): {predicted_lb:.5f}")
print(f"\nTop LB scores are ~0.8066")
print(f"Our predicted LB would be competitive!")
print(f"\nNote: Target of 0.9642 is UNREALISTIC - impossible to achieve")

CV-LB GAP ANALYSIS
XGBoost Baseline:
  CV: 0.80674, LB: 0.79705
  Gap: +0.00969 (1.20% overestimate)

Current best CV: 0.81951
Predicted LB (using 0.00969 absolute gap): 0.80982

Top LB scores are ~0.8066
Our predicted LB would be competitive!

Note: Target of 0.9642 is UNREALISTIC - impossible to achieve
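
The extrapolation above rests on a single submission, and the gap can be read two ways: as an absolute difference in accuracy or as a percentage of the CV score. A minimal sketch of that arithmetic follows. The helper name `cv_lb_gap` is hypothetical, and carrying the same absolute gap over to the tuned model is an assumption, not something one data point can confirm.

```python
def cv_lb_gap(cv, lb):
    """Return (absolute gap, gap as a percentage of the CV score)."""
    abs_gap = cv - lb
    return abs_gap, abs_gap / cv * 100

# Scores of exp_000 as recorded in this notebook.
abs_gap, rel_gap_pct = cv_lb_gap(cv=0.80674, lb=0.79705)

# Assumption: the tuned model loses the same absolute amount on the LB.
predicted_lb = 0.81951 - abs_gap

print(f"Absolute gap: {abs_gap:.5f} ({abs_gap * 100:.2f} accuracy points)")
print(f"Relative gap: {rel_gap_pct:.2f}% of CV")
print(f"Predicted LB: {predicted_lb:.5f}")
```

Each further submission adds another (CV, LB) pair, which would show whether the gap is stable or model-dependent.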


In [3]:
# Analyze threshold tuning potential
# Check whether any experiment directory contains saved OOF predictions
import os

# Check what files we have
exp_dirs = ['experiments/001_baseline', 'experiments/002_feature_engineering', 
            'experiments/003_ensemble', 'experiments/004_catboost_only']

for exp_dir in exp_dirs:
    full_path = f'/home/code/{exp_dir}'
    if os.path.exists(full_path):
        files = os.listdir(full_path)
        print(f"{exp_dir}: {files}")

experiments/001_baseline: ['baseline.ipynb']
experiments/002_feature_engineering: ['feature_eng.ipynb']
experiments/003_ensemble: ['ensemble.ipynb']
experiments/004_catboost_only: ['catboost_tuned.ipynb']


In [4]:
# Since we don't have saved OOF predictions, let's analyze threshold tuning theoretically
# The target distribution is ~50.4% transported

target_rate = train['Transported'].mean()
print(f"Training target rate: {target_rate:.4f}")
print(f"\nThreshold tuning analysis:")
print(f"- Default threshold: 0.5")
print(f"- If model is well-calibrated, threshold ~0.5 should be optimal")
print(f"- If model overestimates probabilities, higher threshold helps")
print(f"- If model underestimates, lower threshold helps")
print(f"\nTo properly tune threshold, we need OOF predictions.")
print(f"The evaluator suggests this could give 0.1-0.3% improvement.")

Training target rate: 0.5036

Threshold tuning analysis:
- Default threshold: 0.5
- If model is well-calibrated, threshold ~0.5 should be optimal
- If model overestimates probabilities, higher threshold helps
- If model underestimates, lower threshold helps

To properly tune threshold, we need OOF predictions.
The evaluator suggests this could give 0.1-0.3% improvement.
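
Once OOF predictions are saved, the threshold can be scanned directly instead of reasoned about. The sketch below shows one way to do that on synthetic data. The `best_threshold` helper, the 0.30-0.70 grid, and the simulated probabilities are all assumptions for illustration; no real OOF predictions from the experiments are used.

```python
import numpy as np

def best_threshold(y_true, oof_proba, grid=None):
    """Scan candidate thresholds; return (threshold, accuracy) with the best OOF accuracy."""
    if grid is None:
        grid = np.linspace(0.30, 0.70, 41)  # hypothetical search range, step 0.01
    y_true = np.asarray(y_true).astype(bool)
    oof_proba = np.asarray(oof_proba)
    accs = np.array([np.mean((oof_proba >= t) == y_true) for t in grid])
    best = int(np.argmax(accs))
    return float(grid[best]), float(accs[best])

# Synthetic stand-in for OOF predictions (none were saved by the experiments).
rng = np.random.default_rng(0)
y = rng.random(5000) < 0.5
# Informative scores shifted upward by 0.10, so they run too high on average.
proba = np.clip(0.35 + 0.30 * y + rng.normal(0, 0.15, size=y.size) + 0.10, 0, 1)

thr, acc = best_threshold(y, proba)
acc_default = float(np.mean((proba >= 0.5) == y))
print(f"Best threshold: {thr:.2f}, accuracy {acc:.4f} (vs {acc_default:.4f} at 0.50)")
```

Choosing and scoring the threshold on the same OOF predictions is slightly optimistic, so a gain of the size the evaluator mentions should be confirmed on held-out folds or on the LB.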


In [5]:
# Analyze what approaches haven't been tried
print("=" * 60)
print("UNEXPLORED APPROACHES")
print("=" * 60)

approaches = {
    'Threshold tuning': 'NOT TRIED - Quick win, needs OOF predictions',
    'Feature selection': 'NOT TRIED - 56 features may include noise',
    'Stacking': 'NOT TRIED - Use OOF predictions as meta-features',
    'Pseudo-labeling': 'NOT TRIED - Use high-confidence test predictions',
    'Nested CV': 'NOT TRIED - More robust hyperparameter tuning',
    'Different CV scheme': 'NOT TRIED - GroupKFold based on passenger groups',
    'Target encoding': 'NOT TRIED - For high-cardinality categoricals',
    'Name features': 'NOT TRIED - Surname clustering, family size',
    'Neural network': 'LOW PRIORITY - GBMs typically better for small tabular data'
}

for approach, status in approaches.items():
    print(f"- {approach}: {status}")

UNEXPLORED APPROACHES
- Threshold tuning: NOT TRIED - Quick win, needs OOF predictions
- Feature selection: NOT TRIED - 56 features may include noise
- Stacking: NOT TRIED - Use OOF predictions as meta-features
- Pseudo-labeling: NOT TRIED - Use high-confidence test predictions
- Nested CV: NOT TRIED - More robust hyperparameter tuning
- Different CV scheme: NOT TRIED - GroupKFold based on passenger groups
- Target encoding: NOT TRIED - For high-cardinality categoricals
- Name features: NOT TRIED - Surname clustering, family size
- Neural network: LOW PRIORITY - GBMs typically better for small tabular data
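
For the "Different CV scheme" item, a group-aware split keeps all members of a passenger group in the same fold. The sketch below uses scikit-learn's `GroupKFold` on a toy frame. It assumes `PassengerId` has the form `gggg_pp`, with the part before the underscore identifying the group. That format is not stated in this notebook and should be checked against the data first.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy frame standing in for `train`.
# Assumption: PassengerId is "gggg_pp", where "gggg" identifies the travel group.
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01", "0003_02",
                    "0004_01", "0005_01", "0005_02", "0006_01", "0006_02"],
    "Transported": [0, 1, 1, 0, 1, 0, 1, 0, 1, 0],
})
groups = toy["PassengerId"].str.split("_").str[0]

gkf = GroupKFold(n_splits=3)
overlaps = []
for fold, (tr_idx, va_idx) in enumerate(gkf.split(toy, toy["Transported"], groups)):
    # Groups present on both sides of the split; should always be empty.
    shared = set(groups.iloc[tr_idx]) & set(groups.iloc[va_idx])
    overlaps.append(len(shared))
    print(f"Fold {fold}: {len(va_idx)} validation rows, shared groups: {len(shared)}")
```

If group members share an outcome, a stratified split that ignores groups can leak information across folds, which is one possible contributor to a CV-LB gap.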


In [6]:
# Analyze the variance concern from evaluator
print("=" * 60)
print("VARIANCE ANALYSIS")
print("=" * 60)

models = {
    'CatBoost Baseline': {'cv': 0.81836, 'std': 0.00431},
    'CatBoost Tuned': {'cv': 0.81951, 'std': 0.00685}
}

for name, stats in models.items():
    print(f"{name}:")
    print(f"  CV: {stats['cv']:.5f} (+/- {stats['std']:.5f})")
    # Spread of individual fold scores, not a confidence interval for the mean
    print(f"  Mean +/- 1.96 std (fold scores): [{stats['cv'] - 1.96*stats['std']:.5f}, {stats['cv'] + 1.96*stats['std']:.5f}]")
    print()

print("Evaluator concern: Tuned model has 59% higher fold std (about 2.5x the variance)")
print("This suggests tuned model may be less stable on unseen data")
print("\nOptions:")
print("1. Submit tuned model to verify LB performance")
print("2. Average baseline and tuned predictions for stability")
print("3. Use baseline model (lower variance, more stable)")

VARIANCE ANALYSIS
CatBoost Baseline:
  CV: 0.81836 (+/- 0.00431)
  Mean +/- 1.96 std (fold scores): [0.80991, 0.82681]

CatBoost Tuned:
  CV: 0.81951 (+/- 0.00685)
  Mean +/- 1.96 std (fold scores): [0.80608, 0.83294]

Evaluator concern: Tuned model has 59% higher fold std (about 2.5x the variance)
This suggests tuned model may be less stable on unseen data

Options:
1. Submit tuned model to verify LB performance
2. Average baseline and tuned predictions for stability
3. Use baseline model (lower variance, more stable)
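
The 59% figure is a ratio of fold standard deviations. The short sketch below reproduces it from the numbers above, expresses the same comparison as a variance ratio, and sets the tuning gain against the fold-to-fold noise. Variable names are illustrative.

```python
# Fold standard deviations and mean CV scores reported above.
baseline_cv, baseline_std = 0.81836, 0.00431
tuned_cv, tuned_std = 0.81951, 0.00685

std_ratio = tuned_std / baseline_std
var_ratio = std_ratio ** 2  # variance is the square of the standard deviation

print(f"Std ratio:      {std_ratio:.2f}x ({(std_ratio - 1) * 100:.0f}% higher)")
print(f"Variance ratio: {var_ratio:.2f}x ({(var_ratio - 1) * 100:.0f}% higher)")

# Size of the tuning gain relative to the tuned model's fold-to-fold spread.
cv_gain = tuned_cv - baseline_cv
print(f"CV gain: {cv_gain:.5f} = {cv_gain / tuned_std:.2f} x tuned fold std")
```

A gain that is a small fraction of one fold standard deviation is hard to distinguish from noise, which supports verifying the tuned model on the LB before relying on it.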


In [7]:
# Strategic assessment
print("=" * 60)
print("STRATEGIC ASSESSMENT")
print("=" * 60)

print("\n1. TARGET SCORE REALITY:")
print(f"   - Target: 0.9642 is IMPOSSIBLE")
print(f"   - Top LB: ~0.8066 (80.7%)")
print(f"   - Our best CV: 0.81951")
print(f"   - Predicted LB: ~0.8098")
print(f"   - We are likely in TOP 5% territory!")

print("\n2. SUBMISSION STRATEGY:")
print(f"   - Used: 1/10 submissions")
print(f"   - Remaining: 9 submissions")
print(f"   - Should submit to calibrate CV-LB gap")

print("\n3. NEXT PRIORITIES:")
print(f"   a) Submit exp_003 (tuned CatBoost) to get LB feedback")
print(f"   b) Try threshold tuning on OOF predictions")
print(f"   c) Try feature selection to reduce noise")
print(f"   d) Try stacking with meta-learner")

print("\n4. DIMINISHING RETURNS:")
print(f"   - Optuna tuning gave only +0.14% improvement")
print(f"   - Further hyperparameter tuning unlikely to help much")
print(f"   - Focus should shift to feature engineering or ensembling")

STRATEGIC ASSESSMENT

1. TARGET SCORE REALITY:
   - Target: 0.9642 is IMPOSSIBLE
   - Top LB: ~0.8066 (80.7%)
   - Our best CV: 0.81951
   - Predicted LB: ~0.8098
   - We are likely in TOP 5% territory!

2. SUBMISSION STRATEGY:
   - Used: 1/10 submissions
   - Remaining: 9 submissions
   - Should submit to calibrate CV-LB gap

3. NEXT PRIORITIES:
   a) Submit exp_003 (tuned CatBoost) to get LB feedback
   b) Try threshold tuning on OOF predictions
   c) Try feature selection to reduce noise
   d) Try stacking with meta-learner

4. DIMINISHING RETURNS:
   - Optuna tuning gave only +0.14% improvement
   - Further hyperparameter tuning unlikely to help much
   - Focus should shift to feature engineering or ensembling
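
Priority (d), stacking with a meta-learner, could be sketched as follows. This is an illustration on synthetic data, with scikit-learn's `GradientBoostingClassifier` and `RandomForestClassifier` swapped in for the CatBoost and XGBoost models used in the experiments. The dataset, model choices, and parameters are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

# Synthetic binary classification data standing in for the engineered features.
X, y = make_classification(n_samples=600, n_features=12, n_informative=6, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Stand-in base models; the real pipeline would use the tuned GBMs.
base_models = {
    "gbm": GradientBoostingClassifier(random_state=0),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
}

# One column of out-of-fold positive-class probabilities per base model.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# Meta-learner trained on the OOF columns only.
meta_scores = cross_val_score(LogisticRegression(), oof, y, cv=cv, scoring="accuracy")
print(f"OOF matrix shape: {oof.shape}")
print(f"Meta-learner CV accuracy: {meta_scores.mean():.4f} (+/- {meta_scores.std():.4f})")
```

The same `oof` matrix is what threshold tuning needs, so saving OOF predictions from each experiment would serve priorities (b) and (d) together.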
