# Loop 44 Analysis: Post-Stacking Strategy

**Key Results from exp_045 (Stacking)**:
- CV: 0.010001 (22% worse than best CV 0.008194)
- Predicted LB from the linear CV-LB fit on the 13 earlier submissions: 0.0956

**Critical Question**: Does stacking have a different CV-LB relationship?

**Submission History Analysis**:
- 13 submissions made, 3 remaining
- Best LB: 0.0877 (exp_030)
- Target: 0.0347 (best LB is 2.53x the target)

In [None]:
import numpy as np
import pandas as pd
from scipy import stats

# Complete submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982, 'model': 'MLP'},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065, 'model': 'LGBM'},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972, 'model': 'MLP+DRFP'},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969, 'model': 'MLP Ensemble'},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946, 'model': 'Simple MLP'},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932, 'model': 'Even Simpler'},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936, 'model': 'Ridge'},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913, 'model': 'Simple Ensemble'},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893, 'model': 'ACS PCA'},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887, 'model': 'Weighted Loss'},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877, 'model': 'GP Ensemble'},
    {'exp': 'exp_041', 'cv': 0.0090, 'lb': 0.0932, 'model': 'Aggressive Reg'},
    {'exp': 'exp_042', 'cv': 0.0145, 'lb': 0.1147, 'model': 'Pure GP'},
]

df = pd.DataFrame(submissions)
print('All Submissions:')
print(df.to_string(index=False))
print(f'\nBest CV: {df["cv"].min():.4f} ({df.loc[df["cv"].idxmin(), "exp"]})')
print(f'Best LB: {df["lb"].min():.4f} ({df.loc[df["lb"].idxmin(), "exp"]})')

In [None]:
# Fit CV-LB relationship
cv_vals = df['cv'].values
lb_vals = df['lb'].values

slope, intercept, r_value, p_value, std_err = stats.linregress(cv_vals, lb_vals)

print(f'CV-LB Relationship: LB = {slope:.2f} × CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
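target = 0.0347  # target LB score
print(f'\nIntercept: {intercept:.4f}')
print(f'Target: {target}')
print(f'Gap: {intercept - target:.4f}')
print(f'\nTo reach target with current relationship:')
required_cv = (target - intercept) / slope
# A negative required CV means the intercept alone already exceeds the target.
status = 'IMPOSSIBLE - negative' if required_cv < 0 else 'feasible'
print(f'Required CV: {required_cv:.4f} ({status})')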
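
The conclusion that the target is unreachable rests on the fitted intercept sitting above 0.0347, so it is worth checking how much that intercept moves when any single submission is dropped. The sketch below is a leave-one-out refit written with the standard library only. The `fit_line` and `leave_one_out_fits` helpers and the `TARGET` name are introduced here for illustration and are not part of the original notebook.

```python
# CV and LB scores copied from the submission table (exp_000 ... exp_042).
cv = [0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093, 0.0092,
      0.0090, 0.0087, 0.0085, 0.0083, 0.0090, 0.0145]
lb = [0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932, 0.0936,
      0.0913, 0.0893, 0.0887, 0.0877, 0.0932, 0.1147]
TARGET = 0.0347


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_fits(xs, ys):
    """Refit the line once per submission, each time with that submission dropped."""
    return [fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
            for i in range(len(xs))]


full_slope, full_intercept = fit_line(cv, lb)
loo_intercepts = [b for _, b in leave_one_out_fits(cv, lb)]

print(f'Full fit: LB = {full_slope:.2f} x CV + {full_intercept:.4f}')
print(f'Leave-one-out intercept range: '
      f'{min(loo_intercepts):.4f} to {max(loo_intercepts):.4f}')
print(f'Intercept above target in every refit: '
      f'{all(b > TARGET for b in loo_intercepts)}')
```

If the intercept stays above the target in every refit, the "negative required CV" result is not an artifact of one high-leverage point such as exp_042.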

In [None]:
# Analyze exp_045 (Stacking)
stacking_cv = 0.010001
predicted_lb = slope * stacking_cv + intercept

print('=== exp_045 (Stacking) Analysis ===')
print(f'CV: {stacking_cv:.6f}')
print(f'Predicted LB (using relationship): {predicted_lb:.4f}')
print(f'Best LB so far: {df["lb"].min():.4f}')
# One submission cannot separate an intercept shift from a slope change,
# so the cases below only say which side of the fitted line stacking lands on.
print(f'\nIf actual LB < {predicted_lb:.4f}: Stacking sits BELOW the line, e.g. LOWER intercept (promising!)')
print(f'If actual LB ≈ {predicted_lb:.4f}: Stacking follows SAME relationship (not helpful)')
print(f'If actual LB > {predicted_lb:.4f}: Stacking sits ABOVE the line, e.g. HIGHER intercept (worse)')
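
Deciding whether an actual LB is "less than", "about equal to", or "greater than" the prediction needs a tolerance. One option is to measure the gap in units of the residual scatter of the existing 13 points. The sketch below does that with the standard library. The `classify` helper and its `k=2.0` threshold are assumptions made for illustration, and the check ignores the uncertainty in the fitted line itself, so a proper prediction interval would be somewhat wider.

```python
import math

# CV and LB scores copied from the submission table (exp_000 ... exp_042).
cv = [0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093, 0.0092,
      0.0090, 0.0087, 0.0085, 0.0083, 0.0090, 0.0145]
lb = [0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932, 0.0936,
      0.0913, 0.0893, 0.0887, 0.0877, 0.0932, 0.1147]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_std(xs, ys, slope, intercept):
    """Residual standard deviation with n - 2 degrees of freedom."""
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


def classify(actual_lb, predicted_lb, resid_std, k=2.0):
    """Label a new LB by how many residual SDs it sits from the fitted line."""
    z = (actual_lb - predicted_lb) / resid_std
    if z < -k:
        return z, 'below the line'
    if z > k:
        return z, 'above the line'
    return z, 'consistent with the line'


fit_slope, fit_intercept = fit_line(cv, lb)
sd = residual_std(cv, lb, fit_slope, fit_intercept)
predicted = fit_slope * 0.010001 + fit_intercept  # stacking CV from exp_045
best_lb = min(lb)

print(f'Residual SD of the fit: {sd:.4f}')
print(f'Predicted LB for stacking: {predicted:.4f}')
print(f'Residual needed to match best LB: {(best_lb - predicted) / sd:.1f} SDs')
```

Once the real LB is known, `classify(actual_lb, predicted, sd)` returns the standardized gap and a label. The last printed line also shows how far below the line stacking would have to land just to tie exp_030.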

In [None]:
# What approaches haven't been tried?
print('=== APPROACHES TRIED ===')
print('1. MLP (various architectures) - ALL ON SAME LINE')
print('2. LightGBM - ON SAME LINE')
print('3. Ridge Regression - ON SAME LINE')
print('4. Gaussian Process - ON SAME LINE (exp_042)')
print('5. k-NN - ON SAME LINE (worse CV)')
print('6. Simple Ensemble (averaging) - ON SAME LINE')
print('7. Aggressive Regularization - ON SAME LINE (exp_041)')
print('8. Stacking (exp_045) - CV 0.010001, LB UNKNOWN')

print('\n=== APPROACHES NOT TRIED ===')
print('1. XGBoost (different from LightGBM)')
print('2. CatBoost')
print('3. Neural Network with different architecture (Transformer, Attention)')
print('4. Domain Adaptation (adversarial training)')
print('5. Sample Weighting (based on similarity to test)')
print('6. Different Feature Engineering (fragprints, MACCS keys)')
print('7. Pre-training on related data')
print('8. Bayesian Neural Network')

In [None]:
# Analyze the LB/CV ratio trend
df['lb_cv_ratio'] = df['lb'] / df['cv']
df_sorted = df.sort_values('cv')

print('LB/CV ratio by CV (sorted):')
print(df_sorted[['exp', 'cv', 'lb', 'lb_cv_ratio', 'model']].to_string(index=False))

print(f'\nKey Observation:')
print(f'As CV improves (decreases), the ratio INCREASES.')
print(f'This is because LB = slope × CV + intercept, where intercept > 0.')
print(f'When CV is small, the intercept dominates, making ratio = LB/CV larger.')
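
The observation about the ratio follows from dividing the fitted line by CV. Writing $a$ for the slope and $b$ for the intercept (symbols introduced here for brevity):

$$
\frac{\mathrm{LB}}{\mathrm{CV}} = \frac{a\,\mathrm{CV} + b}{\mathrm{CV}} = a + \frac{b}{\mathrm{CV}}
$$

With $b > 0$, the second term grows as CV shrinks, so the ratio rises even though LB itself falls. Using the approximate fit on the 13 submissions ($a \approx 4.23$, $b \approx 0.0533$), the formula gives about $4.23 + 0.0533 / 0.0083 \approx 10.65$ at the best CV, close to the observed $0.0877 / 0.0083 \approx 10.57$ for exp_030, and about $4.23 + 0.0533 / 0.0145 \approx 7.91$ at the worst CV, matching the observed $0.1147 / 0.0145 \approx 7.91$ for exp_042.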

In [None]:
# What would it take to reach the target?
target = 0.0347

print('=== REACHING THE TARGET ===')
print(f'Target: {target}')
best_lb = df['lb'].min()  # read from the table instead of hardcoding 0.0877
print(f'Current best LB: {best_lb:.4f}')
print(f'Gap: {best_lb / target:.2f}x')

print(f'\n=== SCENARIO ANALYSIS ===')

# Scenario 1: Reduce intercept
print(f'\nScenario 1: Reduce intercept (keep slope)')
for new_intercept in [0.04, 0.03, 0.02, 0.01, 0.00]:
    required_cv = (target - new_intercept) / slope
    # A negative required CV means this intercept still exceeds the target.
    note = ' (impossible: negative)' if required_cv < 0 else ''
    print(f'  Intercept {new_intercept:.2f}: Required CV = {required_cv:.4f}{note}')

# Scenario 2: Reduce slope
print(f'\nScenario 2: Reduce slope (keep intercept)')
for new_slope in [3.0, 2.0, 1.0, 0.5]:
    required_cv = (target - intercept) / new_slope
    note = ' (impossible: negative)' if required_cv < 0 else ''
    print(f'  Slope {new_slope:.1f}: Required CV = {required_cv:.4f}{note}')

# Scenario 3: Both
print(f'\nScenario 3: Reduce both')
print(f'  If slope=2.0, intercept=0.02: Required CV = {(target - 0.02) / 2.0:.4f}')
print(f'  If slope=1.0, intercept=0.02: Required CV = {(target - 0.02) / 1.0:.4f}')
print(f'  If slope=1.0, intercept=0.01: Required CV = {(target - 0.01) / 1.0:.4f}')
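
All three scenarios apply the same rearrangement of LB = slope × CV + intercept. A small helper makes the feasibility condition explicit instead of relying on reading the sign of the printed number. This is a sketch: `required_cv_for` is a hypothetical name, and the `(4.23, 0.0533)` pair is the rounded fit on the 13 submissions, not an exact value.

```python
def required_cv_for(target, slope, intercept):
    """Return the CV at which slope * CV + intercept equals target.

    Returns None when the answer would be zero or negative, which happens
    whenever the intercept is at or above the target.
    """
    if slope <= 0:
        raise ValueError('slope must be positive')
    cv = (target - intercept) / slope
    return cv if cv > 0 else None


TARGET = 0.0347
scenarios = [
    (4.23, 0.0533),  # rounded current fit
    (4.23, 0.02),    # lower intercept only
    (2.0, 0.02),     # lower slope and intercept
    (1.0, 0.01),
]
for s, b in scenarios:
    cv_needed = required_cv_for(TARGET, s, b)
    verdict = 'unreachable' if cv_needed is None else f'CV = {cv_needed:.4f}'
    print(f'slope={s:.2f}, intercept={b:.4f}: {verdict}')
```

Under this rule, lowering the slope alone can never help while the intercept stays above the target, which is why Scenario 2 yields only negative values.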

In [None]:
# Analyze what makes the best models different
print('=== BEST MODELS ANALYSIS ===')

best_models = df.nsmallest(5, 'lb')
print('Top 5 by LB:')
print(best_models[['exp', 'cv', 'lb', 'model']].to_string(index=False))

print('\nWhat each top-5 model does:')
print('1. GP Ensemble (exp_030): GP 0.15 + MLP 0.55 + LGBM 0.3')
print('2. Weighted Loss (exp_026): Per-target loss weighting')
print('3. ACS PCA (exp_024): Added ACS PCA features')
print('4. Simple Ensemble (exp_012): Simple averaging')
print('5. Even Simpler (exp_007): LB 0.0932, tied with exp_041; Ridge (exp_009) follows at 0.0936')

print('\nKey insight: The best model (exp_030) is a DIVERSE ensemble with a GP component.')
print('Pure GP alone (exp_042) has the worst LB, so the gain likely comes from the blend, not from GP by itself.')
print('GP provides uncertainty estimates and different extrapolation behavior.')
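
The exp_030 entry lists weights of 0.15, 0.55, and 0.3 for GP, MLP, and LGBM. Assuming those weights are applied as a plain weighted average of each model's predictions (the notebook does not say how exp_030 combines them), the blend can be sketched as below. The `blend` helper and the prediction values are hypothetical and exist only to show the arithmetic.

```python
def blend(predictions, weights):
    """Weighted average of per-model predictions, element by element."""
    if set(predictions) != set(weights):
        raise ValueError('predictions and weights must name the same models')
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f'weights sum to {total}, expected 1.0')
    lengths = {len(p) for p in predictions.values()}
    if len(lengths) != 1:
        raise ValueError('all models must predict the same number of samples')
    n = lengths.pop()
    return [sum(weights[m] * predictions[m][i] for m in weights)
            for i in range(n)]


# Weights as listed for exp_030.
weights = {'gp': 0.15, 'mlp': 0.55, 'lgbm': 0.30}

# Hypothetical predictions for three samples, not real model output.
predictions = {
    'gp': [0.2, 0.4, 0.6],
    'mlp': [0.1, 0.5, 0.7],
    'lgbm': [0.3, 0.3, 0.5],
}

blended = blend(predictions, weights)
print([round(v, 4) for v in blended])
```

Because the GP weight is small, the GP output shifts the blend only slightly; whatever benefit it brings comes from disagreeing with the MLP and LGBM on some samples rather than from dominating the average.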

In [None]:
# Strategic decision for remaining 3 submissions
print('=== STRATEGIC DECISION ===')
print(f'Submissions remaining: 3')
print(f'Target: {target}')
print(f'Best LB: {df["lb"].min():.4f}')
print(f'Gap: {df["lb"].min() / target:.2f}x')

print('\n=== OPTIONS ===')

print('\n1. Submit exp_045 (Stacking)')
print(f'   CV: {stacking_cv:.6f}')
print(f'   Predicted LB: {predicted_lb:.4f}')
print('   Purpose: Test if stacking has different CV-LB relationship')
print('   Risk: LB might be worse than best')
print('   Reward: If stacking has lower intercept, we have a path forward')

print('\n2. Try XGBoost/CatBoost ensemble')
print('   Purpose: Test if different boosting algorithms have different relationship')
print('   Risk: Likely same relationship as LGBM')
print('   Reward: Might find a model with lower intercept')

print('\n3. Try domain adaptation')
print('   Purpose: Reduce distribution shift')
print('   Risk: Complex to implement correctly')
print('   Reward: Could fundamentally change the CV-LB relationship')

print('\n4. Try different feature engineering')
print('   Purpose: Find features that generalize better')
print('   Risk: Time-consuming')
print('   Reward: Could reduce intercept if features capture test distribution better')

print('\n=== RECOMMENDATION ===')
print('DO NOT submit exp_045 yet.')
print('The CV is 22% worse than best, and if it follows the same relationship,')
print('the LB will be ~0.0956, which is 9% WORSE than best LB (0.0877).')
print('')
print('Instead, try one more approach that might have a different CV-LB relationship:')
print('1. XGBoost + CatBoost ensemble (different boosting algorithms)')
print('2. Or: Feature engineering with domain-specific features')
print('3. Or: Sample weighting based on similarity to test distribution')