# Loop 58 Analysis: Post-Simpler Model Assessment

**Situation:**
- 58 experiments completed, 27 consecutive failures since exp_030
- Best LB: 0.0877 (exp_030), Target: 0.0707
- Gap: 1.24x (0.0877 / 0.0707); LB must fall 19.4% from the current best ((0.0877 - 0.0707) / 0.0877)
- 5 submissions remaining
- exp_057 (Simpler Model with Spange Only) FAILED - CV 0.023017 (177.4% worse than the best CV of 0.008298)

**Critical Evaluator Insight:**
- The target IS reachable under the fitted CV-LB line (LB = 4.31*CV + 0.0525): Intercept (0.0525) < Target (0.0707)
- Required CV to hit target: 0.00422 (49% improvement from current best 0.008298)
- The 'mixall' kernel uses GroupKFold(5) instead of Leave-One-Out(24)

**Questions:**
1. What approaches haven't been tried?
2. How can we reduce CV by 49%?
3. What's the path to LB 0.0707?
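
To make the CV-scheme caveat from the evaluator insight concrete, the sketch below contrasts the two split schemes on placeholder data. It assumes that "Leave-One-Out(24)" means holding out one of 24 groups per fold; the group count comes from the text above, while the samples per group and the group labels are invented purely for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Placeholder layout: 24 groups with 10 samples each (sizes are assumed).
n_groups, per_group = 24, 10
groups = np.repeat(np.arange(n_groups), per_group)
X = np.zeros((len(groups), 1))  # feature values do not affect the split

def held_out_groups(splitter):
    """Number of distinct groups held out in each fold."""
    return [len(np.unique(groups[test_idx]))
            for _, test_idx in splitter.split(X, groups=groups)]

logo_sizes = held_out_groups(LeaveOneGroupOut())
gkf_sizes = held_out_groups(GroupKFold(n_splits=5))

print(f"LeaveOneGroupOut: {len(logo_sizes)} folds, "
      f"groups held out per fold: {sorted(set(logo_sizes))}")
print(f"GroupKFold(5):    {len(gkf_sizes)} folds, "
      f"groups held out per fold: {sorted(gkf_sizes)}")
```

Under these assumptions each GroupKFold model trains on roughly 19 or 20 of the 24 groups instead of 23, and every fold score averages over several held-out groups, which is why the two local CV numbers should not be compared directly.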

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
    {'exp': 'exp_035', 'cv': 0.0098, 'lb': 0.0970},
]

df = pd.DataFrame(submissions)
print("Submission History:")
print(df.to_string(index=False))
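TARGET_LB = 0.0707
best_lb = df['lb'].min()
print(f"\nTarget LB: {TARGET_LB}")
print(f"Best LB: {best_lb:.4f} ({df.loc[df['lb'].idxmin(), 'exp']})")
# Improvement is measured relative to the current best LB, which gives the 19.4% quoted above.
print(f"Gap to target: {best_lb / TARGET_LB:.2f}x ({(best_lb - TARGET_LB) / best_lb * 100:.1f}% improvement needed)")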

In [None]:
# CV-LB relationship analysis
cv = df['cv'].values
lb = df['lb'].values

slope, intercept, r_value, p_value, std_err = stats.linregress(cv, lb)

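TARGET_LB = 0.0707
BEST_CV = 0.008298  # exp_030, unrounded (the table above shows 0.0083)

print("CV-LB Linear Relationship:")
print(f"  LB = {slope:.2f} * CV + {intercept:.4f}")
print(f"  R² = {r_value**2:.4f}")
print(f"  Intercept = {intercept:.4f}")
print(f"  Target LB = {TARGET_LB}")
print()
print("CRITICAL INSIGHT:")
if intercept < TARGET_LB:
    print(f"  Intercept ({intercept:.4f}) < Target ({TARGET_LB})")
    print("  This means the target IS REACHABLE under the linear fit!")
else:
    print(f"  Intercept ({intercept:.4f}) >= Target ({TARGET_LB})")
    print("  With a positive slope, the linear fit cannot reach the target.")
print()
print("Required CV to hit target:")
# Solve TARGET_LB = slope * CV + intercept for CV.
required_cv = (TARGET_LB - intercept) / slope
print(f"  CV = ({TARGET_LB} - {intercept:.4f}) / {slope:.2f} = {required_cv:.6f}")
print(f"  Current best CV: {BEST_CV}")
print(f"  Required improvement: {(BEST_CV - required_cv) / BEST_CV * 100:.1f}%")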
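
Because the reachability argument rests on the intercept sitting below the target, it may be worth checking how tightly twelve submissions pin that intercept down. The sketch below refits the line from the CV/LB pairs in the submission history and builds a 95% confidence interval for the intercept with the standard OLS standard-error formula. The helper name `intercept_ci` and the 95% level are illustrative choices, not part of the original analysis.

```python
import numpy as np
from scipy import stats

# CV/LB pairs copied from the submission history above.
cv_hist = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
                    0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb_hist = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
                    0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])

def intercept_ci(x, y, level=0.95):
    """Return (slope, intercept, ci_low, ci_high) for an OLS line fit."""
    n = len(x)
    fit = stats.linregress(x, y)
    resid = y - (fit.slope * x + fit.intercept)
    s2 = np.sum(resid ** 2) / (n - 2)            # residual variance
    sxx = np.sum((x - x.mean()) ** 2)
    se_intercept = np.sqrt(s2 * (1.0 / n + x.mean() ** 2 / sxx))
    t_crit = stats.t.ppf(0.5 + level / 2.0, df=n - 2)
    half_width = t_crit * se_intercept
    return (fit.slope, fit.intercept,
            fit.intercept - half_width, fit.intercept + half_width)

fit_slope, fit_intercept, ci_low, ci_high = intercept_ci(cv_hist, lb_hist)
print(f"LB = {fit_slope:.2f} * CV + {fit_intercept:.4f}")
print(f"95% CI for intercept: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"Upper bound below target 0.0707: {ci_high < 0.0707}")
```

The interval only describes uncertainty inside the observed CV range (0.0083 to 0.0123). The required CV of about 0.0042 lies well outside that range, so the linear extrapolation itself remains an assumption.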

In [None]:
# Analyze residuals - which experiments performed better/worse than expected?
df['predicted_lb'] = slope * df['cv'] + intercept
df['residual'] = df['lb'] - df['predicted_lb']

print("CV-LB Residual Analysis:")
print("(Negative residual = performed BETTER on LB than expected from CV)")
print("")
for _, row in df.sort_values('residual').iterrows():
    print(f"  {row['exp']}: CV={row['cv']:.4f}, LB={row['lb']:.4f}, Predicted={row['predicted_lb']:.4f}, Residual={row['residual']:+.4f}")

print(f"\nBest residual: {df.loc[df['residual'].idxmin(), 'exp']} ({df['residual'].min():.4f})")
print(f"Worst residual: {df.loc[df['residual'].idxmax(), 'exp']} ({df['residual'].max():.4f})")
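
Since a few submissions sit further from the line than others, a leave-one-out refit gives a rough sense of how much any single point drives the slope, the intercept, and the required CV. This is a sketch using the same twelve CV/LB pairs; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# CV/LB pairs copied from the submission history above.
exps = ['exp_000', 'exp_001', 'exp_003', 'exp_005', 'exp_006', 'exp_007',
        'exp_009', 'exp_012', 'exp_024', 'exp_026', 'exp_030', 'exp_035']
cv_hist = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
                    0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb_hist = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
                    0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])
TARGET_LB = 0.0707

loo_rows = []
for i, name in enumerate(exps):
    keep = np.arange(len(cv_hist)) != i          # drop one submission
    fit = stats.linregress(cv_hist[keep], lb_hist[keep])
    req_cv = (TARGET_LB - fit.intercept) / fit.slope
    loo_rows.append((name, fit.slope, fit.intercept, req_cv))

for name, s, b, req in loo_rows:
    print(f"without {name}: slope={s:.2f}, intercept={b:.4f}, required CV={req:.5f}")

req_values = [row[3] for row in loo_rows]
print(f"required CV ranges from {min(req_values):.5f} to {max(req_values):.5f}")
```

If the required CV moves noticeably when one submission is removed, the 49% figure should be read as an estimate with some spread rather than a fixed threshold.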

In [None]:
# What approaches have been tried?
print("="*60)
print("APPROACHES TRIED (58 experiments)")
print("="*60)

approaches = [
    "MLP with Arrhenius kinetics (exp_000, exp_006, exp_007)",
    "LightGBM (exp_001)",
    "DRFP features with PCA (exp_002)",
    "Combined Spange + DRFP (exp_003, exp_005)",
    "Deep Residual MLP (exp_004) - FAILED",
    "Large Ensemble 15 models (exp_005)",
    "Simpler models [64,32] (exp_006, exp_007, exp_008)",
    "Ridge Regression (exp_009, exp_033)",
    "Single layer 16 (exp_010)",
    "Diverse Ensemble (exp_011, exp_047)",
    "Simple Ensemble (exp_012)",
    "Compliant Ensemble (exp_013)",
    "Ensemble weight tuning (exp_014, exp_031, exp_035, exp_036)",
    "Three model ensemble (exp_015)",
    "Attention model (exp_017)",
    "Fragprints (exp_018)",
    "ACS PCA features (exp_019, exp_023, exp_024)",
    "Per-target models (exp_025)",
    "Weighted loss (exp_026)",
    "Simple features (exp_027)",
    "Four model ensemble (exp_028)",
    "Normalization (exp_029)",
    "GP Ensemble (exp_030) - BEST",
    "Higher GP weight (exp_031, exp_035)",
    "Pure GP (exp_032)",
    "Kernel Ridge (exp_034)",
    "Similarity weighting (exp_037)",
    "Minimal features (exp_038)",
    "Learned embeddings (exp_039)",
    "GNN architectures (exp_040, exp_052)",
    "ChemBERTa (exp_041)",
    "Calibration (exp_042)",
    "Nonlinear mixture (exp_043)",
    "Hybrid model (exp_044)",
    "Mean reversion (exp_045)",
    "Adaptive weighting (exp_046)",
    "Hybrid features (exp_048)",
    "Manual OOD handling (exp_049)",
    "LISA/REX (exp_050)",
    "Simpler model (exp_051, exp_054)",
    "mixall full features (exp_053)",
    "Chemical constraints (exp_055)",
    "XGBoost + RF Ensemble (exp_056)",
    "Simpler Spange Only (exp_057)",
]

for i, approach in enumerate(approaches, 1):
    print(f"  {i}. {approach}")

In [None]:
# What approaches HAVEN'T been tried?
print("="*60)
print("APPROACHES NOT YET TRIED")
print("="*60)

untried = [
    "1. PREDICTION CALIBRATION (Isotonic Regression; check overlap with exp_042 'Calibration')",
    "   - Train best model (exp_030)",
    "   - Use CV predictions to fit isotonic regression",
    "   - Apply calibration to test predictions",
    "   - Corrects systematic bias that is visible in the CV predictions",
    "",
    "2. IMPORTANCE WEIGHTING",
    "   - Weight training samples by similarity to test distribution",
    "   - Use adversarial validation to identify drifting features",
    "   - Down-weight samples that are far from test distribution",
    "",
    "3. DOMAIN ADAPTATION",
    "   - Adapt model to test distribution at inference time",
    "   - Use test-time training (TTT) or transductive learning",
    "   - Fine-tune on test data without labels",
    "",
    "4. CATBOOST",
    "   - Different gradient boosting implementation",
    "   - Handles categorical features natively",
    "   - May have different inductive biases",
    "",
    "5. NEURAL NETWORK ENSEMBLES WITH DIFFERENT ARCHITECTURES",
    "   - Train multiple MLPs with different architectures",
    "   - Use different activation functions (GELU, SiLU)",
    "   - Use different regularization (LayerNorm, GroupNorm)",
    "",
    "6. QUANTILE REGRESSION",
    "   - Train model with quantile loss (median)",
    "   - May produce more robust predictions",
    "   - Different loss function could change CV-LB relationship",
]

for line in untried:
    print(line)
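
The isotonic calibration steps listed first can be sketched as follows. The data here is synthetic: it stands in for exp_030's out-of-fold predictions and labels, which are not reproduced in this notebook, and MAE is used only as an illustrative metric.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Placeholder data: synthetic targets and deliberately biased predictions.
# In the real pipeline these would be out-of-fold predictions and labels.
y_oof = rng.uniform(0.0, 1.0, size=500)
oof_pred = 0.1 + 0.7 * y_oof + rng.normal(0.0, 0.05, size=500)

# Fit a monotone map from predictions to targets on out-of-fold data only.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(oof_pred, y_oof)

# Apply the same map to (here: synthetic) test-time predictions.
y_test = rng.uniform(0.0, 1.0, size=200)
test_pred = 0.1 + 0.7 * y_test + rng.normal(0.0, 0.05, size=200)
test_calibrated = calibrator.predict(test_pred)

mae_raw = np.mean(np.abs(test_pred - y_test))
mae_cal = np.mean(np.abs(test_calibrated - y_test))
print(f"MAE raw: {mae_raw:.4f}, MAE calibrated: {mae_cal:.4f}")
```

The calibrator can only remove bias that appears in the out-of-fold predictions. Whether it also shifts the CV-LB intercept depends on whether the leaderboard bias shows up inside the folds, which this sketch does not test.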

In [None]:
# Analyze the mixall kernel approach
print("="*60)
print("ANALYSIS: The 'mixall' Kernel Approach")
print("="*60)

print("""
The 'mixall' kernel achieves good LB scores using:

1. ENSEMBLE: MLP (0.4) + XGBoost (0.2) + RandomForest (0.2) + LightGBM (0.2)
   - Our exp_056 tried XGBoost + RF but FAILED
   - Key difference: mixall uses different weights and architecture

2. FEATURES: Spange descriptors + Residence Time + Temperature
   - Simple features, no DRFP
   - Our exp_057 tried this but FAILED

3. CV SCHEME: GroupKFold(5) instead of Leave-One-Out(24)
   - This is a GRAY AREA in competition rules
   - Their local CV is not comparable to ours
   - But their model may still generalize better

4. ARCHITECTURE: MLP [128, 64, 32] with dropout 0.1
   - Similar to our exp_006, exp_007
   - Not fundamentally different

KEY INSIGHT:
The mixall kernel's success is NOT due to a fundamentally different approach.
It's likely due to:
  a) Different hyperparameters
  b) Different random seeds
  c) Different training dynamics
  d) Luck in the CV-LB relationship
""")

print("\nOur best model (exp_030) uses:")
print("  - GP (0.15) + MLP (0.55) + LGBM (0.30)")
print("  - Spange + DRFP + Arrhenius features")
print("  - CV: 0.008298, LB: 0.0877")
print("\nTo reach target LB 0.0707, we need:")
print(f"  - CV: {required_cv:.6f} ({(0.008298 - required_cv) / 0.008298 * 100:.0f}% improvement)")
print("  - Or change the CV-LB relationship (reduce intercept)")
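
The two routes named above (lower CV, or a lower intercept) can be traded against each other explicitly. The sketch below takes the slope and intercept of this notebook's fit (4.31 and 0.0525) as fixed, which is an assumption because a new method could change both, and lists the intercept needed to hit 0.0707 at several CV levels. The function names are illustrative.

```python
SLOPE = 4.31        # slope of the fitted CV-LB line in this notebook
INTERCEPT = 0.0525  # intercept of the fitted CV-LB line in this notebook
TARGET_LB = 0.0707
BEST_CV = 0.008298  # exp_030

def required_intercept(cv, slope=SLOPE, target=TARGET_LB):
    """Intercept needed so that slope * cv + intercept equals the target."""
    return target - slope * cv

def required_cv_for(intercept=INTERCEPT, slope=SLOPE, target=TARGET_LB):
    """CV needed to hit the target for a given intercept."""
    return (target - intercept) / slope

print(f"Required CV at current intercept: {required_cv_for():.5f}")
for cv_gain in (0.0, 0.10, 0.25, 0.49):
    cv_level = BEST_CV * (1.0 - cv_gain)
    needed = required_intercept(cv_level)
    print(f"CV improved {cv_gain:>4.0%} -> CV {cv_level:.6f}, "
          f"intercept must be <= {needed:.4f} (now {INTERCEPT:.4f})")
```

With no CV gain at all, the intercept would have to fall from 0.0525 to about 0.0349, roughly a one-third reduction.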

In [None]:
# Strategic analysis
print("="*60)
print("STRATEGIC ANALYSIS")
print("="*60)

print("""
CURRENT SITUATION:
- 27 consecutive failures since exp_030
- Best LB: 0.0877 (exp_030)
- Target: 0.0707 (19.4% improvement needed)
- 5 submissions remaining

THE PROBLEM:
- We've tried many approaches but none beat exp_030
- The CV-LB relationship is: LB = 4.31*CV + 0.0525
- To reach target, we need CV = 0.00422 (49% improvement)

THE PATH FORWARD:

1. FOCUS ON CV IMPROVEMENT
   - Current best CV: 0.008298
   - Required CV: 0.00422
   - This is a 49% improvement - very aggressive
   - Need fundamentally better features or models

2. FOCUS ON CV-LB RELATIONSHIP
   - The intercept (0.0525) is the LB error the fit predicts at CV = 0 (about 60% of the best LB, 0.0877)
   - Prediction calibration could reduce this
   - Importance weighting could reduce this

3. SUBMISSION STRATEGY
   - 5 submissions remaining
   - Use 2-3 for experiments that might change CV-LB relationship
   - Save 2 for final attempts

RECOMMENDED PRIORITIES:
1. Prediction Calibration (Isotonic Regression) - directly addresses intercept
2. Importance Weighting - addresses distribution shift
3. CatBoost - different inductive biases
4. Quantile Regression - different loss function
""")

print("\nNOTE: The evaluator says the target IS reachable.")
print("The intercept (0.0525) < Target (0.0707) means we CAN reach it.")
print("We just need to improve CV by 49% or reduce the intercept.")
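
The importance-weighting idea in the priorities above can be sketched with adversarial validation: a classifier learns to tell training rows from test rows, and its probabilities give a density-ratio weight for each training row. The features below are synthetic placeholders with an artificial shift, and the clipping limits are an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder features: the test set is shifted relative to the training set.
X_train = rng.normal(loc=0.0, scale=1.0, size=(400, 2))
X_test = rng.normal(loc=1.0, scale=1.0, size=(200, 2))

# Adversarial validation: classify train (0) versus test (1).
X_all = np.vstack([X_train, X_test])
is_test = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
clf = LogisticRegression().fit(X_all, is_test)

# Density-ratio estimate p(test|x) / p(train|x), corrected for set sizes.
p_test = clf.predict_proba(X_train)[:, 1]
weights = (p_test / (1.0 - p_test)) * (len(X_train) / len(X_test))
weights = np.clip(weights, 0.1, 10.0)   # clipping limits are arbitrary
weights /= weights.mean()               # normalise to mean 1

# Training rows that resemble the test set receive larger weights.
print(f"weight range: {weights.min():.2f} to {weights.max():.2f}")
print(f"classifier coefficients (drift per feature): {clf.coef_[0]}")
```

The resulting array could be passed as `sample_weight` to estimators whose `fit` accepts it. With a more flexible classifier, cross-validated probabilities would be a safer choice than in-sample ones, since in-sample scores tend to overstate the shift.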

In [None]:
# Final summary
print("="*60)
print("LOOP 58 SUMMARY")
print("="*60)

print("""
Current Status:
  - Best CV: 0.008298 (exp_030)
  - Best LB: 0.0877 (exp_030)
  - Target LB: 0.0707
  - Gap: 19.4% improvement needed
  - Submissions remaining: 5
  - Consecutive failures: 27

Key Findings:
  1. CV-LB relationship: LB = 4.31*CV + 0.0525 (R²=0.95)
  2. Intercept (0.0525) < Target (0.0707) - target IS reachable
  3. Required CV to hit target: 0.00422 (49% improvement)
  4. exp_057 (Simpler Spange Only) FAILED - CV 0.023017 (177.4% worse)
  5. The 'mixall' kernel uses GroupKFold(5) - not directly comparable

Recommended Next Steps:
  1. Prediction Calibration (Isotonic Regression) - directly addresses intercept
  2. Importance Weighting - addresses distribution shift
  3. CatBoost - different inductive biases
  4. Quantile Regression - different loss function

Submission Strategy:
  - Use 2-3 submissions for experiments that might change CV-LB relationship
  - Save 2 submissions for final attempts
  - Focus on approaches that reduce the intercept, not just CV
""")