# Loop 50 Analysis: Breaking Out of the Local Optimum

## Key Observations

1. **Last 19 experiments have ALL been worse than exp_030** (CV = 0.008298)
2. **CV-LB relationship**: LB = 4.29*CV + 0.0528 (R²=0.95)
3. **CRITICAL**: Intercept (0.0528) > Target (0.0347) → under this linear fit, the target is UNREACHABLE via CV improvement alone (it would require a negative CV)
4. **Best LB**: 0.0877 from exp_030
5. **Target**: 0.0347

## Analysis Goals

1. Understand WHY the CV-LB relationship has such a high intercept
2. Identify what could CHANGE this relationship
3. Find approaches that haven't been tried

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
    {'exp': 'exp_035', 'cv': 0.0098, 'lb': 0.0970},
]

df = pd.DataFrame(submissions)
print("Submission History:")
print(df.to_string())

# Fit linear relationship
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(df['cv'], df['lb'])
print(f"\nCV-LB Relationship: LB = {slope:.2f}*CV + {intercept:.4f}")
print(f"R² = {r_value**2:.4f}")
print(f"\nTarget LB: 0.0347")
print(f"Intercept: {intercept:.4f}")
print(f"Gap: {intercept - 0.0347:.4f}")
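
The conclusion that the intercept exceeds the target rests on a 12-point fit extrapolated to CV = 0, so it is worth checking how uncertain that intercept is. The sketch below recomputes the fit by hand and attaches a standard OLS 95% confidence interval to the intercept. It assumes the usual linear-model conditions (independent, roughly normal residuals), which may not hold here, and the critical value `T_CRIT_DF10` is the textbook two-sided Student-t value for 10 degrees of freedom, hard-coded to keep the example dependency-free.

```python
import math

# (CV, LB) pairs copied from the submission history above.
cv = [0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
      0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098]
lb = [0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
      0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970]

TARGET_LB = 0.0347
T_CRIT_DF10 = 2.228  # two-sided 95% Student-t critical value for n - 2 = 10

n = len(cv)
mean_cv = sum(cv) / n
mean_lb = sum(lb) / n
sxx = sum((x - mean_cv) ** 2 for x in cv)
sxy = sum((x - mean_cv) * (y - mean_lb) for x, y in zip(cv, lb))

# Ordinary least squares estimates
slope = sxy / sxx
intercept = mean_lb - slope * mean_cv

# Residual variance with n - 2 degrees of freedom
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(cv, lb))
residual_var = sse / (n - 2)

# Standard error of the intercept for simple linear regression
intercept_se = math.sqrt(residual_var * (1.0 / n + mean_cv ** 2 / sxx))

ci_low = intercept - T_CRIT_DF10 * intercept_se
ci_high = intercept + T_CRIT_DF10 * intercept_se

print(f"slope = {slope:.3f}, intercept = {intercept:.4f} (SE {intercept_se:.4f})")
print(f"95% CI for intercept: [{ci_low:.4f}, {ci_high:.4f}] vs target {TARGET_LB}")
```

If the lower bound of the interval still sits above the target, the "unreachable via CV alone" conclusion does not hinge on noise in the fit, although it still depends on the relationship staying linear far below the observed CV range.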

In [None]:
# What CV would be needed to hit target?
target_lb = 0.0347
best_cv = 0.008298  # exp_030
# Invert LB = slope * CV + intercept to solve for CV
required_cv = (target_lb - intercept) / slope
print(f"Required CV to hit target: {required_cv:.6f}")
print(f"Current best CV: {best_cv:.6f}")
print(f"Gap: {best_cv - required_cv:.6f}")

if required_cv < 0:
    print("\n⚠️ CRITICAL: Required CV is NEGATIVE - target is unreachable with current relationship!")
    print("\nThis means we need to CHANGE the relationship, not just improve CV.")
    print("\nPossible ways to change the relationship:")
    print("1. Use a fundamentally different model architecture")
    print("2. Use different features that generalize better to LB")
    print("3. Use a different training strategy (e.g., transductive learning)")
    print("4. Calibrate predictions specifically for OOD solvents")

In [None]:
# Analyze the residuals from the CV-LB relationship
df['predicted_lb'] = slope * df['cv'] + intercept
df['residual'] = df['lb'] - df['predicted_lb']

print("Residual Analysis:")
print(df[['exp', 'cv', 'lb', 'predicted_lb', 'residual']].to_string())

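# With an intercept in the fit, OLS residuals average to zero by construction,
# so the mean is only a sanity check; the spread is the informative number.
print(f"\nMean residual: {df['residual'].mean():.6f}")
print(f"Std residual: {df['residual'].std():.6f}")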

# Which experiments beat the predicted LB?
df['beats_prediction'] = df['residual'] < 0
print(f"\nExperiments that beat predicted LB:")
print(df[df['beats_prediction']][['exp', 'cv', 'lb', 'predicted_lb', 'residual']])

In [None]:
# Key insight: exp_030 has the best LB but not the best residual
# Let's see which experiments have the best residuals
df_sorted = df.sort_values('residual')
print("Experiments sorted by residual (best first):")
print(df_sorted[['exp', 'cv', 'lb', 'residual']].to_string())

print("\nKey insight: The experiments with best residuals are:")
for _, row in df_sorted.head(3).iterrows():
    print(f"  {row['exp']}: residual = {row['residual']:.6f}, LB = {row['lb']:.4f}")

In [None]:
# What would happen if we could reduce the intercept?
print("Scenario Analysis: What if we could reduce the intercept?")
print("="*60)

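# Note: with slope ≈ 4.29 and CV = 0.008298, the slope term alone is ≈ 0.036,
# which already exceeds the target of 0.0347. At the current best CV, lowering
# the intercept is therefore not sufficient on its own; CV must improve as well.
for new_intercept in [0.04, 0.03, 0.02, 0.01, 0.0]: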
    new_lb = slope * 0.008298 + new_intercept
    print(f"Intercept = {new_intercept:.2f} → LB = {new_lb:.4f} (target: 0.0347)")
    if new_lb < 0.0347:
        print(f"  ✓ Would beat target!")
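
The scenario loop above holds CV fixed and varies the intercept. Turning the question around, the sketch below asks what CV each hypothetical intercept would require to reach the target. It uses the rounded slope from the reported fit (4.29) rather than the live `slope` variable, so the numbers are approximate, and `required_cv` here is an illustrative helper, not part of the original analysis.

```python
SLOPE = 4.29        # rounded slope from the reported CV-LB fit
TARGET_LB = 0.0347
BEST_CV = 0.008298  # exp_030


def required_cv(new_intercept, slope=SLOPE, target=TARGET_LB):
    """CV at which slope * CV + new_intercept equals the target LB."""
    return (target - new_intercept) / slope


for new_intercept in [0.04, 0.03, 0.02, 0.01, 0.0]:
    cv_needed = required_cv(new_intercept)
    if cv_needed <= 0:
        print(f"Intercept = {new_intercept:.2f}: target unreachable at any CV")
    else:
        reduction = 1.0 - cv_needed / BEST_CV
        print(f"Intercept = {new_intercept:.2f}: need CV <= {cv_needed:.6f} "
              f"({reduction:.0%} below current best)")
```

Reading the two views together shows how much of the gap must come from shifting the relationship and how much from further CV gains.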

In [None]:
# What approaches haven't been tried?
print("\nApproaches NOT yet tried (or tried poorly):")
print("="*60)

approaches = [
    ("1. Domain Adaptation", "Train on source domain, adapt to target domain", "NOT TRIED"),
    ("2. Meta-Learning (MAML)", "Learn to quickly adapt to new solvents", "NOT TRIED"),
    ("3. Transductive Learning", "Process test samples jointly with training", "NOT TRIED"),
    ("4. Adversarial Training", "Train to be robust to distribution shift", "NOT TRIED"),
    ("5. Importance Weighting", "Re-weight training samples by density ratio", "NOT TRIED"),
    ("6. Test-Time Adaptation", "Adapt model at test time using unlabeled data", "NOT TRIED"),
    ("7. Proper GNN", "Graph attention network with proper implementation", "TRIED POORLY (exp_040)"),
    ("8. Ensemble of Diverse Architectures", "GNN + MLP + LGBM", "NOT TRIED"),
]

for name, desc, status in approaches:
    print(f"\n{name}")
    print(f"  Description: {desc}")
    print(f"  Status: {status}")
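
Approach 5 (importance weighting by density ratio) can be illustrated on a toy one-dimensional descriptor. The sketch below estimates the training and test densities with a hand-written Gaussian kernel density estimate and weights each training point by their ratio. The descriptor values, the bandwidth, and the helper names `gaussian_kde` and `importance_weights` are all made up for illustration; real solvent descriptors are multi-dimensional, where a classifier-based density-ratio estimate is usually a better fit than a KDE.

```python
import math


def gaussian_kde(x, samples, bandwidth):
    """Gaussian kernel density estimate of `samples`, evaluated at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)


def importance_weights(train_x, test_x, bandwidth=0.5):
    """Weight each training point by p_test(x) / p_train(x), rescaled to mean 1."""
    raw = [
        gaussian_kde(x, test_x, bandwidth) / gaussian_kde(x, train_x, bandwidth)
        for x in train_x
    ]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


# Hypothetical 1-D descriptor values (not taken from the competition data).
train_x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.5, 2.0]
test_x = [1.4, 1.6, 1.8, 2.0, 2.2]

weights = importance_weights(train_x, test_x)
for x, w in zip(train_x, weights):
    print(f"x = {x:.1f} -> weight = {w:.3f}")
```

Training points that lie where test points are dense receive larger weights. Such weights would typically be passed to a learner that accepts per-sample weights, for example through a `sample_weight` argument where the estimator supports one.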

In [None]:
# The key insight from the mixall kernel
print("\nKey Insight from Mixall Kernel:")
print("="*60)
print("""The mixall kernel OVERWRITES the utility functions to use GroupKFold (5 splits)
instead of Leave-One-Out (24 folds). This means:

1. Their local CV scores are NOT comparable to ours
2. The LB evaluation uses the OFFICIAL scheme (Leave-One-Out)
3. Their success on LB is due to their MODEL, not their validation scheme

What the mixall kernel uses:
- MLP + XGBoost + RandomForest + LightGBM ensemble
- Spange descriptors only (no DRFP)
- Optuna hyperparameter tuning
- Weights: [MLP, XGBoost, RF, LightGBM]

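Our exp_049 tried this ensemble but with DRFP features added and got CV = 0.014196,
well above exp_030 (CV = 0.008298). Because the mixall kernel uses Spange ONLY,
exp_049 does not tell us how the Spange-only configuration scores under our scheme.

However, our experiments show that DRFP features HELP (exp_003 vs exp_000).
So the mixall feature set may not be optimal for us.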
""")

In [None]:
# What's the theoretical minimum LB we could achieve?
print("\nTheoretical Analysis:")
print("="*60)

# The GNN benchmark achieved MSE 0.0039
print("GNN Benchmark: MSE = 0.0039")
print("Our best LB: MSE = 0.0877")
print(f"Gap: {0.0877 / 0.0039:.1f}x worse")

print("\nThis suggests that MUCH better performance may be possible,")
print("provided the benchmark MSE was measured under a comparable evaluation scheme.")
print("The question is: what's different about the GNN benchmark?")
print("\nPossible differences:")
print("1. Different validation scheme (not Leave-One-Out)")
print("2. Different features (molecular graphs vs descriptors)")
print("3. Different architecture (attention mechanisms)")
print("4. Different training strategy")

In [None]:
# Final recommendation
print("\n" + "="*60)
print("FINAL RECOMMENDATION")
print("="*60)

print("""
Given that:
1. Last 19 experiments have all been worse than exp_030
2. The CV-LB relationship has intercept > target
3. All refinements to GP + MLP + LGBM have failed

We need to try something FUNDAMENTALLY DIFFERENT:

1. **PRIORITY 1: Importance Weighting / Domain Adaptation**
   - Re-weight training samples to match test distribution
   - Use adversarial validation to identify drifting features
   - This could CHANGE the CV-LB relationship

2. **PRIORITY 2: Per-Solvent Model Selection**
   - Train multiple models with different characteristics
   - Select the best model for each solvent based on similarity
   - This is different from our current ensemble approach

3. **PRIORITY 3: Simpler is Better**
   - Our best model (exp_030) uses GP + MLP + LGBM
   - Maybe we're overfitting to CV
   - Try a simpler model that generalizes better

4. **PRIORITY 4: Submit exp_030 variations**
   - We have 5 submissions left
   - Use them to test hypotheses about the CV-LB gap
   - Each submission should test a specific hypothesis
""")

In [None]:
# What we know about the high-error solvents
print("\nHigh-Error Solvents Analysis:")
print("="*60)

high_error_solvents = {
    'HFIP': {'cv_error': 0.038, 'cluster': 'Fluorinated'},
    'TFE': {'cv_error': 0.015, 'cluster': 'Fluorinated'},
    'Cyclohexane': {'cv_error': 0.026, 'cluster': 'Outlier (non-polar)'},
    'Acetonitrile.Acetic Acid': {'cv_error': 0.022, 'cluster': 'Polar'},
    'Water.Ethanol': {'cv_error': 0.028, 'cluster': 'Polar protic'},
}

print("High-error solvents and their characteristics:")
for solvent, info in high_error_solvents.items():
    print(f"  {solvent}: CV error = {info['cv_error']:.3f}, Cluster = {info['cluster']}")

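print("\nKey insight: The high-error solvents are CHEMICALLY DIFFERENT from most of the other solvents")
print("(fluorinated, non-polar, and polar clusters), so each is effectively OOD when it is held out.")
print("This suggests that the model is not generalizing well to OOD solvents.")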
print("\nPossible solutions:")
print("1. Use simpler features that capture fundamental chemistry")
print("2. Use domain adaptation to bridge the gap")
print("3. Use ensemble disagreement to identify OOD samples")
print("4. Use transductive learning to adapt to new solvents")
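
Solution 3 (ensemble disagreement as an OOD signal) can be sketched as the per-sample spread of predictions across ensemble members. The predictions and the threshold below are invented for illustration and do not come from any experiment in this notebook; the model names only mirror the GP + MLP + LGBM ensemble mentioned earlier.

```python
import statistics

# Hypothetical predictions from three ensemble members on four samples.
predictions_by_model = {
    "gp":   [0.50, 0.30, 0.80, 0.10],
    "mlp":  [0.52, 0.31, 0.40, 0.12],
    "lgbm": [0.49, 0.29, 0.65, 0.11],
}
DISAGREEMENT_THRESHOLD = 0.05  # illustrative cut-off, would need tuning


def disagreement(preds_by_model):
    """Population standard deviation across models, computed per sample."""
    n_samples = len(next(iter(preds_by_model.values())))
    return [
        statistics.pstdev([preds[i] for preds in preds_by_model.values()])
        for i in range(n_samples)
    ]


scores = disagreement(predictions_by_model)
flagged = [i for i, s in enumerate(scores) if s > DISAGREEMENT_THRESHOLD]

for i, s in enumerate(scores):
    marker = "  <- possible OOD" if i in flagged else ""
    print(f"sample {i}: disagreement = {s:.4f}{marker}")
```

Samples where the members disagree strongly are candidates for special handling, such as falling back to a simpler model or shrinking the prediction toward a conservative value. Whether disagreement actually tracks the high-error solvents would need to be checked against the per-solvent CV errors.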