# Loop 54 Analysis: Breaking the CV-LB Ceiling

## Current State
- **Best CV**: 0.008298 (exp_030)
- **Best LB**: 0.08772 (exp_030)
- **Target**: 0.0347
- **Gap**: 2.53x (0.08772 / 0.0347)
- **23 consecutive experiments** since exp_030 have failed to beat it

## Key Insight from Evaluator
The CV-LB relationship is: LB = 4.31*CV + 0.0525
- Intercept (0.0525) > Target (0.0347)
- Even a CV of 0 would extrapolate to LB = 0.0525, so the target is UNREACHABLE by improving CV alone
- We need to CHANGE the relationship, not just improve CV

## Analysis Goals
1. Understand what could change the CV-LB relationship
2. Identify unexplored approaches
3. Find the path to 0.0347

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982, 'model': 'MLP baseline'},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065, 'model': 'LightGBM'},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972, 'model': 'Spange+DRFP MLP'},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969, 'model': 'Large ensemble'},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946, 'model': 'Simpler MLP'},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932, 'model': 'Even simpler'},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936, 'model': 'Ridge regression'},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913, 'model': 'Simple ensemble'},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893, 'model': 'ACS PCA'},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887, 'model': 'Weighted loss'},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877, 'model': 'GP+MLP+LGBM'},
    {'exp': 'exp_035', 'cv': 0.0098, 'lb': 0.0970, 'model': 'Lower GP weight'},
]

df = pd.DataFrame(submissions)
print("Submission History:")
print(df.to_string())

# Fit linear relationship
slope, intercept, r_value, p_value, std_err = stats.linregress(df['cv'], df['lb'])
print(f"\nCV-LB Relationship: LB = {slope:.2f}*CV + {intercept:.4f}")
print(f"R² = {r_value**2:.4f}")
print(f"\nTarget LB: 0.0347")
print(f"Intercept: {intercept:.4f}")
print(f"Gap: {intercept - 0.0347:.4f}")

Submission History:
        exp      cv      lb             model
0   exp_000  0.0111  0.0982      MLP baseline
1   exp_001  0.0123  0.1065          LightGBM
2   exp_003  0.0105  0.0972   Spange+DRFP MLP
3   exp_005  0.0104  0.0969    Large ensemble
4   exp_006  0.0097  0.0946       Simpler MLP
5   exp_007  0.0093  0.0932      Even simpler
6   exp_009  0.0092  0.0936  Ridge regression
7   exp_012  0.0090  0.0913   Simple ensemble
8   exp_024  0.0087  0.0893           ACS PCA
9   exp_026  0.0085  0.0887     Weighted loss
10  exp_030  0.0083  0.0877       GP+MLP+LGBM
11  exp_035  0.0098  0.0970   Lower GP weight

CV-LB Relationship: LB = 4.31*CV + 0.0525
R² = 0.9505

Target LB: 0.0347
Intercept: 0.0525
Gap: 0.0178
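
The intercept is an extrapolation to CV = 0 from only 12 submissions whose CV values span 0.0083 to 0.0123, so it is worth checking how uncertain it is before treating 0.0525 as a hard floor. The sketch below refits the same data and computes a confidence interval for the intercept with the standard ordinary-least-squares formulas. The helper name `intercept_confidence_interval` and the 95% level are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from scipy import stats

# CV/LB pairs copied from the submission history above.
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])


def intercept_confidence_interval(x, y, level=0.95):
    """Return (slope, intercept, ci_low, ci_high) for a simple OLS fit."""
    n = len(x)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    # Unbiased residual variance with n - 2 degrees of freedom.
    s2 = residuals @ residuals / (n - 2)
    sxx = np.sum((x - x.mean()) ** 2)
    # Standard error of the intercept in simple linear regression.
    se = np.sqrt(s2 * (1.0 / n + x.mean() ** 2 / sxx))
    t_crit = stats.t.ppf(0.5 + level / 2.0, df=n - 2)
    return slope, intercept, intercept - t_crit * se, intercept + t_crit * se


slope_hat, intercept_hat, ci_low, ci_high = intercept_confidence_interval(cv, lb)
print(f"Intercept: {intercept_hat:.4f}, 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"Target inside CI: {ci_low <= 0.0347 <= ci_high}")
```

If the target fell inside the interval, the "unreachable" conclusion would rest on noise in the fit. If it falls outside, the conclusion holds only to the extent that the linear trend continues below the observed CV range.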


In [2]:
# What CV would be needed to hit target?
target_lb = 0.0347
required_cv = (target_lb - intercept) / slope
print(f"Required CV to hit target: {required_cv:.6f}")
print(f"Current best CV: 0.008298")

if required_cv < 0:
    print("\n⚠️ CRITICAL: Required CV is NEGATIVE - target is unreachable with current relationship!")
    print("\nThis means we need to CHANGE the relationship, not just improve CV.")
    print("\nPossible ways to change the relationship:")
    print("1. Use a fundamentally different model architecture")
    print("2. Use different features that generalize better to LB")
    print("3. Use a different training strategy (e.g., transductive learning)")
    print("4. Calibrate predictions specifically for OOD solvents")

Required CV to hit target: -0.004130
Current best CV: 0.008298

⚠️ CRITICAL: Required CV is NEGATIVE - target is unreachable with current relationship!

This means we need to CHANGE the relationship, not just improve CV.

Possible ways to change the relationship:
1. Use a fundamentally different model architecture
2. Use different features that generalize better to LB
3. Use a different training strategy (e.g., transductive learning)
4. Calibrate predictions specifically for OOD solvents


In [3]:
# Analyze what approaches have been tried
print("\n" + "="*60)
print("APPROACHES TRIED (54 experiments)")
print("="*60)

approaches_tried = {
    'MLP architectures': ['baseline', 'simpler [64,32]', 'deeper [256,128,64]', 'residual', 'attention'],
    'Tree-based models': ['LightGBM', 'XGBoost', 'RandomForest'],
    'Gaussian Processes': ['GP with RBF kernel', 'GP with Matern kernel'],
    'Ensembles': ['MLP bagging', 'GP+MLP+LGBM', 'MLP+XGB+RF+LGBM'],
    'Feature sets': ['Spange only', 'DRFP only', 'Spange+DRFP', 'ACS PCA', 'Arrhenius kinetics'],
    'Regularization': ['Dropout 0.1-0.4', 'Weight decay 1e-5 to 1e-3', 'L1/L2'],
    'Loss functions': ['MSE', 'Huber', 'Weighted MSE'],
    'Data augmentation': ['TTA for mixtures', 'Both orderings'],
}

for category, items in approaches_tried.items():
    print(f"\n{category}:")
    for item in items:
        print(f"  - {item}")


APPROACHES TRIED (54 experiments)

MLP architectures:
  - baseline
  - simpler [64,32]
  - deeper [256,128,64]
  - residual
  - attention

Tree-based models:
  - LightGBM
  - XGBoost
  - RandomForest

Gaussian Processes:
  - GP with RBF kernel
  - GP with Matern kernel

Ensembles:
  - MLP bagging
  - GP+MLP+LGBM
  - MLP+XGB+RF+LGBM

Feature sets:
  - Spange only
  - DRFP only
  - Spange+DRFP
  - ACS PCA
  - Arrhenius kinetics

Regularization:
  - Dropout 0.1-0.4
  - Weight decay 1e-5 to 1e-3
  - L1/L2

Loss functions:
  - MSE
  - Huber
  - Weighted MSE

Data augmentation:
  - TTA for mixtures
  - Both orderings


In [4]:
# What approaches have NOT been tried or tried poorly?
print("\n" + "="*60)
print("APPROACHES NOT YET TRIED (or tried poorly)")
print("="*60)

approaches_not_tried = [
    ("1. Domain Adaptation", "Train on source domain, adapt to target domain", "NOT TRIED"),
    ("2. Meta-Learning (MAML)", "Learn to quickly adapt to new solvents", "NOT TRIED"),
    ("3. Transductive Learning", "Process test samples jointly with training", "NOT TRIED"),
    ("4. Adversarial Training", "Train to be robust to distribution shift", "NOT TRIED"),
    ("5. Importance Weighting", "Re-weight training samples by density ratio", "NOT TRIED"),
    ("6. Test-Time Adaptation", "Adapt model at test time using unlabeled data", "NOT TRIED"),
    ("7. Proper GNN", "Graph attention network with proper implementation", "TRIED POORLY"),
    ("8. Ensemble of Diverse Architectures", "GNN + MLP + LGBM", "NOT TRIED"),
    ("9. Prediction Calibration", "Post-hoc calibration of predictions", "NOT TRIED"),
    ("10. Uncertainty-Weighted Ensemble", "Weight predictions by uncertainty", "NOT TRIED"),
]

for name, desc, status in approaches_not_tried:
    print(f"\n{name}")
    print(f"  Description: {desc}")
    print(f"  Status: {status}")


APPROACHES NOT YET TRIED (or tried poorly)

1. Domain Adaptation
  Description: Train on source domain, adapt to target domain
  Status: NOT TRIED

2. Meta-Learning (MAML)
  Description: Learn to quickly adapt to new solvents
  Status: NOT TRIED

3. Transductive Learning
  Description: Process test samples jointly with training
  Status: NOT TRIED

4. Adversarial Training
  Description: Train to be robust to distribution shift
  Status: NOT TRIED

5. Importance Weighting
  Description: Re-weight training samples by density ratio
  Status: NOT TRIED

6. Test-Time Adaptation
  Description: Adapt model at test time using unlabeled data
  Status: NOT TRIED

7. Proper GNN
  Description: Graph attention network with proper implementation
  Status: TRIED POORLY

8. Ensemble of Diverse Architectures
  Description: GNN + MLP + LGBM
  Status: NOT TRIED

9. Prediction Calibration
  Description: Post-hoc calibration of predictions
  Status: NOT TRIED

10. Uncertainty-Weighted Ensemble
  Description: Weight predictions by uncertainty
  Status: NOT TRIED
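
Item 10 (Uncertainty-Weighted Ensemble) can be made concrete with inverse-variance weighting: each model's prediction is weighted by the reciprocal of its predictive variance, so the more confident model dominates for each sample. This is a minimal sketch, assuming every ensemble member exposes a per-sample standard deviation. A GP provides one natively; for an MLP or LightGBM it would have to come from bagging or a similar procedure. The function name and the toy numbers are hypothetical.

```python
import numpy as np


def inverse_variance_ensemble(means, stds, eps=1e-12):
    """Combine per-model predictions using inverse-variance weights.

    means, stds: arrays of shape (n_models, n_samples).
    Returns the combined mean and combined standard deviation per sample.
    """
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    precision = 1.0 / (stds ** 2 + eps)          # higher = more confident
    weights = precision / precision.sum(axis=0)  # normalise per sample
    combined_mean = (weights * means).sum(axis=0)
    # Variance of the combination if the model errors were independent.
    combined_std = np.sqrt(1.0 / precision.sum(axis=0))
    return combined_mean, combined_std


# Toy example: two models, three samples (made-up numbers).
means = [[0.10, 0.20, 0.30],
         [0.14, 0.20, 0.10]]
stds = [[0.01, 0.02, 0.05],
        [0.01, 0.04, 0.01]]
ens_mean, ens_std = inverse_variance_ensemble(means, stds)
print(ens_mean, ens_std)
```

Unlike the fixed weights in exp_030, these weights vary per sample. That only helps if the reported uncertainties are themselves reliable on unseen solvents, which would need to be verified.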

In [5]:
# Key insight: The CV-LB gap is due to distribution shift
print("\n" + "="*60)
print("KEY INSIGHT: Distribution Shift Analysis")
print("="*60)

print("""
The CV-LB gap is caused by distribution shift between:
1. Training solvents (23 solvents in each fold)
2. Test solvent (1 held-out solvent)

The held-out solvent is CHEMICALLY DIFFERENT from training solvents.
This is why the model fails to generalize.

High-error solvents (from previous analysis):
- HFIP (fluorinated): CV error = 0.038
- TFE (fluorinated): CV error = 0.015
- Cyclohexane (non-polar): CV error = 0.026
- Water.Ethanol (polar protic): CV error = 0.028

These solvents are OUTLIERS in the feature space.
""")


KEY INSIGHT: Distribution Shift Analysis

The CV-LB gap is caused by distribution shift between:
1. Training solvents (23 solvents in each fold)
2. Test solvent (1 held-out solvent)

The held-out solvent is CHEMICALLY DIFFERENT from training solvents.
This is why the model fails to generalize.

High-error solvents (from previous analysis):
- HFIP (fluorinated): CV error = 0.038
- TFE (fluorinated): CV error = 0.015
- Cyclohexane (non-polar): CV error = 0.026
- Water.Ethanol (polar protic): CV error = 0.028

These solvents are OUTLIERS in the feature space.
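
The claim that the high-error solvents are outliers can be checked numerically by measuring how far each solvent sits from its nearest neighbour in standardised feature space. The sketch below is an illustration under assumptions: `features` stands in for a solvent-by-descriptor matrix (for example Spange descriptors), and the random values are placeholders, not real data.

```python
import numpy as np


def nearest_neighbour_distance(features):
    """Distance from each row to its closest other row, after z-scoring."""
    x = np.asarray(features, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0                      # guard against constant columns
    z = (x - x.mean(axis=0)) / std
    # Pairwise Euclidean distances, shape (n, n).
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)           # ignore self-distance
    return dist.min(axis=1)


# Placeholder data: 24 hypothetical solvents x 5 descriptors, with the last
# row pushed far from the rest to mimic an out-of-distribution solvent.
rng = np.random.default_rng(0)
features = rng.normal(size=(24, 5))
features[-1] += 8.0

nn_dist = nearest_neighbour_distance(features)
most_isolated = int(np.argmax(nn_dist))
print(f"Most isolated solvent index: {most_isolated}")
```

Plotting per-solvent CV error against this distance for the real descriptors would show whether isolation in feature space actually tracks the errors listed above.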



In [6]:
# What would change the CV-LB relationship?
print("\n" + "="*60)
print("WHAT COULD CHANGE THE CV-LB RELATIONSHIP?")
print("="*60)

print("""
1. **Better OOD Generalization**
   - Use features that capture fundamental chemistry
   - Use simpler models that don't overfit to training distribution
   - Use regularization that encourages generalization

2. **Domain Adaptation**
   - Re-weight training samples to match test distribution
   - Use adversarial training to learn domain-invariant features
   - Use importance weighting based on density ratio

3. **Prediction Calibration**
   - Learn a calibration function from CV-LB relationship
   - Apply calibration to shift predictions toward LB
   - This could reduce the intercept

4. **Ensemble Diversity**
   - Use models with different inductive biases
   - Some models may have better CV-LB correlation
   - Weight models by their CV-LB correlation
""")


WHAT COULD CHANGE THE CV-LB RELATIONSHIP?

1. **Better OOD Generalization**
   - Use features that capture fundamental chemistry
   - Use simpler models that don't overfit to training distribution
   - Use regularization that encourages generalization

2. **Domain Adaptation**
   - Re-weight training samples to match test distribution
   - Use adversarial training to learn domain-invariant features
   - Use importance weighting based on density ratio

3. **Prediction Calibration**
   - Learn a calibration function from CV-LB relationship
   - Apply calibration to shift predictions toward LB
   - This could reduce the intercept

4. **Ensemble Diversity**
   - Use models with different inductive biases
   - Some models may have better CV-LB correlation
   - Weight models by their CV-LB correlation
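
The importance-weighting idea under Domain Adaptation can be sketched as follows: estimate the density ratio w(x) = p_test(x) / p_train(x) at each training point, then fit the model with those weights. This example uses `scipy.stats.gaussian_kde` for the densities and a weighted linear least-squares fit as a stand-in model. All data are synthetic, the function names are hypothetical, and it assumes unlabeled test features are available. Kernel density estimates are fragile in high dimensions and with few test points, so this is a starting point rather than a recipe.

```python
import numpy as np
from scipy.stats import gaussian_kde


def density_ratio_weights(x_train, x_test, clip=10.0):
    """Estimate w(x) = p_test(x) / p_train(x) at the training points via KDE."""
    p_train = gaussian_kde(x_train.T)   # gaussian_kde expects shape (d, n)
    p_test = gaussian_kde(x_test.T)
    ratio = p_test(x_train.T) / np.maximum(p_train(x_train.T), 1e-300)
    ratio = np.clip(ratio, 0.0, clip)   # cap extreme weights for stability
    return ratio / ratio.mean()         # normalise to mean 1


def weighted_least_squares(x, y, weights):
    """Fit y ~ x @ coef + bias, minimising the weighted squared error."""
    design = np.column_stack([x, np.ones(len(x))])
    sw = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(design * sw, y * sw[:, 0], rcond=None)
    return coef


# Synthetic placeholder data: test features are shifted relative to training.
rng = np.random.default_rng(0)
x_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
x_test = rng.normal(loc=1.0, scale=0.5, size=(50, 2))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=200)

weights = density_ratio_weights(x_train, x_test)
coef = weighted_least_squares(x_train, y_train, weights)
print(f"Weight range: {weights.min():.3f} to {weights.max():.3f}")
```

Training points that resemble the test distribution receive large weights and distant ones are nearly ignored, which trades variance for a fit that targets the test region.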



In [7]:
# Analyze the residuals from CV-LB relationship
df['predicted_lb'] = slope * df['cv'] + intercept
df['residual'] = df['lb'] - df['predicted_lb']

print("Residual Analysis:")
print(df[['exp', 'model', 'cv', 'lb', 'predicted_lb', 'residual']].to_string())

print(f"\nMean residual: {df['residual'].mean():.6f}")
print(f"Std residual: {df['residual'].std():.6f}")

# Which experiments beat the predicted LB?
print(f"\nExperiments that beat predicted LB (negative residual):")
for _, row in df[df['residual'] < 0].iterrows():
    print(f"  {row['exp']} ({row['model']}): residual = {row['residual']:.6f}")

Residual Analysis:
        exp             model      cv      lb  predicted_lb  residual
0   exp_000      MLP baseline  0.0111  0.0982      0.100413 -0.002213
1   exp_001          LightGBM  0.0123  0.1065      0.105591  0.000909
2   exp_003   Spange+DRFP MLP  0.0105  0.0972      0.097825 -0.000625
3   exp_005    Large ensemble  0.0104  0.0969      0.097393 -0.000493
4   exp_006       Simpler MLP  0.0097  0.0946      0.094373  0.000227
5   exp_007      Even simpler  0.0093  0.0932      0.092647  0.000553
6   exp_009  Ridge regression  0.0092  0.0936      0.092215  0.001385
7   exp_012   Simple ensemble  0.0090  0.0913      0.091353 -0.000053
8   exp_024           ACS PCA  0.0087  0.0893      0.090058 -0.000758
9   exp_026     Weighted loss  0.0085  0.0887      0.089195 -0.000495
10  exp_030       GP+MLP+LGBM  0.0083  0.0877      0.088332 -0.000632
11  exp_035   Lower GP weight  0.0098  0.0970      0.094804  0.002196

Mean residual: -0.000000
Std residual: 0.001155

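Experiments that beat predicted LB (negative residual):
  exp_000 (MLP baseline): residual = -0.002213
  exp_003 (Spange+DRFP MLP): residual = -0.000625
  exp_005 (Large ensemble): residual = -0.000493
  exp_012 (Simple ensemble): residual = -0.000053
  exp_024 (ACS PCA): residual = -0.000758
  exp_026 (Weighted loss): residual = -0.000495
  exp_030 (GP+MLP+LGBM): residual = -0.000632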
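
Since exp_000 and exp_035 carry the largest residuals and exp_001 sits at the edge of the CV range, it is reasonable to ask how much any single submission drives the fitted line. Below is a minimal sketch of a leave-one-out refit on the same 12 points; the variable names are illustrative.

```python
import numpy as np

exps = ['exp_000', 'exp_001', 'exp_003', 'exp_005', 'exp_006', 'exp_007',
        'exp_009', 'exp_012', 'exp_024', 'exp_026', 'exp_030', 'exp_035']
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])

loo_fits = {}
for i, name in enumerate(exps):
    keep = np.arange(len(cv)) != i           # drop one submission
    s, b = np.polyfit(cv[keep], lb[keep], 1)
    loo_fits[name] = (s, b)
    print(f"without {name}: LB = {s:.2f}*CV + {b:.4f}")

loo_intercepts = np.array([b for _, b in loo_fits.values()])
print(f"Intercept range: {loo_intercepts.min():.4f} to {loo_intercepts.max():.4f}")
```

If every refit keeps the intercept above the target, the conclusion does not hinge on one unusual submission.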

In [8]:
# What if we could reduce the intercept?
print("\n" + "="*60)
print("SCENARIO ANALYSIS: What if we could reduce the intercept?")
print("="*60)

for new_intercept in [0.05, 0.04, 0.03, 0.02, 0.01, 0.0]:
    new_lb = slope * 0.008298 + new_intercept
    status = "✓ BEATS TARGET" if new_lb < 0.0347 else ""
    print(f"Intercept = {new_intercept:.2f} → LB = {new_lb:.4f} {status}")

print("\nTo beat target (0.0347) at this slope and CV, we need intercept < about -0.001 (impossible with current relationship)")
print("OR we need to CHANGE the slope/intercept relationship entirely.")


SCENARIO ANALYSIS: What if we could reduce the intercept?
Intercept = 0.05 → LB = 0.0858 
Intercept = 0.04 → LB = 0.0758 
Intercept = 0.03 → LB = 0.0658 
Intercept = 0.02 → LB = 0.0558 
Intercept = 0.01 → LB = 0.0458 
Intercept = 0.00 → LB = 0.0358 

To beat target (0.0347) at this slope and CV, we need intercept < about -0.001 (impossible with current relationship)
OR we need to CHANGE the slope/intercept relationship entirely.
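
The scenario table holds the slope and the best CV fixed. Solving LB = slope * CV + intercept for the one free parameter gives the exact threshold in each direction. The helper names below are illustrative, and the rounded fit values (4.31 and 0.0525) are used in place of the unrounded ones.

```python
def required_intercept(slope, cv, target):
    """Intercept needed so that slope * cv + intercept equals the target."""
    return target - slope * cv


def required_slope(intercept, cv, target):
    """Slope needed so that slope * cv + intercept equals the target."""
    return (target - intercept) / cv


# Values taken from the fit and the best experiment reported above.
fitted_slope, fitted_intercept = 4.31, 0.0525
best_cv, target_lb = 0.008298, 0.0347

print(f"Intercept needed at slope {fitted_slope}: "
      f"{required_intercept(fitted_slope, best_cv, target_lb):.4f}")
print(f"Slope needed at intercept {fitted_intercept}: "
      f"{required_slope(fitted_intercept, best_cv, target_lb):.2f}")
```

Both thresholds come out negative, so neither parameter can be moved on its own to reach the target at the current best CV; slope and intercept would have to change together.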


In [9]:
# Final recommendation
print("\n" + "="*60)
print("FINAL RECOMMENDATION")
print("="*60)

print("""
Given that:
1. 23 consecutive experiments have failed to beat exp_030
2. The CV-LB relationship has intercept > target
3. All refinements to GP + MLP + LGBM have failed

We need to try something FUNDAMENTALLY DIFFERENT:

**PRIORITY 1: Prediction Calibration**
- Learn a calibration function from the CV-LB relationship
- Apply calibration to shift predictions toward LB
- This could reduce the effective intercept

**PRIORITY 2: Simpler Model with Stronger Regularization**
- The GP component in exp_030 provides strong regularization
- Try GP-only model with optimized hyperparameters
- Or try Ridge regression with very strong regularization

**PRIORITY 3: Feature Simplification**
- Use only the most fundamental features (Spange only)
- Drop DRFP features which may be causing overfitting
- Focus on features that capture fundamental chemistry

**PRIORITY 4: Submit exp_030 with Calibration**
- Apply a simple calibration to exp_030 predictions
- This is the safest approach with 5 submissions remaining

**DO NOT:**
- Try more complex architectures (they don't help)
- Try more features (DRFP consistently hurts)
- Try more ensemble variations (exhausted)
""")


FINAL RECOMMENDATION

Given that:
1. 23 consecutive experiments have failed to beat exp_030
2. The CV-LB relationship has intercept > target
3. All refinements to GP + MLP + LGBM have failed

We need to try something FUNDAMENTALLY DIFFERENT:

**PRIORITY 1: Prediction Calibration**
- Learn a calibration function from the CV-LB relationship
- Apply calibration to shift predictions toward LB
- This could reduce the effective intercept

**PRIORITY 2: Simpler Model with Stronger Regularization**
- The GP component in exp_030 provides strong regularization
- Try GP-only model with optimized hyperparameters
- Or try Ridge regression with very strong regularization

**PRIORITY 3: Feature Simplification**
- Use only the most fundamental features (Spange only)
- Drop DRFP features which may be causing overfitting
- Focus on features that capture fundamental chemistry

**PRIORITY 4: Submit exp_030 with Calibration**
- Apply a simple calibration to exp_030 predictions
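- This is the safest approach with 5 submissions remaining

**DO NOT:**
- Try more complex architectures (they don't help)
- Try more features (DRFP consistently hurts)
- Try more ensemble variations (exhausted)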
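
One concrete reading of "simple calibration" is an affine correction fitted on out-of-fold predictions and then applied to test predictions. This is a sketch of that generic technique, not the calibration used in any experiment here. `oof_pred`, `y_true` and `test_pred` are hypothetical arrays filled with synthetic values, and the clipping range [0, 1] is an assumption about the target scale.

```python
import numpy as np


def fit_affine_calibration(pred, target):
    """Least-squares fit of target ~ scale * pred + shift."""
    scale, shift = np.polyfit(pred, target, 1)
    return scale, shift


def apply_affine_calibration(pred, scale, shift, lower=0.0, upper=1.0):
    """Apply the correction and clip to an assumed valid range [lower, upper]."""
    return np.clip(scale * np.asarray(pred) + shift, lower, upper)


# Synthetic placeholder data: predictions that are systematically compressed.
rng = np.random.default_rng(0)
y_true = rng.uniform(0.0, 1.0, size=300)
oof_pred = 0.6 * y_true + 0.2 + 0.02 * rng.normal(size=300)

scale, shift = fit_affine_calibration(oof_pred, y_true)
test_pred = np.array([0.25, 0.50, 0.75])       # hypothetical raw predictions
calibrated = apply_affine_calibration(test_pred, scale, shift)

mse_before = np.mean((oof_pred - y_true) ** 2)
mse_after = np.mean((apply_affine_calibration(oof_pred, scale, shift) - y_true) ** 2)
print(f"OOF MSE before: {mse_before:.5f}, after: {mse_after:.5f}")
```

A correction like this can only lower the LB score if the bias seen on held-out folds resembles the bias on the leaderboard solvents, which the CV-LB gap suggests may not be the case.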

In [None]:
# Summary
print("\n" + "="*60)
print("SUMMARY")
print("="*60)

print(f"""
Current State:
- Best CV: 0.008298 (exp_030)
- Best LB: 0.08772 (exp_030)
- Target: 0.0347
- Gap: 2.53x

CV-LB Relationship:
- LB = 4.31*CV + 0.0525
- Intercept (0.0525) > Target (0.0347)
- Required CV to hit target: NEGATIVE (impossible)

Key Insight:
- The target is UNREACHABLE by improving CV alone
- We need to CHANGE the CV-LB relationship
- This requires fundamentally different approaches

Remaining Submissions: 5
Best Model: exp_030 (GP 0.15 + MLP 0.55 + LGBM 0.3)

Next Steps:
1. Try prediction calibration
2. Try simpler model with stronger regularization
3. Try feature simplification
""")
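
The second next step (a simpler model with stronger regularization) can be prototyped with closed-form ridge regression and a sweep over the penalty, scored by leave-one-group-out error to mirror the leave-one-solvent-out setup. The sketch below runs on synthetic placeholder data; the function names, the 24 groups and the alpha grid are assumptions for illustration.

```python
import numpy as np


def ridge_fit(x, y, alpha):
    """Closed-form ridge regression on centred data; returns (coef, bias)."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc, yc = x - x_mean, y - y_mean
    gram = xc.T @ xc + alpha * np.eye(x.shape[1])
    coef = np.linalg.solve(gram, xc.T @ yc)
    return coef, y_mean - x_mean @ coef


def leave_one_group_out_mse(x, y, groups, alpha):
    """Mean squared error when each group is held out in turn."""
    errors = []
    for g in np.unique(groups):
        test = groups == g
        coef, bias = ridge_fit(x[~test], y[~test], alpha)
        errors.append(np.mean((x[test] @ coef + bias - y[test]) ** 2))
    return float(np.mean(errors))


# Synthetic placeholder data: 24 groups standing in for solvents.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(24), 10)
x = rng.normal(size=(240, 8))
y = x[:, 0] - 0.5 * x[:, 1] + 0.1 * rng.normal(size=240)

for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    mse = leave_one_group_out_mse(x, y, groups, alpha)
    print(f"alpha={alpha:>7}: LOGO MSE = {mse:.5f}")
```

On the real data the interesting question is not which alpha minimises CV error, but whether heavily regularized fits land below the existing CV-LB line when submitted.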