# Loop 43 Analysis: Strategic Assessment

## Key Findings from Experiment 042 (Calibration)

1. **Stronger regularization HURTS**: CV 0.010008 (22.1% worse than baseline 0.008194)
2. **Post-hoc calibration provides only 4.63% improvement** - but can't be used in submission
3. **Predictions are already well-calibrated** - mean error is only -0.005
4. **Outlier solvents dominate error**: The fluorinated alcohols sit well above the mean per-solvent MSE (HFIP ≈ 4.5x, TFE ≈ 1.6x)

## Critical CV-LB Gap Analysis

The evaluator correctly identified that:
- LB = 4.29*CV + 0.0528 (R²=0.97)
- Intercept (0.0528) > Target (0.0347)
- This means even CV=0 would give LB=0.0528 > target

**BUT THE TARGET EXISTS** - someone achieved 0.0347. This means there's an approach with a DIFFERENT CV-LB relationship.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
    {'exp': 'exp_035', 'cv': 0.0098, 'lb': 0.0970},
]

df = pd.DataFrame(submissions)
print("Submission History:")
print(df.to_string(index=False))

In [None]:
# Fit linear regression to CV-LB relationship
from sklearn.linear_model import LinearRegression

X = df['cv'].values.reshape(-1, 1)
y = df['lb'].values

lr = LinearRegression()
lr.fit(X, y)

slope = lr.coef_[0]
intercept = lr.intercept_
r2 = lr.score(X, y)

print(f"CV-LB Relationship: LB = {slope:.2f} * CV + {intercept:.4f}")
print(f"R² = {r2:.4f}")
print()
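target = 0.0347
print(f"Target: {target}")
print(f"Intercept: {intercept:.4f}")
print(f"Gap: Intercept - Target = {intercept - target:.4f}")
print()
# Only raise the alarm when the fitted intercept really exceeds the target
if intercept > target:
    print("CRITICAL: Intercept > Target means current approach CANNOT reach target!")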

In [None]:
# What CV would we need to reach target?
target = 0.0347
required_cv = (target - intercept) / slope
print(f"Required CV to reach target: {required_cv:.6f}")
print()
if required_cv < 0:
    print("IMPOSSIBLE: Required CV is NEGATIVE!")
    print("We need a fundamentally different approach that changes the CV-LB relationship.")
else:
    print(f"We need to improve CV from {df['cv'].min():.4f} to {required_cv:.4f}")
    print(f"That's a {(df['cv'].min() - required_cv) / df['cv'].min() * 100:.1f}% improvement")

In [None]:
# Analyze the outlier solvents from exp_042
outlier_solvents = {
    '1,1,1,3,3,3-Hexafluoropropan-2-ol': 0.040084,
    'Acetonitrile.Acetic Acid': 0.021430,
    'Dimethyl Carbonate': 0.016953,
    '2,2,2-Trifluoroethanol': 0.014613,
    'Diethyl Ether [Ether]': 0.014008,
    'Ethylene Glycol [1,2-Ethanediol]': 0.013649,
}

mean_mse = 0.008972
median_mse = 0.007715

print("Outlier Solvents (MSE > mean):")
for solvent, mse in outlier_solvents.items():
    ratio = mse / mean_mse
    print(f"  {solvent}: MSE = {mse:.6f} ({ratio:.1f}x mean)")

print(f"\nMean MSE: {mean_mse:.6f}")
print(f"Median MSE: {median_mse:.6f}")
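n_solvents = 24  # number of single-solvent folds
top2_contribution = sum(sorted(outlier_solvents.values(), reverse=True)[:2]) / n_solvents
print(f"\nTop 2 outliers contribute {top2_contribution:.6f} to mean MSE")
print(f"That's {top2_contribution / mean_mse * 100:.1f}% of total error from just 2 solvents!")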

In [None]:
# What if we could perfectly predict the outlier solvents?
# Estimate impact on CV

# Current mean MSE: 0.008972 (24 solvents)
# If HFIP (0.040084) and TFE (0.014613) were perfect (0.0):
# New mean = (0.008972 * 24 - 0.040084 - 0.014613) / 24

n_solvents = 24  # number of single-solvent folds
current_total = mean_mse * n_solvents
outlier_contribution = (
    outlier_solvents['1,1,1,3,3,3-Hexafluoropropan-2-ol']
    + outlier_solvents['2,2,2-Trifluoroethanol']
)  # HFIP + TFE
new_total = current_total - outlier_contribution
new_mean = new_total / n_solvents

print(f"Current mean MSE: {mean_mse:.6f}")
print(f"If fluorinated alcohols were perfect: {new_mean:.6f}")
print(f"Improvement: {(mean_mse - new_mean) / mean_mse * 100:.1f}%")
print()
print(f"Even with perfect fluorinated alcohol predictions, we'd still have CV ~{new_mean:.4f}")
print(f"Predicted LB with CV={new_mean:.4f}: {slope * new_mean + intercept:.4f}")
print(f"Still far from target {target}!")

## Key Insight: The Problem is NOT Just Outlier Solvents

Even if we perfectly predicted the fluorinated alcohols, we'd still have:
- CV ≈ 0.0067
- Predicted LB ≈ 0.082 (from the fitted CV-LB line)
- An LB about 2.4x the target (0.0347)

**The fundamental issue is the CV-LB relationship itself.**
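
That conclusion rests on a line fitted to only 12 submissions, so it is worth checking how much the intercept moves when any single point is dropped. The following is a minimal sketch, not part of the original analysis: it refits the line with `np.polyfit`, leaving out one submission at a time, and reuses the CV and LB values from the submission history. The names `loo_fit` and `loo_intercepts` are illustrative.

```python
import numpy as np

# CV / LB pairs copied from the submission history
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])
target = 0.0347


def loo_fit(x, y):
    """Refit y = slope * x + intercept with each point left out once."""
    fits = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        # polyfit returns coefficients from highest degree down: [slope, intercept]
        slope, intercept = np.polyfit(x[keep], y[keep], deg=1)
        fits.append((slope, intercept))
    return np.array(fits)


fits = loo_fit(cv, lb)
loo_slopes, loo_intercepts = fits[:, 0], fits[:, 1]

print(f"Slope range:     {loo_slopes.min():.2f} to {loo_slopes.max():.2f}")
print(f"Intercept range: {loo_intercepts.min():.4f} to {loo_intercepts.max():.4f}")
print(f"Intercept above target in every refit: {bool((loo_intercepts > target).all())}")
```

If the intercept stays above the target in every refit, the conclusion does not hinge on one submission. It still assumes the relationship remains linear outside the observed CV range.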

## What Could Change the CV-LB Relationship?

1. **Different evaluation scheme** - But we're using the exact template
2. **Different features** - We've tried many combinations
3. **Different model architecture** - GNN failed for us
4. **Post-processing** - Calibration doesn't help

## Unexplored Direction: Non-Linear Mixing for Mixtures

The current approach uses LINEAR mixing of solvent descriptors:
```
spange_mix = (1 - pct_b) * spange_a + pct_b * spange_b
```

But real solvent mixtures often exhibit NON-LINEAR behavior:
- Synergistic effects
- Antagonistic effects
- Phase separation

**What if we added interaction terms?**
```
spange_mix = (1 - pct_b) * spange_a + pct_b * spange_b + c * spange_a * spange_b
```
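
A minimal sketch of the two mixing rules, assuming `spange_a` and `spange_b` are NumPy arrays of per-solvent descriptors and `c` is a hypothetical interaction strength (a scalar here, though it could be a per-descriptor vector). One caveat about the formula as written: the term `c * spange_a * spange_b` does not vanish at `pct_b = 0` or `pct_b = 1`, so a pure solvent would no longer map to its own descriptors. The variant below scales the interaction by `pct_b * (1 - pct_b)` to avoid that; this scaling is an assumption and has not been tested in these experiments.

```python
import numpy as np


def mix_linear(spange_a, spange_b, pct_b):
    """Linear mixing, as in the current approach."""
    return (1 - pct_b) * spange_a + pct_b * spange_b


def mix_with_interaction(spange_a, spange_b, pct_b, c=1.0):
    """Linear mixing plus an interaction term.

    The interaction is scaled by pct_b * (1 - pct_b) so that it is zero
    for pure solvents (assumption; `c` is a hypothetical strength).
    """
    interaction = c * pct_b * (1 - pct_b) * spange_a * spange_b
    return mix_linear(spange_a, spange_b, pct_b) + interaction


# Toy descriptors (made-up values, for illustration only)
a = np.array([0.2, 1.0, 0.5])
b = np.array([0.8, 0.4, 0.5])

pure_a = mix_with_interaction(a, b, pct_b=0.0)   # reduces to a
pure_b = mix_with_interaction(a, b, pct_b=1.0)   # reduces to b
half = mix_with_interaction(a, b, pct_b=0.5)     # linear mix plus 0.25 * a * b
print(pure_a, pure_b, half)
```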

In [None]:
# Let's check how much of our data is mixtures vs single solvents
n_single = 656   # single-solvent samples (24 solvents)
n_full = 1227    # full data, including mixtures
n_mixture = n_full - n_single

print("Data composition:")
print(f"  Single solvent: {n_single} samples (24 solvents)")
print(f"  Full data (mixtures): {n_full} samples")
print(f"  Mixture samples: {n_full} - {n_single} = {n_mixture} samples")
print()
print(f"Mixture data is significant (~{n_mixture / n_full:.0%} of full data)")
print("If mixture predictions are systematically worse, this could explain the CV-LB gap")

## Strategic Recommendations

### Priority 1: Non-Linear Mixture Features
Add interaction terms between solvent A and B descriptors:
- `spange_a * spange_b` (element-wise product)
- `|spange_a - spange_b|` (absolute difference)
- Polynomial mixing: `a*A + b*B + c*A*B + d*A²*B + e*A*B²`
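
One way the candidate features above could be assembled into a single design matrix. This is a sketch under assumptions: `spange_a` and `spange_b` are `(n_samples, n_descriptors)` arrays, `pct_b` is the fraction of solvent B in `[0, 1]`, and `build_mixture_features` is a hypothetical helper. For the polynomial option, the coefficients `a` to `e` are left to the downstream model, so only the basis terms are built here, with `A` and `B` interpreted as the fraction-weighted descriptors.

```python
import numpy as np


def build_mixture_features(spange_a, spange_b, pct_b):
    """Stack linear-mix, product, |difference| and polynomial basis terms.

    Assumes spange_a and spange_b have shape (n_samples, n_descriptors)
    and pct_b has shape (n_samples,) with values in [0, 1].
    """
    w_b = np.asarray(pct_b, dtype=float)[:, None]  # column vector for broadcasting
    A = (1.0 - w_b) * spange_a   # fraction-weighted descriptors of solvent A
    B = w_b * spange_b           # fraction-weighted descriptors of solvent B

    blocks = [
        A + B,                         # current linear mixing
        spange_a * spange_b,           # element-wise product
        np.abs(spange_a - spange_b),   # absolute difference
        A * B,                         # polynomial basis: A*B
        A**2 * B,                      # polynomial basis: A^2*B
        A * B**2,                      # polynomial basis: A*B^2
    ]
    return np.hstack(blocks)


# Random stand-in data: 5 samples, 4 descriptors per solvent
rng = np.random.default_rng(0)
spange_a = rng.normal(size=(5, 4))
spange_b = rng.normal(size=(5, 4))
pct_b = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

features = build_mixture_features(spange_a, spange_b, pct_b)
print(features.shape)
```

The product and absolute-difference blocks are symmetric in A and B, which matters if the A/B ordering in the data is arbitrary. The polynomial blocks are zero for pure solvents under this interpretation.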

### Priority 2: Separate Models for Single Solvents and Mixtures
Train separate models for:
- Single solvents (24 folds)
- Mixtures (13 folds)

Then combine the predictions, routing each sample to the model trained on its subset.
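
A minimal sketch of the routing idea, using an ordinary least-squares fit as a stand-in for whatever model is actually trained on each subset. The `is_mixture` mask, the synthetic data, and the helpers `fit_split_models` and `predict_split` are all hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with a bias column (stand-in for the real model)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply a fitted coefficient vector (last entry is the bias)."""
    return np.hstack([X, np.ones((len(X), 1))]) @ coef


def fit_split_models(X, y, is_mixture):
    """Fit one model on single-solvent rows and one on mixture rows."""
    return {
        "single": fit_linear(X[~is_mixture], y[~is_mixture]),
        "mixture": fit_linear(X[is_mixture], y[is_mixture]),
    }


def predict_split(models, X, is_mixture):
    """Route each row to the model trained on its subset."""
    preds = np.empty(len(X))
    preds[~is_mixture] = predict_linear(models["single"], X[~is_mixture])
    preds[is_mixture] = predict_linear(models["mixture"], X[is_mixture])
    return preds


# Synthetic stand-in data (not the real dataset): the two subsets
# follow different target functions on purpose.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
is_mixture = rng.random(200) < 0.5
y = np.where(is_mixture, X[:, 0] - X[:, 1], 2.0 * X[:, 0] + X[:, 2])
y = y + 0.05 * rng.normal(size=200)

preds_split = predict_split(fit_split_models(X, y, is_mixture), X, is_mixture)
preds_global = predict_linear(fit_linear(X, y), X)

mse_split = np.mean((preds_split - y) ** 2)
mse_global = np.mean((preds_global - y) ** 2)
print(f"Train MSE, split models: {mse_split:.4f}")
print(f"Train MSE, single model: {mse_global:.4f}")
```

The synthetic targets are constructed so that splitting helps. On the real data, each subset model sees fewer samples, so the split may or may not pay off and should be judged by CV.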

### Priority 3: Focus on Mixture Predictions
The CV-LB gap might be driven by mixture predictions being worse on LB.
Analyze mixture vs single solvent errors separately.
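
A sketch of that split error analysis, assuming out-of-fold predictions are available alongside a label saying which subset each sample belongs to. The helper `mse_by_group` is hypothetical, and the numbers below are made up purely to show the mechanics.

```python
import numpy as np


def mse_by_group(y_true, y_pred, groups):
    """Return {label: (mse, n_samples)} for each distinct group label."""
    y_true, y_pred, groups = (np.asarray(v) for v in (y_true, y_pred, groups))
    sq_err = (y_true - y_pred) ** 2
    result = {}
    for label in np.unique(groups):
        mask = groups == label
        result[str(label)] = (float(sq_err[mask].mean()), int(mask.sum()))
    return result


# Made-up out-of-fold predictions; replace with real CV outputs
y_true = [0.10, 0.50, 0.90, 0.20, 0.60, 0.80]
y_pred = [0.12, 0.47, 0.88, 0.30, 0.45, 0.95]
subset = ["single", "single", "single", "mixture", "mixture", "mixture"]

errors = mse_by_group(y_true, y_pred, subset)
for label, (mse, n) in errors.items():
    print(f"{label:8s} MSE = {mse:.6f} (n = {n})")

ratio = errors["mixture"][0] / errors["single"][0]
print(f"Mixture MSE / single-solvent MSE: {ratio:.1f}")
```

A ratio well above 1 on the real out-of-fold predictions would support the hypothesis that mixtures drive the gap. A ratio near 1 would point elsewhere.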

### Priority 4: Try Different Ensemble Weights for Mixtures
The optimal GP/MLP/LGBM weights might be different for mixtures vs single solvents.
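
A sketch of fitting blend weights separately per subset, using a coarse grid search over non-negative weights that sum to 1. It assumes out-of-fold predictions from exactly three models are available as columns of an array. `best_simplex_weights` is a hypothetical helper, and the data is synthetic, built so that the "GP" column is more accurate on single solvents and the "LGBM" column on mixtures.

```python
import itertools
import numpy as np


def best_simplex_weights(preds, y, step=0.05):
    """Grid-search three non-negative weights summing to 1 that minimise MSE.

    preds: array of shape (n_samples, 3), one column per model.
    """
    n_steps = int(round(1 / step))
    best_w, best_mse = None, np.inf
    for i, j in itertools.product(range(n_steps + 1), repeat=2):
        if i + j > n_steps:
            continue  # third weight would be negative
        w = np.array([i, j, n_steps - i - j]) / n_steps
        mse = np.mean((preds @ w - y) ** 2)
        if mse < best_mse:
            best_w, best_mse = w, mse
    return best_w, best_mse


# Synthetic out-of-fold predictions (not real model outputs)
rng = np.random.default_rng(0)
n = 300
y = rng.random(n)
is_mixture = rng.random(n) < 0.5
noise_gp = np.where(is_mixture, 0.20, 0.02)
noise_lgbm = np.where(is_mixture, 0.02, 0.20)
preds = np.column_stack([
    y + noise_gp * rng.normal(size=n),     # "GP"
    y + 0.10 * rng.normal(size=n),         # "MLP"
    y + noise_lgbm * rng.normal(size=n),   # "LGBM"
])

w_single, mse_single = best_simplex_weights(preds[~is_mixture], y[~is_mixture])
w_mixture, mse_mixture = best_simplex_weights(preds[is_mixture], y[is_mixture])
print(f"single  weights (GP, MLP, LGBM) = {w_single}, MSE = {mse_single:.5f}")
print(f"mixture weights (GP, MLP, LGBM) = {w_mixture}, MSE = {mse_mixture:.5f}")
```

Weights tuned on the same folds used for scoring can overfit, so a nested or held-out split would give a safer estimate of any gain.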

### What NOT to Try
- Stronger regularization (already tested, 22% worse)
- Post-hoc calibration (can't use in submission)
- GNN/ChemBERTa (already failed)
- Minimal features (already failed)