# Loop 62 Analysis: Importance-Weighted Training Failed

## Key Findings
1. Importance-weighted training (exp_063) reached CV 0.010426, which is 27.24% worse than the best CV of 0.008194.
2. Importance weighting based on adversarial validation did not help.
3. This indicates the CV-LB gap is not explained by simple covariate shift alone.
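
For readers unfamiliar with the technique in finding 2: adversarial-validation importance weighting typically trains a classifier to separate training rows from test rows, then weights each training row by the odds that it looks like test data. The sketch below shows only the weighting step, assuming the classifier probabilities are already available. `importance_weights`, `p_test`, and the clipping bounds are illustrative names and values, not taken from exp_063.

```python
import numpy as np

def importance_weights(p_test, clip=(0.1, 10.0)):
    """Turn P(row looks like test data) into normalized importance weights."""
    # Keep probabilities away from 0 and 1 so the odds stay finite
    p = np.clip(np.asarray(p_test, dtype=float), 1e-6, 1 - 1e-6)
    w = p / (1.0 - p)          # odds, proportional to the test/train density ratio
    w = np.clip(w, *clip)      # clip extreme weights to limit variance (assumed bounds)
    return w / w.mean()        # normalize so the average weight is 1

# Hypothetical classifier outputs for three training rows
weights = importance_weights([0.2, 0.5, 0.8])
print(weights)
```

Rows that the classifier finds more test-like receive larger weights in the training loss; clipping is a common guard against a few rows dominating.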

## Strategic Analysis
- The CV-LB relationship: LB = 4.22×CV + 0.0534 (R²=0.98)
- Intercept (0.0534) > Target (0.0347) by 53.9%
- 63 experiments tried; all 13 submitted experiments follow the same CV-LB relationship
- Only 3 submissions remaining

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Submission history
submissions = [
    ('exp_000', 0.011081, 0.0982),
    ('exp_001', 0.012297, 0.1065),
    ('exp_003', 0.010501, 0.0972),
    ('exp_005', 0.010430, 0.0969),
    ('exp_006', 0.009749, 0.0946),
    ('exp_007', 0.009262, 0.0932),
    ('exp_009', 0.009192, 0.0936),
    ('exp_012', 0.009004, 0.0913),
    ('exp_024', 0.008689, 0.0893),
    ('exp_026', 0.008465, 0.0887),
    ('exp_030', 0.008298, 0.0877),
    ('exp_041', 0.009002, 0.0932),
    ('exp_042', 0.014503, 0.1147),
]

df = pd.DataFrame(submissions, columns=['exp', 'cv', 'lb'])
df['ratio'] = df['lb'] / df['cv']
df['residual'] = df['lb'] - (4.22 * df['cv'] + 0.0534)

print('=== SUBMISSION ANALYSIS ===')
print(df.to_string())
print(f'\nBest CV: {df["cv"].min():.6f} ({df.loc[df["cv"].idxmin(), "exp"]})')
print(f'Best LB: {df["lb"].min():.4f} ({df.loc[df["lb"].idxmin(), "exp"]})')
print(f'Target: 0.0347')
print(f'Gap to target: {(df["lb"].min() - 0.0347) / 0.0347 * 100:.1f}%')

=== SUBMISSION ANALYSIS ===
        exp        cv      lb      ratio  residual
0   exp_000  0.011081  0.0982   8.862016 -0.001962
1   exp_001  0.012297  0.1065   8.660649  0.001207
2   exp_003  0.010501  0.0972   9.256261 -0.000514
3   exp_005  0.010430  0.0969   9.290508 -0.000515
4   exp_006  0.009749  0.0946   9.703559  0.000059
5   exp_007  0.009262  0.0932  10.062621  0.000714
6   exp_009  0.009192  0.0936  10.182768  0.001410
7   exp_012  0.009004  0.0913  10.139938 -0.000097
8   exp_024  0.008689  0.0893  10.277362 -0.000768
9   exp_026  0.008465  0.0887  10.478441 -0.000422
10  exp_030  0.008298  0.0877  10.568812 -0.000718
11  exp_041  0.009002  0.0932  10.353255  0.001812
12  exp_042  0.014503  0.1147   7.908709  0.000097

Best CV: 0.008298 (exp_030)
Best LB: 0.0877 (exp_030)
Target: 0.0347
Gap to target: 152.7%


In [2]:
# CV-LB relationship analysis
from scipy import stats

cv = df['cv'].values
lb = df['lb'].values

slope, intercept, r_value, p_value, std_err = stats.linregress(cv, lb)

print(f'\n=== CV-LB RELATIONSHIP ===')
print(f'LB = {slope:.2f} × CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'\nIntercept: {intercept:.4f}')
print(f'Target: 0.0347')
print(f'Intercept > Target: {intercept > 0.0347}')
print(f'\nRequired CV to hit target: {(0.0347 - intercept) / slope:.6f}')
print(f'This is NEGATIVE, meaning target is UNREACHABLE with current approach')


=== CV-LB RELATIONSHIP ===
LB = 4.22 × CV + 0.0534
R² = 0.9810

Intercept: 0.0534
Target: 0.0347
Intercept > Target: True

Required CV to hit target: -0.004429
This is NEGATIVE, meaning target is UNREACHABLE with current approach
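
Because the argument rests on the intercept, it may be worth checking how tightly 13 points constrain it. The following is a rough sketch, not part of the original analysis. It refits the same line with ordinary least squares in NumPy and derives a standard error and an approximate 95% interval for the intercept. The critical value 2.201 is the two-sided Student t value for 11 degrees of freedom. The usual assumptions apply: independent errors with constant variance, and a linear relationship that holds all the way down to CV = 0, which is well outside the observed CV range.

```python
import numpy as np

# CV and LB scores copied from the submission history above
cv = np.array([0.011081, 0.012297, 0.010501, 0.010430, 0.009749, 0.009262,
               0.009192, 0.009004, 0.008689, 0.008465, 0.008298, 0.009002,
               0.014503])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932, 0.0936,
               0.0913, 0.0893, 0.0887, 0.0877, 0.0932, 0.1147])

n = len(cv)
X = np.column_stack([cv, np.ones(n)])            # design matrix: [CV, 1]
beta, *_ = np.linalg.lstsq(X, lb, rcond=None)    # ordinary least squares
slope, intercept = beta

resid = lb - X @ beta
sigma2 = resid @ resid / (n - 2)                 # unbiased residual variance
cov = sigma2 * np.linalg.inv(X.T @ X)            # covariance of [slope, intercept]
intercept_se = np.sqrt(cov[1, 1])

t_crit = 2.201                                   # t(0.975, df=11)
lo = intercept - t_crit * intercept_se
hi = intercept + t_crit * intercept_se
print(f'intercept = {intercept:.4f} (SE {intercept_se:.4f})')
print(f'approx. 95% interval: [{lo:.4f}, {hi:.4f}]')
print(f'target 0.0347 inside interval: {lo <= 0.0347 <= hi}')
```

If the lower bound sits above 0.0347, the "intercept > target" conclusion does not hinge on sampling noise in these 13 points, although the extrapolation caveat remains.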


In [3]:
# Analyze experiments that generalized better than expected
print('\n=== EXPERIMENTS WITH NEGATIVE RESIDUALS (BETTER THAN EXPECTED) ===')
better = df[df['residual'] < 0].sort_values('residual')
print(better.to_string())

print('\n=== EXPERIMENTS WITH POSITIVE RESIDUALS (WORSE THAN EXPECTED) ===')
worse = df[df['residual'] > 0].sort_values('residual', ascending=False)
print(worse.to_string())


=== EXPERIMENTS WITH NEGATIVE RESIDUALS (BETTER THAN EXPECTED) ===
        exp        cv      lb      ratio  residual
0   exp_000  0.011081  0.0982   8.862016 -0.001962
8   exp_024  0.008689  0.0893  10.277362 -0.000768
10  exp_030  0.008298  0.0877  10.568812 -0.000718
3   exp_005  0.010430  0.0969   9.290508 -0.000515
2   exp_003  0.010501  0.0972   9.256261 -0.000514
9   exp_026  0.008465  0.0887  10.478441 -0.000422
7   exp_012  0.009004  0.0913  10.139938 -0.000097

=== EXPERIMENTS WITH POSITIVE RESIDUALS (WORSE THAN EXPECTED) ===
        exp        cv      lb      ratio  residual
11  exp_041  0.009002  0.0932  10.353255  0.001812
6   exp_009  0.009192  0.0936  10.182768  0.001410
1   exp_001  0.012297  0.1065   8.660649  0.001207
5   exp_007  0.009262  0.0932  10.062621  0.000714
12  exp_042  0.014503  0.1147   7.908709  0.000097
4   exp_006  0.009749  0.0946   9.703559  0.000059
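
Before reading too much into which experiments land above or below the line, it helps to compare each residual with the overall scatter. A minimal sketch, using the residual column printed above and assuming a two-parameter fit (11 degrees of freedom). The 2-sigma threshold is a conventional rule of thumb, not something the original analysis specifies.

```python
import numpy as np

# Experiment names and residuals copied from the table above
exps = ['exp_000', 'exp_001', 'exp_003', 'exp_005', 'exp_006', 'exp_007',
        'exp_009', 'exp_012', 'exp_024', 'exp_026', 'exp_030', 'exp_041',
        'exp_042']
residuals = np.array([-0.001962, 0.001207, -0.000514, -0.000515, 0.000059,
                      0.000714, 0.001410, -0.000097, -0.000768, -0.000422,
                      -0.000718, 0.001812, 0.000097])

dof = len(residuals) - 2                       # slope and intercept were fitted
sigma = np.sqrt(np.sum(residuals**2) / dof)    # residual standard deviation
z = residuals / sigma                          # standardized residuals

for name, zi in sorted(zip(exps, z), key=lambda pair: pair[1]):
    flag = '  <-- beyond 2 sigma' if abs(zi) > 2 else ''
    print(f'{name}: z = {zi:+.2f}{flag}')
```

With these numbers no experiment exceeds the 2-sigma threshold, so the sign of a residual alone is weak evidence that an approach generalizes better or worse than the rest.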


In [4]:
# What would we need to achieve the target?
print('\n=== WHAT WOULD WE NEED TO HIT TARGET? ===')
target = 0.0347

# Option 1: Reduce intercept
print(f'\nOption 1: Reduce intercept to 0.0347 (currently {intercept:.4f})')
print(f'  Required reduction: {(intercept - target) / intercept * 100:.1f}%')

# Option 2: Reduce slope
print(f'\nOption 2: Reduce slope (currently {slope:.2f})')
print(f'  With best CV ({df["cv"].min():.6f}), need slope = {(target - intercept) / df["cv"].min():.2f}')
print(f'  This is NEGATIVE, so slope reduction alone cannot help')

# Option 3: Both
print(f'\nOption 3: Find approach with different CV-LB relationship')
print(f'  Need intercept < {target:.4f} OR slope < 0')
print(f'  Current: LB = {slope:.2f}×CV + {intercept:.4f}')
print(f'  Target: LB = 0.0347 when CV = 0.008194 (best CV)')
print(f'  Required: LB = 0.0347 = a×0.008194 + b')
print(f'  If a = 4.22, then b = 0.0347 - 4.22×0.008194 = {0.0347 - 4.22*0.008194:.4f}')
print(f'  This is barely above ZERO (current intercept: {intercept:.4f}), so we need a fundamentally different approach')


=== WHAT WOULD WE NEED TO HIT TARGET? ===

Option 1: Reduce intercept to 0.0347 (currently 0.0534)
  Required reduction: 35.0%

Option 2: Reduce slope (currently 4.22)
  With best CV (0.008298), need slope = -2.25
  This is NEGATIVE, so slope reduction alone cannot help

Option 3: Find approach with different CV-LB relationship
  Need intercept < 0.0347 OR slope < 0
  Current: LB = 4.22×CV + 0.0534
  Target: LB = 0.0347 when CV = 0.008194 (best CV)
  Required: LB = 0.0347 = a×0.008194 + b
  If a = 4.22, then b = 0.0347 - 4.22×0.008194 = 0.0001
  This is barely above ZERO (current intercept: 0.0534), so we need a fundamentally different approach
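
The Option 3 arithmetic can be written out explicitly. Holding the slope at the fitted value a = 4.22 and asking for LB = 0.0347 at CV = 0.008194, the required intercept b is:

```latex
\begin{aligned}
b &= 0.0347 - 4.22 \times 0.008194 \\
  &= 0.0347 - 0.034579 \\
  &\approx 0.00012
\end{aligned}
```

That is a small positive value, more than two orders of magnitude below the fitted intercept of 0.0534.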


In [5]:
# Analyze what approaches have been tried
approaches_tried = [
    ('MLP baseline', 'exp_000', 0.011081, 'BASELINE'),
    ('LightGBM', 'exp_001', 0.012297, 'WORSE'),
    ('DRFP features', 'exp_002', 0.016948, 'WORSE'),
    ('Combined features', 'exp_003', 0.010501, 'BETTER'),
    ('Deep residual', 'exp_004', 0.051912, 'FAILED'),
    ('Large ensemble', 'exp_005', 0.010430, 'BETTER'),
    ('Simpler model', 'exp_006', 0.009749, 'BETTER'),
    ('GP ensemble', 'exp_030', 0.008298, 'BEST'),
    ('GNN', 'exp_051', 0.014080, 'WORSE'),
    ('ChemBERTa', 'exp_052', 0.019400, 'WORSE'),
    ('Per-target', 'exp_053', 0.009946, 'WORSE'),
    ('Per-solvent-type', 'exp_054', 0.019519, 'WORSE'),
    ('Hyperparameter opt', 'exp_055', 0.012658, 'WORSE'),
    ('Advanced GNN', 'exp_056', 0.030013, 'WORSE'),
    ('Multi-seed ensemble', 'exp_057', 0.009449, 'WORSE'),
    ('Per-target weights', 'exp_058', 0.008701, 'CLOSE'),
    ('Physical constraints', 'exp_059', 0.009622, 'WORSE'),
    ('Spange only', 'exp_060', 0.011266, 'WORSE'),
    ('TabNet', 'exp_061', 0.036660, 'FAILED'),
    ('CQR', 'exp_062', 0.009899, 'WORSE'),
    ('Importance weighted', 'exp_063', 0.010426, 'WORSE'),
]

print('\n=== APPROACHES TRIED (63 experiments) ===')
for name, exp, cv, status in approaches_tried:
    print(f'{status:8s} | {exp}: CV={cv:.6f} | {name}')


=== APPROACHES TRIED (63 experiments) ===
BASELINE | exp_000: CV=0.011081 | MLP baseline
WORSE    | exp_001: CV=0.012297 | LightGBM
WORSE    | exp_002: CV=0.016948 | DRFP features
BETTER   | exp_003: CV=0.010501 | Combined features
FAILED   | exp_004: CV=0.051912 | Deep residual
BETTER   | exp_005: CV=0.010430 | Large ensemble
BETTER   | exp_006: CV=0.009749 | Simpler model
BEST     | exp_030: CV=0.008298 | GP ensemble
WORSE    | exp_051: CV=0.014080 | GNN
WORSE    | exp_052: CV=0.019400 | ChemBERTa
WORSE    | exp_053: CV=0.009946 | Per-target
WORSE    | exp_054: CV=0.019519 | Per-solvent-type
WORSE    | exp_055: CV=0.012658 | Hyperparameter opt
WORSE    | exp_056: CV=0.030013 | Advanced GNN
WORSE    | exp_057: CV=0.009449 | Multi-seed ensemble
CLOSE    | exp_058: CV=0.008701 | Per-target weights
WORSE    | exp_059: CV=0.009622 | Physical constraints
WORSE    | exp_060: CV=0.011266 | Spange only
FAILED   | exp_061: CV=0.036660 | TabNet
WORSE    | exp_062: CV=0.009899 | CQR
WORSE    | exp_063: CV=0.010426 | Importance weighted

In [6]:
# What approaches haven't been tried?
print('\n=== APPROACHES NOT YET TRIED ===')
print('''
1. DOMAIN-SPECIFIC APPROACHES:
   - Reaction mechanism-based features (transition state energies)
   - Solvent-solute interaction energies
   - Marcus theory-based features for electron transfer
   - Hammett sigma values for substituent effects

2. ENSEMBLE DIVERSITY:
   - Stacking with meta-learner (tried but may need different base models)
   - Blending with out-of-fold predictions
   - Negative correlation learning

3. DATA AUGMENTATION:
   - Synthetic data generation for unseen solvents
   - Interpolation between known solvents
   - Physics-based simulation data

4. TRANSFER LEARNING:
   - Pre-trained molecular representations (beyond ChemBERTa)
   - Multi-task learning with related reactions
   - Domain adaptation from similar reactions

5. UNCERTAINTY QUANTIFICATION:
   - Bayesian neural networks
   - Deep ensembles with uncertainty
   - Conformal prediction (tried CQR, but not full conformal)

6. REGULARIZATION:
   - Mixup augmentation
   - Label smoothing
   - Spectral normalization
''')


=== APPROACHES NOT YET TRIED ===

1. DOMAIN-SPECIFIC APPROACHES:
   - Reaction mechanism-based features (transition state energies)
   - Solvent-solute interaction energies
   - Marcus theory-based features for electron transfer
   - Hammett sigma values for substituent effects

2. ENSEMBLE DIVERSITY:
   - Stacking with meta-learner (tried but may need different base models)
   - Blending with out-of-fold predictions
   - Negative correlation learning

3. DATA AUGMENTATION:
   - Synthetic data generation for unseen solvents
   - Interpolation between known solvents
   - Physics-based simulation data

4. TRANSFER LEARNING:
   - Pre-trained molecular representations (beyond ChemBERTa)
   - Multi-task learning with related reactions
   - Domain adaptation from similar reactions

5. UNCERTAINTY QUANTIFICATION:
   - Bayesian neural networks
   - Deep ensembles with uncertainty
   - Conformal prediction (tried CQR, but not full conformal)

6. REGULARIZATION:
   - Mixup augmentation
   - Label smoothing
   - Spectral normalization

In [7]:
# Key insight: The problem is EXTRAPOLATION, not INTERPOLATION
print('\n=== KEY INSIGHT: EXTRAPOLATION VS INTERPOLATION ===')
print('''
The Leave-One-Solvent-Out CV tests EXTRAPOLATION to new chemical entities.
This is fundamentally harder than interpolation within known entities.

The CV-LB gap is NOT due to:
- Covariate shift (importance weighting didn't help)
- CV procedure (GroupKFold didn't help)
- Loss function (Huber, MSE, Quantile all similar)
- Model complexity (simpler and complex models similar)

The CV-LB gap IS due to:
- Extrapolation to unseen solvents
- The test set contains solvents that are chemically different from training
- The model cannot generalize to truly novel chemical entities

POSSIBLE SOLUTIONS:
1. Find features that capture chemical similarity better
2. Use transfer learning from larger molecular datasets
3. Use physics-based constraints that generalize
4. Accept that some extrapolation error is irreducible
''')


=== KEY INSIGHT: EXTRAPOLATION VS INTERPOLATION ===

The Leave-One-Solvent-Out CV tests EXTRAPOLATION to new chemical entities.
This is fundamentally harder than interpolation within known entities.

The CV-LB gap is NOT due to:
- Covariate shift (importance weighting didn't help)
- CV procedure (GroupKFold didn't help)
- Loss function (Huber, MSE, Quantile all similar)
- Model complexity (simpler and complex models similar)

The CV-LB gap IS due to:
- Extrapolation to unseen solvents
- The test set contains solvents that are chemically different from training
- The model cannot generalize to truly novel chemical entities

POSSIBLE SOLUTIONS:
1. Find features that capture chemical similarity better
2. Use transfer learning from larger molecular datasets
3. Use physics-based constraints that generalize
4. Accept that some extrapolation error is irreducible
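
To make the extrapolation point concrete, here is a minimal sketch of a leave-one-solvent-out split written with NumPy only. The solvent labels and the helper name `leave_one_solvent_out` are hypothetical; the actual CV code is not shown in this analysis.

```python
import numpy as np

def leave_one_solvent_out(solvents):
    """Yield (held_out, train_idx, test_idx), holding out one solvent per fold."""
    solvents = np.asarray(solvents)
    for held_out in np.unique(solvents):
        test_mask = solvents == held_out
        # Train on every row from other solvents, test on the held-out solvent
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy labels for illustration only
solvents = ['water', 'water', 'ethanol', 'dmso', 'ethanol', 'dmso']
folds = list(leave_one_solvent_out(solvents))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Every validation fold scores the model on a solvent it has never seen, which is why the score measures extrapolation rather than interpolation.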



In [8]:
# Final recommendation
print('\n=== FINAL RECOMMENDATION ===')
print('''
With only 3 submissions remaining and best LB at 0.0877 (target 0.0347):

1. DO NOT submit exp_063 (importance weighted) - CV is 27% worse

2. FOCUS on approaches that could CHANGE the CV-LB relationship:
   - The current relationship has intercept > target
   - We need to find an approach with lower intercept
   - This likely requires fundamentally different features or model

3. CONSIDER:
   - Bayesian optimization of ensemble weights
   - Physics-informed neural networks
   - Domain adaptation techniques
   - Uncertainty-weighted predictions

4. SAVE submissions for:
   - A fundamentally different approach (if found)
   - Final ensemble of best models
   - Last-ditch attempt with best CV model

The target (0.0347) IS reachable, but requires finding an approach
that breaks the current CV-LB relationship.
''')


=== FINAL RECOMMENDATION ===

With only 3 submissions remaining and best LB at 0.0877 (target 0.0347):

1. DO NOT submit exp_063 (importance weighted) - CV is 27% worse

2. FOCUS on approaches that could CHANGE the CV-LB relationship:
   - The current relationship has intercept > target
   - We need to find an approach with lower intercept
   - This likely requires fundamentally different features or model

3. CONSIDER:
   - Bayesian optimization of ensemble weights
   - Physics-informed neural networks
   - Domain adaptation techniques
   - Uncertainty-weighted predictions

4. SAVE submissions for:
   - A fundamentally different approach (if found)
   - Final ensemble of best models
   - Last-ditch attempt with best CV model

The target (0.0347) IS reachable, but requires finding an approach
that breaks the current CV-LB relationship.
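
Recommendation 1 can be checked against the fitted line. The sketch below plugs the exp_063 CV into the relationship LB = 4.22×CV + 0.0534 reported above. `predicted_lb` is a helper name introduced here, and the estimate assumes exp_063 would follow the same line as the earlier submissions.

```python
SLOPE, INTERCEPT = 4.22, 0.0534   # fitted CV-LB relationship reported above
BEST_LB = 0.0877                  # best leaderboard score so far (exp_030)

def predicted_lb(cv):
    """Estimate the LB score from a CV score using the fitted line."""
    return SLOPE * cv + INTERCEPT

pred_063 = predicted_lb(0.010426)  # CV of exp_063
print(f'Predicted LB for exp_063: {pred_063:.4f}')
print(f'Better than best LB ({BEST_LB}): {pred_063 < BEST_LB}')
```

A predicted LB of about 0.0974 is well above the current best of 0.0877, which supports keeping that submission in reserve.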



In [None]:
# Visualize the CV-LB relationship
plt.figure(figsize=(10, 6))
plt.scatter(cv, lb, s=100, alpha=0.7)

# Regression line
cv_range = np.linspace(0, 0.016, 100)
lb_pred = slope * cv_range + intercept
plt.plot(cv_range, lb_pred, 'r--', label=f'LB = {slope:.2f}×CV + {intercept:.4f} (R²={r_value**2:.2f})')

# Target line
plt.axhline(y=0.0347, color='g', linestyle=':', label='Target (0.0347)')

# Intercept
plt.axhline(y=intercept, color='orange', linestyle=':', alpha=0.5, label=f'Intercept ({intercept:.4f})')

# Labels
for i, row in df.iterrows():
    plt.annotate(row['exp'], (row['cv'], row['lb']), fontsize=8, alpha=0.7)

plt.xlabel('CV Score')
plt.ylabel('LB Score')
plt.title('CV-LB Relationship: Intercept > Target')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('/home/code/exploration/loop62_cv_lb.png', dpi=150)
plt.show()

print('\nPlot saved to /home/code/exploration/loop62_cv_lb.png')