# Loop 46 Analysis: Post Mean Reversion - What's Next?

## Key Findings from Experiment 045:
- Mean reversion HURTS CV performance (alpha=1.0 is best)
- The predictions are NOT systematically biased away from the mean
- The CV-LB intercept problem is NOT due to prediction bias

## Evaluator's Key Insight:
The evaluator pointed out that public kernels use **GroupKFold (5 splits)** instead of leave-one-solvent-out CV (**Leave-One-Out**, 24 folds).

If the two schemes score very differently, that could account for part of the CV-LB gap. The cells below test this.

In [1]:
# Load submission history and analyze CV-LB relationship
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
    {'exp': 'exp_035', 'cv': 0.0098, 'lb': 0.0970},
]

df = pd.DataFrame(submissions)
print('Submission history:')
print(df)
print()

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df['cv'], df['lb'])
print(f'CV-LB Relationship: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'Intercept = {intercept:.4f}')
print(f'Target = 0.0347')
print()
print(f'CRITICAL: Intercept ({intercept:.4f}) > Target (0.0347)')
print(f'This means even CV=0 would give LB={intercept:.4f}, which is ABOVE target!')

Submission history:
        exp      cv      lb
0   exp_000  0.0111  0.0982
1   exp_001  0.0123  0.1065
2   exp_003  0.0105  0.0972
3   exp_005  0.0104  0.0969
4   exp_006  0.0097  0.0946
5   exp_007  0.0093  0.0932
6   exp_009  0.0092  0.0936
7   exp_012  0.0090  0.0913
8   exp_024  0.0087  0.0893
9   exp_026  0.0085  0.0887
10  exp_030  0.0083  0.0877
11  exp_035  0.0098  0.0970

CV-LB Relationship: LB = 4.31 * CV + 0.0525
R² = 0.9505
Intercept = 0.0525
Target = 0.0347

CRITICAL: Intercept (0.0525) > Target (0.0347)
This means even CV=0 would give LB=0.0525, which is ABOVE target!
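
The intercept is an extrapolation to CV=0, well outside the observed CV range, so it helps to look at its uncertainty before drawing conclusions. The sketch below refits the same twelve (CV, LB) pairs and derives a 95% interval from `intercept_stderr`; it assumes SciPy 1.6 or newer and that the usual linear-regression error assumptions hold, which twelve points cannot really confirm.

```python
import numpy as np
from scipy import stats

# CV and LB scores copied from the submission history above
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083, 0.0098])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877, 0.0970])

fit = stats.linregress(cv, lb)

# Two-sided 95% interval for the intercept from its standard error
dof = len(cv) - 2
t_crit = stats.t.ppf(0.975, dof)
ci_low = fit.intercept - t_crit * fit.intercept_stderr
ci_high = fit.intercept + t_crit * fit.intercept_stderr

target = 0.0347
print(f'Intercept = {fit.intercept:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]')
print(f'Target inside interval: {ci_low <= target <= ci_high}')
print(f'Observed CV range: [{cv.min():.4f}, {cv.max():.4f}] (CV=0 is an extrapolation)')
```

If the target falls below the whole interval, the "intercept above target" conclusion does not hinge on the point estimate alone.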


In [2]:
# Calculate what CV would be needed to hit target
target = 0.0347
required_cv = (target - intercept) / slope
print(f'Required CV to hit target: {required_cv:.6f}')
print(f'Current best CV: 0.008298')
print()
if required_cv < 0:
    print('IMPOSSIBLE: Required CV is negative!')
    print('The current CV-LB relationship CANNOT reach the target.')
    print()
    print('We need to CHANGE the relationship, not just improve CV.')
else:
    print(f'Gap: {(0.008298 - required_cv) / 0.008298 * 100:.1f}%')

Required CV to hit target: -0.004130
Current best CV: 0.008298

IMPOSSIBLE: Required CV is negative!
The current CV-LB relationship CANNOT reach the target.

We need to CHANGE the relationship, not just improve CV.


In [3]:
# Analyze the CV-LB gap for each submission
print('CV-LB Gap Analysis:')
print('-' * 60)
for _, row in df.iterrows():
    gap = row['lb'] - row['cv']
    ratio = row['lb'] / row['cv']
    print(f"{row['exp']}: CV={row['cv']:.4f}, LB={row['lb']:.4f}, Gap={gap:.4f}, Ratio={ratio:.1f}x")

print()
# Note: this is the ratio of the means, not the mean of the per-submission ratios above
print('Average ratio:', df['lb'].mean() / df['cv'].mean())
print('This means LB is ~10x worse than CV on average')

CV-LB Gap Analysis:
------------------------------------------------------------
exp_000: CV=0.0111, LB=0.0982, Gap=0.0871, Ratio=8.8x
exp_001: CV=0.0123, LB=0.1065, Gap=0.0942, Ratio=8.7x
exp_003: CV=0.0105, LB=0.0972, Gap=0.0867, Ratio=9.3x
exp_005: CV=0.0104, LB=0.0969, Gap=0.0865, Ratio=9.3x
exp_006: CV=0.0097, LB=0.0946, Gap=0.0849, Ratio=9.8x
exp_007: CV=0.0093, LB=0.0932, Gap=0.0839, Ratio=10.0x
exp_009: CV=0.0092, LB=0.0936, Gap=0.0844, Ratio=10.2x
exp_012: CV=0.0090, LB=0.0913, Gap=0.0823, Ratio=10.1x
exp_024: CV=0.0087, LB=0.0893, Gap=0.0806, Ratio=10.3x
exp_026: CV=0.0085, LB=0.0887, Gap=0.0802, Ratio=10.4x
exp_030: CV=0.0083, LB=0.0877, Gap=0.0794, Ratio=10.6x
exp_035: CV=0.0098, LB=0.0970, Gap=0.0872, Ratio=9.9x

Average ratio: 9.710616438356164
This means LB is ~10x worse than CV on average


In [4]:
# Key insight: The evaluator mentioned that public kernels use GroupKFold
# Let me check if this could explain the gap

print('=== EVALUATOR\'S KEY INSIGHT ===')
print()
print('Public kernels use GroupKFold (5 splits) instead of Leave-One-Out (24 folds)!')
print()
print('Why this matters:')
print('1. Leave-one-out CV (24 folds) is EXTREMELY pessimistic')
print('   - Each fold tests on a completely unseen solvent')
print('   - This is the hardest possible generalization task')
print()
print('2. GroupKFold (5 splits) is LESS pessimistic')
print('   - Each fold tests on ~5 solvents at once')
print('   - Some solvents in test set may be similar to training solvents')
print()
print('3. The LB evaluation may use a different scheme than our local CV')
print('   - If LB uses something like GroupKFold, our CV may be more pessimistic than the LB')
print('   - This could explain part of the CV-LB gap (measured in the cells below)')
print()
print('4. The model that\'s best under GroupKFold may be DIFFERENT from')
print('   the model that\'s best under leave-one-out')
print('   - We may be optimizing for the wrong metric!')

=== EVALUATOR'S KEY INSIGHT ===

Public kernels use GroupKFold (5 splits) instead of Leave-One-Out (24 folds)!

Why this matters:
1. Leave-one-out CV (24 folds) is EXTREMELY pessimistic
   - Each fold tests on a completely unseen solvent
   - This is the hardest possible generalization task

2. GroupKFold (5 splits) is LESS pessimistic
   - Each fold tests on ~5 solvents at once
   - Some solvents in test set may be similar to training solvents

3. The LB evaluation may use a different scheme than our local CV
   - If LB uses something like GroupKFold, our CV may be more pessimistic than the LB
   - This could explain part of the CV-LB gap (measured in the cells below)

4. The model that's best under GroupKFold may be DIFFERENT from
   the model that's best under leave-one-out
   - We may be optimizing for the wrong metric!


In [5]:
# Load data and test GroupKFold vs Leave-One-Out
import sys
sys.path.insert(0, '/home/code')

DATA_PATH = '/home/data'
df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
X_single = df_single[['Residence Time', 'Temperature', 'SOLVENT NAME']]
Y_single = df_single[['SM', 'Product 2', 'Product 3']]

print(f'Single solvent data: {len(df_single)} samples')
print(f'Number of unique solvents: {X_single["SOLVENT NAME"].nunique()}')
print()
print('Solvents:', sorted(X_single['SOLVENT NAME'].unique()))

Single solvent data: 656 samples
Number of unique solvents: 24

Solvents: ['1,1,1,3,3,3-Hexafluoropropan-2-ol', '2,2,2-Trifluoroethanol', '2-Methyltetrahydrofuran [2-MeTHF]', 'Acetonitrile', 'Acetonitrile.Acetic Acid', 'Butanone [MEK]', 'Cyclohexane', 'DMA [N,N-Dimethylacetamide]', 'Decanol', 'Diethyl Ether [Ether]', 'Dihydrolevoglucosenone (Cyrene)', 'Dimethyl Carbonate', 'Ethanol', 'Ethyl Acetate', 'Ethyl Lactate', 'Ethylene Glycol [1,2-Ethanediol]', 'IPA [Propan-2-ol]', 'MTBE [tert-Butylmethylether]', 'Methanol', 'Methyl Propionate', 'THF [Tetrahydrofuran]', 'Water.2,2,2-Trifluoroethanol', 'Water.Acetonitrile', 'tert-Butanol [2-Methylpropan-2-ol]']


In [6]:
# Compare Leave-One-Out vs GroupKFold
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

groups = X_single['SOLVENT NAME']

# Leave-One-Out (current approach)
logo = LeaveOneGroupOut()
n_loo_folds = logo.get_n_splits(X_single, Y_single, groups)
print(f'Leave-One-Out: {n_loo_folds} folds')
print('  - Each fold tests on 1 solvent (~27 samples)')
print('  - Training on 23 solvents (~629 samples)')
print()

# GroupKFold (5 splits)
gkf = GroupKFold(n_splits=5)
n_gkf_folds = gkf.get_n_splits(X_single, Y_single, groups)
print(f'GroupKFold (5 splits): {n_gkf_folds} folds')
print('  - Each fold tests on ~5 solvents (~131 samples)')
print('  - Training on ~19 solvents (~525 samples)')
print()

# GroupKFold (3 splits)
gkf3 = GroupKFold(n_splits=3)
n_gkf3_folds = gkf3.get_n_splits(X_single, Y_single, groups)
print(f'GroupKFold (3 splits): {n_gkf3_folds} folds')
print('  - Each fold tests on ~8 solvents (~219 samples)')
print('  - Training on ~16 solvents (~437 samples)')

Leave-One-Out: 24 folds
  - Each fold tests on 1 solvent (~27 samples)
  - Training on 23 solvents (~629 samples)

GroupKFold (5 splits): 5 folds
  - Each fold tests on ~5 solvents (~131 samples)
  - Training on ~19 solvents (~525 samples)

GroupKFold (3 splits): 3 folds
  - Each fold tests on ~8 solvents (~219 samples)
  - Training on ~16 solvents (~437 samples)
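
The per-fold sample counts printed above are hard-coded estimates. A minimal sketch of how they could be measured from the splitters themselves is shown here; the `groups` array is a hypothetical stand-in with 24 groups of random size, not the real `SOLVENT NAME` column, and `fold_summary` is an illustrative helper.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Hypothetical stand-in for the 'SOLVENT NAME' column: 24 groups of unequal size
rng = np.random.default_rng(0)
sizes = rng.integers(5, 60, size=24)
groups = np.repeat(np.arange(24), sizes)
X = np.zeros((len(groups), 1))  # feature values are irrelevant for splitting

def fold_summary(splitter, X, groups):
    """Return (n_test_groups, n_test_rows) for every fold of a group splitter."""
    rows = []
    for _, test_idx in splitter.split(X, groups=groups):
        rows.append((len(np.unique(groups[test_idx])), len(test_idx)))
    return rows

logo_summary = fold_summary(LeaveOneGroupOut(), X, groups)
gkf5_summary = fold_summary(GroupKFold(n_splits=5), X, groups)

print('LeaveOneGroupOut folds:', len(logo_summary))
print('GroupKFold(5) (test groups, test rows) per fold:', gkf5_summary)
```

Passing the real `X_single` and `groups` from the cell above would replace the approximate counts with exact ones.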


In [7]:
# Test a simple model under both CV schemes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Load Spange descriptors
spange_df = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv', index_col=0)
SPANGE_COLS = [c for c in spange_df.columns if c != 'solvent smiles']

def get_simple_features(X):
    features = []
    for _, row in X.iterrows():
        time_m = row['Residence Time']
        temp_c = row['Temperature']
        temp_k = temp_c + 273.15
        
        kinetics = np.array([time_m, temp_c, 1.0/temp_k, np.log(time_m+1), time_m/temp_k])
        
        solvent = row['SOLVENT NAME']
        spange = spange_df.loc[solvent, SPANGE_COLS].values if solvent in spange_df.index else np.zeros(len(SPANGE_COLS))
        
        features.append(np.concatenate([kinetics, spange]))
    return np.array(features)

X_feat = get_simple_features(X_single)
y = Y_single.values

print(f'Features shape: {X_feat.shape}')
print(f'Targets shape: {y.shape}')

Features shape: (656, 18)
Targets shape: (656, 3)


In [8]:
# Test Leave-One-Out CV
from sklearn.model_selection import cross_val_predict

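# Note: the scaler is fit on all rows before splitting, so held-out solvents
# contribute to the scaling statistics used in every fold
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_feat)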

# Leave-One-Out
logo = LeaveOneGroupOut()
ridge = Ridge(alpha=1.0)

loo_preds = cross_val_predict(ridge, X_scaled, y, cv=logo, groups=groups)
loo_mse = np.mean((y - loo_preds)**2)
print(f'Leave-One-Out CV MSE: {loo_mse:.6f}')

# GroupKFold (5 splits)
gkf = GroupKFold(n_splits=5)
gkf_preds = cross_val_predict(ridge, X_scaled, y, cv=gkf, groups=groups)
gkf_mse = np.mean((y - gkf_preds)**2)
print(f'GroupKFold (5) CV MSE: {gkf_mse:.6f}')

# GroupKFold (3 splits)
gkf3 = GroupKFold(n_splits=3)
gkf3_preds = cross_val_predict(ridge, X_scaled, y, cv=gkf3, groups=groups)
gkf3_mse = np.mean((y - gkf3_preds)**2)
print(f'GroupKFold (3) CV MSE: {gkf3_mse:.6f}')

print()
print(f'Ratio LOO/GKF5: {loo_mse/gkf_mse:.2f}x')
print(f'Ratio LOO/GKF3: {loo_mse/gkf3_mse:.2f}x')

Leave-One-Out CV MSE: 0.029170
GroupKFold (5) CV MSE: 0.027217
GroupKFold (3) CV MSE: 0.021450

Ratio LOO/GKF5: 1.07x
Ratio LOO/GKF3: 1.36x
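
The comparison above scales the features once on the full dataset and then cross-validates. A stricter variant refits the scaler inside each fold by wrapping it in a pipeline. The sketch below shows the pattern on synthetic placeholder data (`X_demo`, `y_demo`, `groups_demo` are made up for illustration, not the catechol dataset), so its MSE says nothing about the real problem.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: 6 groups, 4 features, 3 targets
rng = np.random.default_rng(0)
n_groups, per_group = 6, 20
groups_demo = np.repeat(np.arange(n_groups), per_group)
X_demo = rng.normal(size=(n_groups * per_group, 4)) + groups_demo[:, None]
y_demo = rng.normal(size=(n_groups * per_group, 3))

# The scaler is refit on the training rows of every fold
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
preds = cross_val_predict(model, X_demo, y_demo,
                          cv=LeaveOneGroupOut(), groups=groups_demo)
mse = np.mean((y_demo - preds) ** 2)
print(f'Leave-one-group-out MSE on synthetic data: {mse:.6f}')
```

Swapping in `X_feat`, `y` and `groups` from the notebook would give a leak-free counterpart to the numbers above; whether the difference is material here has not been checked.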


In [9]:
# Key insight: The CV scheme matters!
print('=== KEY INSIGHT ===')
print()
print(f'Leave-One-Out MSE: {loo_mse:.6f}')
print(f'GroupKFold (5) MSE: {gkf_mse:.6f}')
print(f'GroupKFold (3) MSE: {gkf3_mse:.6f}')
print()
print(f'LOO is {loo_mse/gkf_mse:.1f}x more pessimistic than GKF5')
print(f'LOO is {loo_mse/gkf3_mse:.1f}x more pessimistic than GKF3')
print()
print('This suggests:')
print('1. The LB evaluation may use a scheme closer to GroupKFold')
print(f'2. Our LOO CV is only {loo_mse/gkf_mse:.1f}-{loo_mse/gkf3_mse:.1f}x more pessimistic than GroupKFold for this Ridge baseline')
print('3. The model that\'s best under LOO may not be best under GKF')
print()
print('HOWEVER: The competition rules require LOO CV!')
print('We cannot change the CV scheme in the submission.')

=== KEY INSIGHT ===

Leave-One-Out MSE: 0.029170
GroupKFold (5) MSE: 0.027217
GroupKFold (3) MSE: 0.021450

LOO is 1.1x more pessimistic than GKF5
LOO is 1.4x more pessimistic than GKF3

This suggests:
1. The LB evaluation may use a scheme closer to GroupKFold
2. Our LOO CV is only 1.1-1.4x more pessimistic than GroupKFold for this Ridge baseline
3. The model that's best under LOO may not be best under GKF

HOWEVER: The competition rules require LOO CV!
We cannot change the CV scheme in the submission.


In [10]:
# What approaches haven't been tried?
print('=== APPROACHES NOT YET TRIED ===')
print()
print('1. DOMAIN ADAPTATION / TRANSFER LEARNING')
print('   - Pre-train on mixture data, fine-tune on single solvents')
print('   - Use mixture data as auxiliary task')
print()
print('2. META-LEARNING')
print('   - MAML or similar for few-shot adaptation to new solvents')
print('   - Learn to learn from limited solvent data')
print()
print('3. ENSEMBLE OF FUNDAMENTALLY DIFFERENT MODELS')
print('   - Current ensemble: GP + MLP + LGBM (all similar features)')
print('   - Try: Physics-based + Data-driven + Similarity-based')
print()
print('4. SOLVENT SIMILARITY WEIGHTING')
print('   - Weight training samples by similarity to test solvent')
print('   - Use Spange descriptors to compute similarity')
print('   - exp_037 tried this but may not have been optimal')
print()
print('5. UNCERTAINTY-WEIGHTED PREDICTIONS')
print('   - Use GP uncertainty to weight predictions')
print('   - Blend toward mean when uncertainty is high')
print()
print('6. ADVERSARIAL VALIDATION')
print('   - Identify which solvents are most different from test')
print('   - Focus on improving predictions for those solvents')

=== APPROACHES NOT YET TRIED ===

1. DOMAIN ADAPTATION / TRANSFER LEARNING
   - Pre-train on mixture data, fine-tune on single solvents
   - Use mixture data as auxiliary task

2. META-LEARNING
   - MAML or similar for few-shot adaptation to new solvents
   - Learn to learn from limited solvent data

3. ENSEMBLE OF FUNDAMENTALLY DIFFERENT MODELS
   - Current ensemble: GP + MLP + LGBM (all similar features)
   - Try: Physics-based + Data-driven + Similarity-based

4. SOLVENT SIMILARITY WEIGHTING
   - Weight training samples by similarity to test solvent
   - Use Spange descriptors to compute similarity
   - exp_037 tried this but may not have been optimal

5. UNCERTAINTY-WEIGHTED PREDICTIONS
   - Use GP uncertainty to weight predictions
   - Blend toward mean when uncertainty is high

6. ADVERSARIAL VALIDATION
   - Identify which solvents are most different from test
   - Focus on improving predictions for those solvents
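
Item 5 above can be sketched as follows. This is only an illustration of the idea: the weighting formula, the `scale` parameter and the function name `blend_toward_fallback` are assumptions made here, not something taken from an existing experiment, and the inputs are invented numbers rather than real GP outputs.

```python
import numpy as np

def blend_toward_fallback(pred, std, fallback, scale):
    """Shrink predictions toward a fallback value as predictive std grows.

    weight = 1 / (1 + (std / scale)**2) is an illustrative choice, not a tuned one:
    std = 0 keeps the prediction, std = scale gives an even blend.
    """
    pred = np.asarray(pred, dtype=float)
    std = np.asarray(std, dtype=float)
    weight = 1.0 / (1.0 + (std / scale) ** 2)
    return weight * pred + (1.0 - weight) * fallback

# Hypothetical numbers: the same prediction with increasing uncertainty
pred = np.array([0.80, 0.80, 0.80])
std = np.array([0.00, 0.10, 0.30])
blended = blend_toward_fallback(pred, std, fallback=0.40, scale=0.10)
print(blended)
```

Unlike a global mean-reversion factor, the amount of shrinkage here varies per sample, so confident predictions are left almost untouched.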


In [11]:
# Analyze which solvents have highest error
print('=== PER-SOLVENT ERROR ANALYSIS ===')
print()

solvent_errors = {}
for solvent in X_single['SOLVENT NAME'].unique():
    mask = X_single['SOLVENT NAME'] == solvent
    solvent_y = y[mask]
    solvent_pred = loo_preds[mask]
    solvent_mse = np.mean((solvent_y - solvent_pred)**2)
    solvent_errors[solvent] = solvent_mse

# Sort by error
sorted_errors = sorted(solvent_errors.items(), key=lambda x: x[1], reverse=True)

print('Top 10 hardest solvents (highest MSE):')
for solvent, mse in sorted_errors[:10]:
    print(f'  {solvent}: MSE = {mse:.6f}')

print()
print('Top 5 easiest solvents (lowest MSE):')
for solvent, mse in sorted_errors[-5:]:
    print(f'  {solvent}: MSE = {mse:.6f}')

=== PER-SOLVENT ERROR ANALYSIS ===

Top 10 hardest solvents (highest MSE):
  Cyclohexane: MSE = 0.198108
  1,1,1,3,3,3-Hexafluoropropan-2-ol: MSE = 0.096369
  2,2,2-Trifluoroethanol: MSE = 0.041910
  DMA [N,N-Dimethylacetamide]: MSE = 0.037331
  Acetonitrile.Acetic Acid: MSE = 0.033405
  Dihydrolevoglucosenone (Cyrene): MSE = 0.025707
  Decanol: MSE = 0.025242
  Ethylene Glycol [1,2-Ethanediol]: MSE = 0.020216
  Diethyl Ether [Ether]: MSE = 0.017297
  IPA [Propan-2-ol]: MSE = 0.015064

Top 5 easiest solvents (lowest MSE):
  Water.2,2,2-Trifluoroethanol: MSE = 0.006115
  THF [Tetrahydrofuran]: MSE = 0.004868
  tert-Butanol [2-Methylpropan-2-ol]: MSE = 0.003679
  Methyl Propionate: MSE = 0.002501
  Ethyl Acetate: MSE = 0.001646


In [12]:
# Calculate contribution to total error
print('=== ERROR CONTRIBUTION ANALYSIS ===')
print()

total_mse = loo_mse
print(f'Total LOO MSE: {total_mse:.6f}')
print()

print('Error contribution by solvent:')
cumulative = 0
for solvent, mse in sorted_errors[:10]:
    n_samples = (X_single['SOLVENT NAME'] == solvent).sum()
    contribution = mse * n_samples / len(X_single)
    cumulative += contribution
    pct = contribution / total_mse * 100
    cum_pct = cumulative / total_mse * 100
    print(f'  {solvent}: {pct:.1f}% (cumulative: {cum_pct:.1f}%)')

print()
print(f'Top 10 solvents account for {cum_pct:.1f}% of total error')

=== ERROR CONTRIBUTION ANALYSIS ===

Total LOO MSE: 0.029170

Error contribution by solvent:
  Cyclohexane: 35.2% (cumulative: 35.2%)
  1,1,1,3,3,3-Hexafluoropropan-2-ol: 18.6% (cumulative: 53.8%)
  2,2,2-Trifluoroethanol: 8.1% (cumulative: 61.9%)
  DMA [N,N-Dimethylacetamide]: 8.0% (cumulative: 69.9%)
  Acetonitrile.Acetic Acid: 3.8% (cumulative: 73.8%)
  Dihydrolevoglucosenone (Cyrene): 2.4% (cumulative: 76.2%)
  Decanol: 2.6% (cumulative: 78.8%)
  Ethylene Glycol [1,2-Ethanediol]: 2.3% (cumulative: 81.2%)
  Diethyl Ether [Ether]: 2.0% (cumulative: 83.1%)
  IPA [Propan-2-ol]: 0.4% (cumulative: 83.5%)

Top 10 solvents account for 83.5% of total error
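
The percentages above follow from the way the pooled MSE decomposes over solvents. Written out, with $n_s$ the number of rows for solvent $s$ and $N = 656$ the total row count (the notation is introduced here for illustration):

```latex
\mathrm{MSE}_{\text{total}} = \frac{1}{N}\sum_{s} n_s\,\mathrm{MSE}_s,
\qquad
c_s = \frac{n_s\,\mathrm{MSE}_s}{N\,\mathrm{MSE}_{\text{total}}},
\qquad
\sum_{s} c_s = 1
```

For Cyclohexane, $\mathrm{MSE}_s = 0.198108$ and $c_s \approx 0.352$ with $\mathrm{MSE}_{\text{total}} = 0.029170$, which is consistent with roughly 34 rows for that solvent (inferred from the printed values, not read from the data). Because $c_s$ weights by $n_s$, a solvent with a higher MSE can still contribute less than one with more rows, as with Cyrene and Decanol above.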


In [13]:
# Summary and recommendations
print('=== SUMMARY AND RECOMMENDATIONS ===')
print()
print('KEY FINDINGS:')
print('1. Mean reversion HURTS CV (exp_045) - predictions are NOT biased away from mean')
print(f'2. CV-LB intercept ({intercept:.4f}) > Target (0.0347) - current approach CANNOT reach target')
print(f'3. LOO CV is only {loo_mse/gkf_mse:.1f}-{loo_mse/gkf3_mse:.1f}x more pessimistic than GroupKFold (Ridge baseline)')
print(f'4. Top 10 hardest solvents account for {cum_pct:.1f}% of the error')
print()
print('CRITICAL INSIGHT:')
print('The competition uses LOO CV for evaluation, but the LB may use a different scheme.')
print('We cannot change the CV scheme, but we can optimize for better generalization.')
print()
print('RECOMMENDED APPROACHES:')
print('1. UNCERTAINTY-WEIGHTED ENSEMBLE')
print('   - Use GP uncertainty to weight predictions')
print('   - When uncertainty is high, blend toward a more conservative prediction')
print('   - This is different from mean reversion (which failed)')
print()
print('2. SOLVENT-SPECIFIC ADAPTATION')
print('   - Identify which solvents are hardest to predict')
print('   - Use different models or features for different solvent types')
print()
print('3. SIMPLER MODEL WITH BETTER GENERALIZATION')
print('   - exp_007 (simpler model) achieved LB 0.0932 with CV 0.0093')
print('   - exp_030 (complex ensemble) achieved LB 0.0877 with CV 0.0083')
print('   - The simpler model has a slightly lower LB/CV ratio (10.0x vs 10.6x)')
print()
print('4. FOCUS ON HARDEST SOLVENTS')
print('   - Cyclohexane and HFIP alone account for 53.8% of the Ridge LOO error')
print('   - Improving predictions for these could have outsized impact')

=== SUMMARY AND RECOMMENDATIONS ===

KEY FINDINGS:
1. Mean reversion HURTS CV (exp_045) - predictions are NOT biased away from mean
2. CV-LB intercept (0.0525) > Target (0.0347) - current approach CANNOT reach target
3. LOO CV is only 1.1-1.4x more pessimistic than GroupKFold (Ridge baseline)
4. Top 10 hardest solvents account for 83.5% of the error

CRITICAL INSIGHT:
The competition uses LOO CV for evaluation, but the LB may use a different scheme.
We cannot change the CV scheme, but we can optimize for better generalization.

RECOMMENDED APPROACHES:
1. UNCERTAINTY-WEIGHTED ENSEMBLE
   - Use GP uncertainty to weight predictions
   - When uncertainty is high, blend toward a more conservative prediction
   - This is different from mean reversion (which failed)

2. SOLVENT-SPECIFIC ADAPTATION
   - Identify which solvents are hardest to predict
   - Use different models or features for different solvent types

3. SIMPLER MODEL WITH BETTER GENERALIZATION
   - exp_007 (simpler model) achieved LB 0.0932 with CV 0.009

In [None]:
# What approaches have been tried?
print('=== APPROACHES TRIED ===')
print()
print('FEATURE ENGINEERING:')
print('  - Spange descriptors (13 features) - WORKS')
print('  - DRFP (2048 features, PCA) - WORSE')
print('  - Combined Spange + DRFP - SLIGHT IMPROVEMENT')
print('  - Arrhenius kinetics (1/T, ln(t)) - WORKS')
print('  - ACS PCA descriptors - WORSE')
print('  - Fragprints - WORSE')
print()
print('MODEL ARCHITECTURES:')
print('  - MLP [128, 128, 64] - BASELINE')
print('  - MLP [64, 32] - BETTER GENERALIZATION')
print('  - Deep Residual MLP - FAILED')
print('  - LightGBM - SLIGHTLY WORSE')
print('  - Ridge Regression - WORSE')
print('  - Gaussian Process - HELPS IN ENSEMBLE')
print('  - GNN (AttentiveFP) - FAILED (MSE 0.068767)')
print('  - ChemBERTa - FAILED')
print()
print('ENSEMBLE METHODS:')
print('  - GP + MLP + LGBM (exp_030) - BEST LB (0.0877)')
print('  - Various weight combinations - MARGINAL IMPROVEMENTS')
print()
print('CALIBRATION/POST-PROCESSING:')
print('  - Mean reversion (exp_045) - HURTS CV')
print('  - Calibration (exp_042) - HURTS CV')
print('  - Similarity weighting (exp_037) - HURTS CV')
print()
print('WHAT HASN\'T BEEN TRIED:')
print('  1. TRANSFER LEARNING from mixture data to single solvents')
print('  2. META-LEARNING (MAML) for few-shot adaptation')
print('  3. UNCERTAINTY-WEIGHTED predictions (different from mean reversion)')
print('  4. SOLVENT-SPECIFIC models (different model per solvent type)')
print('  5. ADVERSARIAL TRAINING to improve worst-case solvents')