# Loop 6 Analysis: CV-LB Gap and Strategy Assessment

## Key Questions:
1. Why is the LB score consistently ~9x the CV score?
2. What approaches might reduce this gap?
3. Is the target (0.023) achievable with MLP-based approaches?

In [1]:
import pandas as pd
import numpy as np
import json

# Load session state to analyze experiments
with open('/home/code/session_state.json', 'r') as f:
    state = json.load(f)

# Analyze submissions
print('=== SUBMISSION ANALYSIS ===')
submissions = state.get('submissions', [])
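for sub in submissions:
    cv = sub.get('cv_score', 'N/A')
    lb = sub.get('lb_score', 'N/A')
    exp_id = sub.get('experiment_id', 'N/A')
    # Report the LB/CV ratio only when both scores are usable numbers
    # (guards against missing values, None, and division by zero)
    if isinstance(cv, (int, float)) and isinstance(lb, (int, float)) and cv > 0:
        ratio = lb / cv
        print(f'{exp_id}: CV={cv:.4f}, LB={lb:.4f}, Ratio={ratio:.1f}x')
    else:
        print(f'{exp_id}: CV={cv}, LB={lb}')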

=== SUBMISSION ANALYSIS ===
exp_000: CV=0.0111, LB=0.0982, Ratio=8.9x
exp_001: CV=0.0123, LB=0.1065, Ratio=8.7x
exp_003: CV=0.0105, LB=0.0972, Ratio=9.3x


In [2]:
# Analyze all experiments
print('\n=== EXPERIMENT SCORES ===')
experiments = state.get('experiments', [])
for exp in experiments:
    print(f"{exp['id']}: {exp['name'][:50]:50s} CV={exp['score']:.6f}")

# Best CV
best_cv = min(exp['score'] for exp in experiments)
print(f'\nBest CV: {best_cv:.6f}')

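# Predicted LB for the best CV, assuming the ~9x LB/CV ratio seen on past submissions
assumed_ratio = 9
target_lb = 0.023
predicted_lb = best_cv * assumed_ratio
print(f'Predicted LB ({assumed_ratio}x ratio): {predicted_lb:.4f}')
print(f'Target: {target_lb}')
# Relative excess of the predicted LB over the target, in percent
print(f'Gap to target: {(predicted_lb - target_lb) / target_lb * 100:.1f}%')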


=== EXPERIMENT SCORES ===
exp_000: Baseline MLP with Arrhenius Kinetics + TTA         CV=0.011081
exp_001: LightGBM Baseline with Arrhenius Kinetics + TTA    CV=0.012297
exp_002: DRFP MLP with PCA (100 components)                 CV=0.016948
exp_003: Combined Spange + DRFP (high-variance) + Arrhenius CV=0.010501
exp_004: Deep Residual MLP with Large Ensemble (FAILED)     CV=0.051912
exp_005: Large Ensemble (15 models) with Same Architecture  CV=0.010430

Best CV: 0.010430
Predicted LB (9x ratio): 0.0939
Target: 0.023
Gap to target: 308.1%


In [3]:
# Analyze the CV-LB gap more carefully
print('\n=== CV-LB GAP ANALYSIS ===')

# Known submissions as (experiment id, CV, LB), copied from the cell 1 output at 4-decimal precision
submission_data = [
    ('exp_000', 0.0111, 0.0982),
    ('exp_001', 0.0123, 0.1065),
    ('exp_003', 0.0105, 0.0972),
]

for exp_id, cv, lb in submission_data:
    ratio = lb / cv
    print(f'{exp_id}: CV={cv:.4f}, LB={lb:.4f}, Ratio={ratio:.2f}x')

avg_ratio = np.mean([lb/cv for _, cv, lb in submission_data])
print(f'\nAverage CV-LB ratio: {avg_ratio:.2f}x')

# To beat target 0.023
target = 0.023
required_cv = target / avg_ratio
print(f'\nTo beat target {target}:')
print(f'  Required CV: {required_cv:.6f}')
print(f'  Current best CV: {best_cv:.6f}')
print(f'  Improvement needed: {(best_cv - required_cv) / best_cv * 100:.1f}%')


=== CV-LB GAP ANALYSIS ===
exp_000: CV=0.0111, LB=0.0982, Ratio=8.85x
exp_001: CV=0.0123, LB=0.1065, Ratio=8.66x
exp_003: CV=0.0105, LB=0.0972, Ratio=9.26x

Average CV-LB ratio: 8.92x

To beat target 0.023:
  Required CV: 0.002578
  Current best CV: 0.010430
  Improvement needed: 75.3%
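
The required-CV estimate above assumes LB is a fixed multiple of CV. A rough cross-check is to fit an offset model (LB = slope * CV + intercept) to the same three submissions and look at the intercept. This is only a sketch: the scores are copied from the output above, the variable names are illustrative, and three points are far too few for a reliable fit.

```python
# Compare the pure-ratio view (LB = k * CV) with an offset model
# (LB = slope * CV + intercept) on the three submitted experiments.
# Scores are the rounded values printed above.
submitted = [
    ("exp_000", 0.0111, 0.0982),
    ("exp_001", 0.0123, 0.1065),
    ("exp_003", 0.0105, 0.0972),
]

cv_scores = [cv for _, cv, _ in submitted]
lb_scores = [lb for _, _, lb in submitted]
n = len(submitted)

# Ordinary least squares with an intercept, written out by hand
cv_mean = sum(cv_scores) / n
lb_mean = sum(lb_scores) / n
s_xy = sum((cv - cv_mean) * (lb - lb_mean) for cv, lb in zip(cv_scores, lb_scores))
s_xx = sum((cv - cv_mean) ** 2 for cv in cv_scores)
slope = s_xy / s_xx
intercept = lb_mean - slope * cv_mean

print(f"Offset model: LB ~ {slope:.2f} * CV + {intercept:.4f}")
print(f"LB implied at CV = 0: {intercept:.4f} (target: 0.023)")
```

If the fitted intercept sits well above the target, lowering CV alone would not reach the target under this model. If it is close to zero, the pure-ratio view is the better description. With only three submissions, neither reading should be treated as established.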


In [4]:
# Analyze what's causing the CV-LB gap
print('\n=== POTENTIAL CAUSES OF CV-LB GAP ===')
print('''
1. LEAVE-ONE-SOLVENT-OUT GENERALIZATION:
   - CV tests on solvents seen during training (other folds)
   - LB tests on completely unseen solvents
   - This is fundamentally harder than interpolation

2. DISTRIBUTION SHIFT:
   - Training solvents may have different properties than test solvents
   - Model may be overfitting to training solvent characteristics

3. MODEL VARIANCE:
   - Neural networks have high variance between runs
   - Different random seeds give different predictions
   - Larger ensembles reduce this but don't eliminate it

4. FEATURE REPRESENTATION:
   - Spange descriptors may not capture all relevant solvent properties
   - DRFP may not generalize well to unseen solvents
   - Linear mixing for mixtures may be too simplistic
''')


=== POTENTIAL CAUSES OF CV-LB GAP ===

1. LEAVE-ONE-SOLVENT-OUT GENERALIZATION:
   - CV tests on solvents seen during training (other folds)
   - LB tests on completely unseen solvents
   - This is fundamentally harder than interpolation

2. DISTRIBUTION SHIFT:
   - Training solvents may have different properties than test solvents
   - Model may be overfitting to training solvent characteristics

3. MODEL VARIANCE:
   - Neural networks have high variance between runs
   - Different random seeds give different predictions
   - Larger ensembles reduce this but don't eliminate it

4. FEATURE REPRESENTATION:
   - Spange descriptors may not capture all relevant solvent properties
   - DRFP may not generalize well to unseen solvents
   - Linear mixing for mixtures may be too simplistic
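
The "linear mixing" in item 4 can be sketched as a fraction-weighted average of the two solvents' descriptor vectors. The function name, the fraction convention, and the descriptor values below are hypothetical and are not taken from the experiments.

```python
def mix_descriptors(desc_a, desc_b, frac_b):
    """Linearly interpolate two solvent descriptor vectors.

    frac_b is the fraction of solvent B in the mixture
    (0.0 = pure A, 1.0 = pure B).
    """
    if not 0.0 <= frac_b <= 1.0:
        raise ValueError("frac_b must lie in [0, 1]")
    if len(desc_a) != len(desc_b):
        raise ValueError("descriptor vectors must have the same length")
    return [(1.0 - frac_b) * a + frac_b * b for a, b in zip(desc_a, desc_b)]


# Made-up descriptor values, for illustration only
solvent_a = [0.2, 1.0, 0.5]
solvent_b = [0.6, 0.0, 0.5]
mixed = mix_descriptors(solvent_a, solvent_b, 0.25)
print(mixed)
```

A linear blend cannot represent non-ideal mixture behaviour, such as a property that peaks at an intermediate composition. That is one way it could be too simplistic.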



In [5]:
# Load and analyze the data to understand the problem better
DATA_PATH = '/home/data'

df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
df_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print('=== DATA OVERVIEW ===')
print(f'Single solvent samples: {len(df_single)}')
print(f'Full data samples: {len(df_full)}')
print(f'Total samples: {len(df_single) + len(df_full)}')

print(f'\nUnique solvents (single): {df_single["SOLVENT NAME"].nunique()}')
print(f'Unique solvent pairs (full): {len(df_full[["SOLVENT A NAME", "SOLVENT B NAME"]].drop_duplicates())}')

=== DATA OVERVIEW ===
Single solvent samples: 656
Full data samples: 1227
Total samples: 1883

Unique solvents (single): 24
Unique solvent pairs (full): 13
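
With 24 single solvents, a leave-one-solvent-out scheme gives 24 folds, each holding out every row of one solvent. Below is a minimal sketch of such a split. The helper name and toy labels are hypothetical, and this is not necessarily the exact split used by the competition.

```python
def leave_one_solvent_out(solvent_names):
    """Yield (held_out_solvent, train_indices, test_indices) per unique solvent."""
    for held_out in sorted(set(solvent_names)):
        test_idx = [i for i, name in enumerate(solvent_names) if name == held_out]
        train_idx = [i for i, name in enumerate(solvent_names) if name != held_out]
        yield held_out, train_idx, test_idx


# Toy labels for illustration; in this notebook they could come from
# df_single["SOLVENT NAME"].tolist()
names = ["Acetonitrile", "Decanol", "Acetonitrile", "Cyclohexane", "Decanol"]
folds = list(leave_one_solvent_out(names))
for solvent, train_idx, test_idx in folds:
    print(f"hold out {solvent}: train={train_idx}, test={test_idx}")
```

The held-out solvent never appears in the training indices of its own fold, so each fold scores the model on a solvent it was not fitted on.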


In [6]:
# Analyze target distributions
print('\n=== TARGET DISTRIBUTIONS ===')
for col in ['Product 2', 'Product 3', 'SM']:
    print(f'\n{col}:')
    print(f'  Single - Mean: {df_single[col].mean():.4f}, Std: {df_single[col].std():.4f}')
    print(f'  Full   - Mean: {df_full[col].mean():.4f}, Std: {df_full[col].std():.4f}')


=== TARGET DISTRIBUTIONS ===

Product 2:
  Single - Mean: 0.1499, Std: 0.1431
  Full   - Mean: 0.1646, Std: 0.1535

Product 3:
  Single - Mean: 0.1234, Std: 0.1315
  Full   - Mean: 0.1437, Std: 0.1458

SM:
  Single - Mean: 0.5222, Std: 0.3602
  Full   - Mean: 0.4952, Std: 0.3794


In [7]:
# Per-solvent mean and standard deviation of each target (single-solvent data)
print('\n=== PER-SOLVENT VARIANCE (Single Solvent) ===')
solvent_stats = df_single.groupby('SOLVENT NAME')[['Product 2', 'Product 3', 'SM']].agg(['mean', 'std'])
print(solvent_stats.head(10))

# Which solvents have the highest SM standard deviation?
print('\n=== SOLVENTS WITH HIGHEST SM VARIANCE ===')
sm_std = df_single.groupby('SOLVENT NAME')['SM'].std().sort_values(ascending=False)
print(sm_std.head(10))


=== PER-SOLVENT VARIANCE (Single Solvent) ===
                                  Product 2           Product 3            \
                                       mean       std      mean       std   
SOLVENT NAME                                                                
1,1,1,3,3,3-Hexafluoropropan-2-ol  0.319727  0.092840  0.285405  0.071676   
2,2,2-Trifluoroethanol             0.156764  0.073588  0.050041  0.055906   
2-Methyltetrahydrofuran [2-MeTHF]  0.150618  0.145866  0.100640  0.088436   
Acetonitrile                       0.156390  0.147598  0.088966  0.090181   
Acetonitrile.Acetic Acid           0.019341  0.011275  0.020625  0.008009   
Butanone [MEK]                     0.047166  0.018425  0.042997  0.016497   
Cyclohexane                        0.083866  0.089318  0.049289  0.050019   
DMA [N,N-Dimethylacetamide]        0.117138  0.126535  0.097634  0.102628   
Decanol                            0.194795  0.174740  0.207992  0.184415   
Diethyl Ether [Ether]        

In [8]:
# Key insight: The problem is fundamentally about generalizing to unseen solvents
print('\n=== KEY INSIGHTS ===')
print('''
1. The CV-LB gap (~9x) is consistent across all experiments
   - This suggests the gap is inherent to the problem, not the model
   - Leave-one-solvent-out CV is fundamentally different from LB evaluation

2. Larger ensembles (15 vs 5 models) only improved CV by 0.7%
   - Variance reduction has diminishing returns
   - The gap is NOT primarily due to model variance

3. The target (0.023) may be unrealistic for MLP approaches
   - GNN benchmark achieved 0.0039 using graph neural networks
   - Our best LB (0.0972) is competitive for MLP approaches
   - To beat 0.023 with 9x ratio, need CV ~0.0026 (75% improvement)

4. Potential strategies to reduce CV-LB gap:
   a. Simpler models (less overfitting to training solvents)
   b. Better regularization (L1/L2, dropout)
   c. Feature engineering (more generalizable features)
   d. Per-target models (different patterns for different targets)
   e. Gaussian Processes (better uncertainty quantification)
''')


=== KEY INSIGHTS ===

1. The CV-LB gap (~9x) is consistent across all experiments
   - This suggests the gap is inherent to the problem, not the model
   - Leave-one-solvent-out CV is fundamentally different from LB evaluation

2. Larger ensembles (15 vs 5 models) only improved CV by 0.7%
   - Variance reduction has diminishing returns
   - The gap is NOT primarily due to model variance

3. The target (0.023) may be unrealistic for MLP approaches
   - GNN benchmark achieved 0.0039 using graph neural networks
   - Our best LB (0.0972) is competitive for MLP approaches
   - To beat 0.023 with 9x ratio, need CV ~0.0026 (75% improvement)

4. Potential strategies to reduce CV-LB gap:
   a. Simpler models (less overfitting to training solvents)
   b. Better regularization (L1/L2, dropout)
   c. Feature engineering (more generalizable features)
   d. Per-target models (different patterns for different targets)
   e. Gaussian Processes (better uncertainty quantification)
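
The diminishing returns noted in insight 2 match the standard result for averaging n equally correlated predictors: the variance of the average is rho * sigma2 + (1 - rho) * sigma2 / n, so only the second term shrinks as models are added. The sketch below evaluates that formula. The correlation and variance values are illustrative assumptions, not measurements from these experiments.

```python
def ensemble_variance(sigma2, n_models, rho):
    """Variance of the mean of n_models predictors, each with variance sigma2
    and pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models


# Illustrative values only: unit variance per model, pairwise correlation 0.8
for n_models in (1, 5, 15):
    print(n_models, round(ensemble_variance(1.0, n_models, rho=0.8), 4))
```

When ensemble members are highly correlated, as models sharing an architecture and feature set often are, the rho * sigma2 floor dominates, so going from 5 to 15 models changes little. The formula covers variance only and says nothing about any bias from unseen solvents.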



In [9]:
# Analyze the latest experiment (exp_005) results
print('\n=== LATEST EXPERIMENT ANALYSIS (exp_005) ===')
print('''
exp_005: Large Ensemble (15 models) with Same Architecture
- Single Solvent MSE: 0.011533 (slightly worse than exp_003)
- Full Data MSE: 0.009841 (slightly better than exp_003)
- Overall MSE: 0.010430 (0.7% better than exp_003)

Key observations:
1. Larger ensemble helped more for mixtures than single solvents
2. Marginal improvement suggests we're near the ceiling
3. Training time increased 3x (6.5h vs 2h) for 0.7% improvement

This experiment has NOT been submitted to LB yet.
With 2 submissions remaining, we should:
1. Submit exp_005 to validate if variance reduction helps on LB
2. If LB doesn't improve, pivot to fundamentally different approaches
''')


=== LATEST EXPERIMENT ANALYSIS (exp_005) ===

exp_005: Large Ensemble (15 models) with Same Architecture
- Single Solvent MSE: 0.011533 (slightly worse than exp_003)
- Full Data MSE: 0.009841 (slightly better than exp_003)
- Overall MSE: 0.010430 (0.7% better than exp_003)

Key observations:
1. Larger ensemble helped more for mixtures than single solvents
2. Marginal improvement suggests we're near the ceiling
3. Training time increased 3x (6.5h vs 2h) for 0.7% improvement

This experiment has NOT been submitted to LB yet.
With 2 submissions remaining, we should:
1. Submit exp_005 to validate if variance reduction helps on LB
2. If LB doesn't improve, pivot to fundamentally different approaches
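
The overall MSE reported for exp_005 looks consistent with a sample-count-weighted average of the two task MSEs, using the row counts from the data overview (656 single-solvent rows, 1227 full-data rows). This is an inference from the printed numbers, not a documented definition of the metric.

```python
# Row counts from the data overview and MSEs from the exp_005 summary
n_single, n_full = 656, 1227
mse_single, mse_full = 0.011533, 0.009841

# Sample-count-weighted average of the two task MSEs
overall = (n_single * mse_single + n_full * mse_full) / (n_single + n_full)
print(f"Sample-weighted overall MSE: {overall:.6f}")
```

Under this reading, the full-data task carries roughly two thirds of the weight, so a gain on mixtures can outweigh a small regression on single solvents in the overall score.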



In [10]:
# Final recommendation
print('\n=== STRATEGIC RECOMMENDATION ===')
print('''
Given the analysis:

1. SUBMIT exp_005 to LB (Priority 1)
   - Validate if larger ensemble reduces CV-LB gap
   - If LB improves proportionally to CV, continue this direction
   - If not, the gap is NOT due to model variance

2. If LB doesn't improve, try SIMPLER models (Priority 2)
   - Reduce model complexity to prevent overfitting
   - Try MLP [64, 32] or even linear models
   - May generalize better to unseen solvents

3. Try per-target models (Priority 3)
   - SM, Product 2, Product 3 may have different optimal patterns
   - Competition allows different hyperparameters for different objectives

4. Consider Gaussian Processes (Priority 4)
   - Better uncertainty quantification
   - May handle small data better
   - Tanimoto kernel for molecular similarity

REALITY CHECK:
- Target (0.023) is 4x better than our best LB (0.0972)
- With 9x CV-LB ratio, would need CV ~0.0026
- This is 75% improvement from current best CV (0.0104)
- May be unrealistic for MLP approaches
- Focus on closing the CV-LB gap rather than improving CV
''')


=== STRATEGIC RECOMMENDATION ===

Given the analysis:

1. SUBMIT exp_005 to LB (Priority 1)
   - Validate if larger ensemble reduces CV-LB gap
   - If LB improves proportionally to CV, continue this direction
   - If not, the gap is NOT due to model variance

2. If LB doesn't improve, try SIMPLER models (Priority 2)
   - Reduce model complexity to prevent overfitting
   - Try MLP [64, 32] or even linear models
   - May generalize better to unseen solvents

3. Try per-target models (Priority 3)
   - SM, Product 2, Product 3 may have different optimal patterns
   - Competition allows different hyperparameters for different objectives

4. Consider Gaussian Processes (Priority 4)
   - Better uncertainty quantification
   - May handle small data better
   - Tanimoto kernel for molecular similarity

REALITY CHECK:
- Target (0.023) is 4x better than our best LB (0.0972)
- With 9x CV-LB ratio, would need CV ~0.0026
- This is 75% improvement from current best CV (0.0104)
- May be unrealistic f