# Loop 3 Analysis: Understanding the CV-LB Gap and Next Steps

## Key Observations:
1. The GP model (exp_002) reached CV 0.017921, worse than the tree-based model (0.010986).
2. The CV-LB gap is ~9x (CV 0.011 vs LB 0.0998).
3. Both submitted models (MLP and Trees) have nearly identical LB scores (~0.0998-0.0999).
4. The target of 0.017270 is 5.7x lower (better) than the best public LB of 0.098.

## Questions to Answer:
1. What is the actual LB metric? (MSE? MAE? Weighted?)
2. Why do different CV scores give similar LB scores?
3. What approaches haven't been tried?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load our submissions
exp_000 = pd.read_csv('/home/code/experiments/001_baseline/submission.csv')
exp_001 = pd.read_csv('/home/code/experiments/002_tree_ensemble/submission.csv')
exp_002 = pd.read_csv('/home/code/experiments/003_gaussian_process/submission.csv')

print('Submission shapes:')
print(f'exp_000: {exp_000.shape}')
print(f'exp_001: {exp_001.shape}')
print(f'exp_002: {exp_002.shape}')

Submission shapes:
exp_000: (1883, 8)
exp_001: (1883, 8)
exp_002: (1883, 8)


In [2]:
# Compare predictions between experiments
print('\n=== Prediction Statistics ===')
for name, df in [('exp_000 (MLP)', exp_000), ('exp_001 (Trees)', exp_001), ('exp_002 (GP)', exp_002)]:
    print(f'\n{name}:')
    for col in ['target_1', 'target_2', 'target_3']:
        print(f'  {col}: mean={df[col].mean():.4f}, std={df[col].std():.4f}, min={df[col].min():.4f}, max={df[col].max():.4f}')


=== Prediction Statistics ===

exp_000 (MLP):
  target_1: mean=0.1626, std=0.1392, min=0.0001, max=0.4477
  target_2: mean=0.1392, std=0.1279, min=0.0001, max=0.4257
  target_3: mean=0.5162, std=0.3495, min=0.0000, max=0.9983

exp_001 (Trees):
  target_1: mean=0.1560, std=0.1307, min=0.0000, max=0.4402
  target_2: mean=0.1425, std=0.1193, min=0.0000, max=0.4130
  target_3: mean=0.5074, std=0.3581, min=0.0000, max=1.0000

exp_002 (GP):
  target_1: mean=0.1410, std=0.1177, min=0.0000, max=0.4018
  target_2: mean=0.1282, std=0.1022, min=0.0000, max=0.4345
  target_3: mean=0.5473, std=0.3381, min=0.0000, max=1.0000


In [3]:
# Compare MLP predictions against Trees and GP (mean and max absolute difference per target)
print('\n=== Prediction Differences (MLP vs Trees) ===')
for col in ['target_1', 'target_2', 'target_3']:
    diff = (exp_000[col] - exp_001[col]).abs()
    print(f'{col}: mean_diff={diff.mean():.6f}, max_diff={diff.max():.6f}')

print('\n=== Prediction Differences (MLP vs GP) ===')
for col in ['target_1', 'target_2', 'target_3']:
    diff = (exp_000[col] - exp_002[col]).abs()
    print(f'{col}: mean_diff={diff.mean():.6f}, max_diff={diff.max():.6f}')


=== Prediction Differences (MLP vs Trees) ===
target_1: mean_diff=0.028663, max_diff=0.165259
target_2: mean_diff=0.030274, max_diff=0.191224
target_3: mean_diff=0.061630, max_diff=0.441261

=== Prediction Differences (MLP vs GP) ===
target_1: mean_diff=0.045911, max_diff=0.304992
target_2: mean_diff=0.051088, max_diff=0.266612
target_3: mean_diff=0.068739, max_diff=0.524555


In [4]:
# Load the actual data and summarize per-target statistics
DATA_PATH = '/home/data'

df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
df_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print('Data shapes:')
print(f'Single solvent: {df_single.shape}')
print(f'Full data: {df_full.shape}')

# Target statistics
print('\n=== Target Statistics (Single Solvent) ===')
for col in ['Product 2', 'Product 3', 'SM']:
    print(f'{col}: mean={df_single[col].mean():.4f}, std={df_single[col].std():.4f}')

print('\n=== Target Statistics (Full Data) ===')
for col in ['Product 2', 'Product 3', 'SM']:
    print(f'{col}: mean={df_full[col].mean():.4f}, std={df_full[col].std():.4f}')

Data shapes:
Single solvent: (656, 13)
Full data: (1227, 19)

=== Target Statistics (Single Solvent) ===
Product 2: mean=0.1499, std=0.1431
Product 3: mean=0.1234, std=0.1315
SM: mean=0.5222, std=0.3602

=== Target Statistics (Full Data) ===
Product 2: mean=0.1646, std=0.1535
Product 3: mean=0.1437, std=0.1458
SM: mean=0.4952, std=0.3794


In [5]:
# Analyze the CV-LB gap
print('\n=== CV-LB Gap Analysis ===')
print('\nSubmission History:')
print('exp_000 (MLP): CV=0.0113, LB=0.0998 → gap=8.8x')
print('exp_001 (Trees): CV=0.0110, LB=0.0999 → gap=9.1x')

print('\nKey Insight:')
print('Both submissions have nearly identical LB scores (~0.0998-0.0999)')
print('despite different CV scores (0.0113 vs 0.0110).')
print('This suggests the LB metric may be different from our CV MSE calculation.')

print('\nPossible explanations:')
print('1. LB uses a different metric (MAE? weighted MSE?)')
print('2. LB only evaluates a subset of predictions')
print('3. LB weights tasks differently (single vs full)')
print('4. Our models are overfitting to CV procedure')


=== CV-LB Gap Analysis ===

Submission History:
exp_000 (MLP): CV=0.0113, LB=0.0998 → gap=8.8x
exp_001 (Trees): CV=0.0110, LB=0.0999 → gap=9.1x

Key Insight:
Both submissions have nearly identical LB scores (~0.0998-0.0999)
despite different CV scores (0.0113 vs 0.0110).
This suggests the LB metric may be different from our CV MSE calculation.

Possible explanations:
1. LB uses a different metric (MAE? weighted MSE?)
2. LB only evaluates a subset of predictions
3. LB weights tasks differently (single vs full)
4. Our models are overfitting to CV procedure
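The gap figures above can be reproduced from the reported scores. The sketch below only restates arithmetic on the four numbers in the submission history; the `history` dict and the helper name `gap_ratio` are illustrative, not part of the original notebook.

```python
# CV and LB scores copied from the submission history above.
history = {
    "exp_000 (MLP)": {"cv": 0.0113, "lb": 0.0998},
    "exp_001 (Trees)": {"cv": 0.0110, "lb": 0.0999},
}


def gap_ratio(cv, lb):
    """How many times larger the LB score is than the CV score."""
    return lb / cv


ratios = {name: gap_ratio(**scores) for name, scores in history.items()}
for name, ratio in ratios.items():
    print(f"{name}: LB/CV = {ratio:.1f}x")

# Relative spread between the two experiments, for CV and for LB.
cv_rel_diff = abs(0.0113 - 0.0110) / 0.0110
lb_rel_diff = abs(0.0998 - 0.0999) / 0.0999
print(f"CV relative difference: {cv_rel_diff:.1%}")
print(f"LB relative difference: {lb_rel_diff:.1%}")
```

The two CV scores differ by under 3%, so near-identical LB scores may partly reflect how close the models already are in CV, in addition to any difference in metric.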


In [6]:
# Compare the target of 0.017270 with the best public LB
TARGET = 0.017270
BEST_PUBLIC_LB = 0.098

print('\n=== Target Analysis ===')
print(f'Target score: {TARGET:.6f}')
print(f'Best public LB: ~{BEST_PUBLIC_LB}')
print(f'Gap: {BEST_PUBLIC_LB / TARGET:.1f}x')

print('\nIf target is MSE:')
print(f'  Target RMSE: {np.sqrt(TARGET):.4f} = {np.sqrt(TARGET)*100:.2f}%')
print(f'  Best public RMSE: {np.sqrt(BEST_PUBLIC_LB):.4f} = {np.sqrt(BEST_PUBLIC_LB)*100:.2f}%')

print('\nIf target is MAE:')
print(f'  Target MAE: {TARGET:.4f} = {TARGET*100:.2f}%')
print(f'  Best public MAE: {BEST_PUBLIC_LB} = {BEST_PUBLIC_LB*100:.1f}%')

print('\nConclusion:')
print('The target of 0.017270 is MUCH better than the best public LB.')
print('This suggests either:')
print('1. A fundamentally different approach is needed')
print('2. The target represents a different evaluation')
print('3. There is a breakthrough technique we have not discovered')


=== Target Analysis ===
Target score: 0.017270
Best public LB: ~0.098
Gap: 5.7x

If target is MSE:
  Target RMSE: 0.1314 = 13.14%
  Best public RMSE: 0.3130 = 31.30%

If target is MAE:
  Target MAE: 0.0173 = 1.73%
  Best public MAE: 0.098 = 9.8%

Conclusion:
The target of 0.017270 is MUCH better than the best public LB.
This suggests either:
1. A fundamentally different approach is needed
2. The target represents a different evaluation
3. There is a breakthrough technique we have not discovered
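One way to put these scores in context is to compare them with a constant predictor. The sketch below assumes the score is a plain unweighted MSE averaged over the three targets, which is an assumption, since the LB metric is still unknown. It uses the per-target standard deviations printed in the Target Statistics output; because pandas `.std()` uses `ddof=1`, the squared values slightly overstate the mean-predictor MSE, which is negligible at these sample sizes. The helper name `mean_predictor_mse` is illustrative.

```python
import numpy as np

# Per-target standard deviations copied from the Target Statistics output
# (order: Product 2, Product 3, SM).
single_std = np.array([0.1431, 0.1315, 0.3602])
full_std = np.array([0.1535, 0.1458, 0.3794])


def mean_predictor_mse(stds):
    """Approximate MSE of always predicting each target's mean, averaged over targets."""
    return float(np.mean(np.square(stds)))


baseline_single = mean_predictor_mse(single_std)
baseline_full = mean_predictor_mse(full_std)
print(f"single solvent baseline MSE: {baseline_single:.4f}")
print(f"full data baseline MSE:      {baseline_full:.4f}")
```

Under these assumptions both baselines come out near 0.06, so an LB score of ~0.0998 would sit above a constant predictor. That may be a further hint that the LB metric, or the data it is computed on, differs from the CV setup; it is not a confirmed diagnosis.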


In [7]:
# What approaches haven't been tried?
print('\n=== Unexplored Approaches ===')
print('\n1. FEATURE ENGINEERING:')
print('   - drfps (2048 features) - differential reaction fingerprints')
print('   - fragprints (2133 features) - fragment + fingerprint')
print('   - Custom chemistry features (reaction SMILES)')
print('   - Feature selection/PCA on high-dim features')

print('\n2. MODEL ARCHITECTURES:')
print('   - Stacking ensemble (MLP + Trees + GP)')
print('   - Regressor chains (feed one prediction as input to next)')
print('   - Multi-output models with shared representations')
print('   - Transformer-based models for SMILES')

print('\n3. TRAINING STRATEGIES:')
print('   - Per-target optimization (different models for SM vs Products)')
print('   - Task-specific models (different for single vs full)')
print('   - Adversarial validation to identify distribution shift')
print('   - Pseudo-labeling or self-training')

print('\n4. POST-PROCESSING:')
print('   - Curds & Whey decorrelation')
print('   - Prediction calibration')
print('   - Ensemble blending with optimized weights')


=== Unexplored Approaches ===

1. FEATURE ENGINEERING:
   - drfps (2048 features) - differential reaction fingerprints
   - fragprints (2133 features) - fragment + fingerprint
   - Custom chemistry features (reaction SMILES)
   - Feature selection/PCA on high-dim features

2. MODEL ARCHITECTURES:
   - Stacking ensemble (MLP + Trees + GP)
   - Regressor chains (feed one prediction as input to next)
   - Multi-output models with shared representations
   - Transformer-based models for SMILES

3. TRAINING STRATEGIES:
   - Per-target optimization (different models for SM vs Products)
   - Task-specific models (different for single vs full)
   - Adversarial validation to identify distribution shift
   - Pseudo-labeling or self-training

4. POST-PROCESSING:
   - Curds & Whey decorrelation
   - Prediction calibration
   - Ensemble blending with optimized weights
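Of the approaches listed above, regressor chains are quick to prototype. The following is a minimal sketch using scikit-learn's `RegressorChain` on synthetic stand-in data, not the competition data. The `Ridge` base model, the chain order, and the way the synthetic targets depend on each other are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import RegressorChain

# Synthetic stand-in data (NOT the competition data): three correlated targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
sm = X[:, 0] + 0.1 * rng.normal(size=200)
p2 = 0.5 * sm + X[:, 1] + 0.1 * rng.normal(size=200)
p3 = 0.3 * p2 + X[:, 2] + 0.1 * rng.normal(size=200)
Y = np.column_stack([p2, p3, sm])  # column order: Product 2, Product 3, SM

# order=[2, 0, 1] models SM first, then Product 2, then Product 3.
# At predict time each model receives the earlier models' predictions
# as extra input features.
chain = RegressorChain(Ridge(alpha=1.0), order=[2, 0, 1])
chain.fit(X, Y)

pred = chain.predict(X)  # columns are returned in the original Y order
print(pred.shape)
```

Whether a chain helps here depends on how strongly the three targets are correlated given the features, which would need to be checked on the real data with the same CV procedure as the other experiments.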


In [8]:
# Key insight from public kernels
print('\n=== Key Insights from Public Kernels ===')
print('\n1. Arrhenius Kinetics + TTA (LB 0.09831):')
print('   - Uses inv_temp, log_time, interaction features')
print('   - Symmetry TTA for mixed solvents')
print('   - 7 models for bagging')
print('   - HuberLoss for robustness')

print('\n2. Per-Target Ensemble (LB 0.11161):')
print('   - Different models for SM vs Products')
print('   - HGB for SM, ExtraTrees for Products')
print('   - Weighted ensemble (0.65 ACS PCA + 0.35 Spange)')
print('   - Uses both feature tables')

print('\n3. System Malfunction V1 (29 votes):')
print('   - Simple MLP with BatchNorm')
print('   - Uses spange_descriptors only')
print('   - No Arrhenius features')

print('\nKey Takeaway:')
print('The best public kernels achieve LB ~0.098-0.11.')
print('To reach target 0.017, we need a 5-6x improvement.')
print('This is a HUGE gap that requires a breakthrough.')


=== Key Insights from Public Kernels ===

1. Arrhenius Kinetics + TTA (LB 0.09831):
   - Uses inv_temp, log_time, interaction features
   - Symmetry TTA for mixed solvents
   - 7 models for bagging
   - HuberLoss for robustness

2. Per-Target Ensemble (LB 0.11161):
   - Different models for SM vs Products
   - HGB for SM, ExtraTrees for Products
   - Weighted ensemble (0.65 ACS PCA + 0.35 Spange)
   - Uses both feature tables

3. System Malfunction V1 (29 votes):
   - Simple MLP with BatchNorm
   - Uses spange_descriptors only
   - No Arrhenius features

Key Takeaway:
The best public kernels achieve LB ~0.098-0.11.
To reach target 0.017, we need a 5-6x improvement.
This is a HUGE gap that requires a breakthrough.
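The Arrhenius-style features mentioned for the top kernel (`inv_temp`, `log_time`, and an interaction term) can be sketched as follows. This is not the kernel's actual code: the column names `Temperature` and `Residence Time`, the assumption that temperature is in degrees Celsius, and the exact form of the interaction feature are all hypothetical.

```python
import numpy as np
import pandas as pd


def add_arrhenius_features(df, temp_col="Temperature", time_col="Residence Time"):
    """Return a copy of df with inverse-temperature and log-time features.

    Assumes temp_col is in degrees Celsius and time_col is strictly positive;
    both column names are hypothetical defaults.
    """
    out = df.copy()
    temp_kelvin = out[temp_col] + 273.15
    out["inv_temp"] = 1.0 / temp_kelvin
    out["log_time"] = np.log(out[time_col])
    # One possible interaction term; the kernel may define it differently.
    out["inv_temp_x_log_time"] = out["inv_temp"] * out["log_time"]
    return out


# Tiny made-up frame, only to show the call.
demo = pd.DataFrame({"Temperature": [175.0, 225.0], "Residence Time": [2.0, 15.0]})
features = add_arrhenius_features(demo)
print(features)
```

The motivation is that rate constants in the Arrhenius equation depend on 1/T rather than T, so a model may find the relationship easier to fit in these transformed coordinates.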


In [9]:
# Recommended next steps
print('\n=== RECOMMENDED NEXT STEPS ===')
print('\nPRIORITY 1: Stacking Ensemble')
print('   - Combine MLP + Trees predictions')
print('   - Use meta-learner to blend')
print('   - Leverage diversity between model families')

print('\nPRIORITY 2: High-Dimensional Features')
print('   - Try drfps (2048 features) with dimensionality reduction')
print('   - Try fragprints (2133 features) with feature selection')
print('   - These may capture chemistry better than spange')

print('\nPRIORITY 3: Per-Target Optimization')
print('   - SM has highest variance - focus on improving SM predictions')
print('   - Different models/features for each target')
print('   - Regressor chains to capture target correlations')

print('\nPRIORITY 4: Investigate LB Metric')
print('   - The CV-LB gap is consistent (~9x)')
print('   - Need to understand what the LB actually measures')
print('   - Consider that LB may weight tasks/targets differently')


=== RECOMMENDED NEXT STEPS ===

PRIORITY 1: Stacking Ensemble
   - Combine MLP + Trees predictions
   - Use meta-learner to blend
   - Leverage diversity between model families

PRIORITY 2: High-Dimensional Features
   - Try drfps (2048 features) with dimensionality reduction
   - Try fragprints (2133 features) with feature selection
   - These may capture chemistry better than spange

PRIORITY 3: Per-Target Optimization
   - SM has highest variance - focus on improving SM predictions
   - Different models/features for each target
   - Regressor chains to capture target correlations

PRIORITY 4: Investigate LB Metric
   - The CV-LB gap is consistent (~9x)
   - Need to understand what the LB actually measures
   - Consider that LB may weight tasks/targets differently
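As a first step toward Priority 1, a single blend weight between two models can be tuned on out-of-fold (OOF) predictions before investing in a full meta-learner. The sketch below is that simpler one-parameter blend, not a stacking implementation, and it runs on synthetic OOF arrays rather than the real experiment outputs. The function name `best_blend_weight` and the noise levels are made up for illustration.

```python
import numpy as np


def best_blend_weight(oof_a, oof_b, y_true, n_grid=101):
    """Grid-search w in [0, 1] minimizing the MSE of w * oof_a + (1 - w) * oof_b."""
    weights = np.linspace(0.0, 1.0, n_grid)
    mses = [
        np.mean((w * oof_a + (1.0 - w) * oof_b - y_true) ** 2) for w in weights
    ]
    best = int(np.argmin(mses))
    return float(weights[best]), float(mses[best])


# Synthetic stand-ins for OOF predictions of two models with independent errors.
rng = np.random.default_rng(0)
y_true = rng.uniform(0.0, 1.0, size=(500, 3))
oof_mlp = y_true + rng.normal(0.0, 0.10, size=y_true.shape)
oof_trees = y_true + rng.normal(0.0, 0.10, size=y_true.shape)

w, blended_mse = best_blend_weight(oof_mlp, oof_trees, y_true)
mse_mlp = float(np.mean((oof_mlp - y_true) ** 2))
mse_trees = float(np.mean((oof_trees - y_true) ** 2))
print(f"w={w:.2f}, blended={blended_mse:.5f}, mlp={mse_mlp:.5f}, trees={mse_trees:.5f}")
```

Because the grid includes w = 0 and w = 1, the blend can never score worse than the better single model on the data it was tuned on. The weight should still be chosen on OOF predictions rather than in-sample fits, or the gain will be overstated.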
