# Loop 16 Analysis: Understanding the Dual-Model Ensemble Architecture

**Key Finding from Evaluator:**
- exp_004 (CV 0.0623) trains SEPARATE models on spange AND acs_pca features, then combines PREDICTIONS
- exp_016 (CV 0.0830) combines FEATURES first, then trains single models

This architectural difference is the most likely explanation for the 33% gap in CV score (0.0830 vs. 0.0623).

In [1]:
import numpy as np
import pandas as pd

# Load data
DATA_PATH = '/home/data'
Spange = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv')
ACS_PCA = pd.read_csv(f'{DATA_PATH}/acs_pca_descriptors_lookup.csv')

print(f'Spange features: {Spange.shape[1]-1} dimensions')
print(f'ACS_PCA features: {ACS_PCA.shape[1]-1} dimensions')
print(f'\nSpange columns: {list(Spange.columns)}')
print(f'\nACS_PCA columns: {list(ACS_PCA.columns)}')

Spange features: 13 dimensions
ACS_PCA features: 5 dimensions

Spange columns: ['SOLVENT NAME', 'dielectric constant', 'ET(30)', 'alpha', 'beta', 'pi*', 'SA', 'SB', 'SP', 'SdP', 'N', 'n', 'f(n)', 'delta']

ACS_PCA columns: ['SOLVENT NAME', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']
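
To show how a lookup table like these might feed a model, here is a minimal sketch that turns a list of solvent names into a feature matrix. The in-memory table, the solvent names, and the numeric values are placeholders invented for illustration; only the `SOLVENT NAME` and `PC1`..`PC5` column names come from the output above, and `featurize` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for acs_pca_descriptors_lookup.csv.
# Solvent names and values are placeholders, not real descriptors.
toy_acs_pca = pd.DataFrame({
    'SOLVENT NAME': ['solvent_a', 'solvent_b', 'solvent_c'],
    'PC1': [0.1, -0.4, 1.2],
    'PC2': [0.0, 0.3, -0.7],
    'PC3': [1.5, 0.2, 0.4],
    'PC4': [-0.2, 0.9, 0.1],
    'PC5': [0.6, -1.1, 0.0],
})


def featurize(solvent_names, lookup, key='SOLVENT NAME'):
    """Return one descriptor row per requested solvent, in request order."""
    indexed = lookup.set_index(key)
    missing = [name for name in solvent_names if name not in indexed.index]
    if missing:
        raise KeyError(f'No descriptors for: {missing}')
    return indexed.loc[list(solvent_names)].to_numpy(dtype=float)


# Repeated names are fine: each request gets its own row.
X_acs = featurize(['solvent_c', 'solvent_a', 'solvent_c'], toy_acs_pca)
print(X_acs.shape)
```

Failing loudly on a missing solvent avoids silently training on NaN rows; the same helper would work unchanged on the Spange table, which shares the `SOLVENT NAME` key.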


In [2]:
# Analyze experiment history
experiments = [
    ('exp_004', 0.0623, 0.0956, 'Per-Target (HGB+ETR) NO TTA - SEPARATE models, PREDICTION combination'),
    ('exp_006', 0.0688, 0.0991, 'Per-Target depth=5/7 - COMBINED features'),
    ('exp_010', 0.0669, None, 'MLP + GBDT Ensemble'),
    ('exp_013', 0.0827, None, 'LOO Ensemble'),
    ('exp_014', 0.0834, None, 'Optuna Per-Target'),
    ('exp_015', 0.0891, None, 'MLP + Per-Target COMBINED'),
    ('exp_016', 0.0830, None, 'Hybrid Task-Specific - COMBINED features'),
]

print('Experiment Analysis:')
print('='*80)
for exp, cv, lb, desc in experiments:
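    # Compare against None explicitly so a legitimate score of 0.0 is not treated as missing
    lb_str = f'{lb:.4f}' if lb is not None else 'N/A'
    gap = f'{(lb-cv)/cv*100:.1f}%' if lb is not None else 'N/A'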
    print(f'{exp}: CV={cv:.4f}, LB={lb_str}, Gap={gap}')
    print(f'  -> {desc}')
    print()

Experiment Analysis:
exp_004: CV=0.0623, LB=0.0956, Gap=53.5%
  -> Per-Target (HGB+ETR) NO TTA - SEPARATE models, PREDICTION combination

exp_006: CV=0.0688, LB=0.0991, Gap=44.0%
  -> Per-Target depth=5/7 - COMBINED features

exp_010: CV=0.0669, LB=N/A, Gap=N/A
  -> MLP + GBDT Ensemble

exp_013: CV=0.0827, LB=N/A, Gap=N/A
  -> LOO Ensemble

exp_014: CV=0.0834, LB=N/A, Gap=N/A
  -> Optuna Per-Target

exp_015: CV=0.0891, LB=N/A, Gap=N/A
  -> MLP + Per-Target COMBINED

exp_016: CV=0.0830, LB=N/A, Gap=N/A
  -> Hybrid Task-Specific - COMBINED features



In [3]:
# KEY INSIGHT: exp_004's architecture
print('='*80)
print('CRITICAL ARCHITECTURAL DIFFERENCE')
print('='*80)
print()
print('exp_004 (CV 0.0623 - BEST):')
print('  1. Train HGB on spange features -> spange_pred')
print('  2. Train HGB on acs_pca features -> acs_pred')
print('  3. Final prediction = 0.8 * acs_pred + 0.2 * spange_pred')
print()
print('exp_016 (CV 0.0830 - WORSE):')
print('  1. Combine features: combined = 0.8 * acs + 0.2 * spange')
print('  2. Train single HGB on combined features -> pred')
print()
print('WHY PREDICTION COMBINATION IS BETTER:')
print('  - Each model specializes in its feature space')
print('  - Ensemble of diverse models reduces variance')
print('  - Feature combination loses information through averaging')

CRITICAL ARCHITECTURAL DIFFERENCE

exp_004 (CV 0.0623 - BEST):
  1. Train HGB on spange features -> spange_pred
  2. Train HGB on acs_pca features -> acs_pred
  3. Final prediction = 0.8 * acs_pred + 0.2 * spange_pred

exp_016 (CV 0.0830 - WORSE):
  1. Combine features: combined = 0.8 * acs + 0.2 * spange
  2. Train single HGB on combined features -> pred

WHY PREDICTION COMBINATION IS BETTER:
  - Each model specializes in its feature space
  - Ensemble of diverse models reduces variance
  - Feature combination loses information through averaging
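
The two architectures can be sketched side by side. This is an illustration under stated assumptions, not the code of exp_004 or exp_016: `RidgeStandIn` is a tiny closed-form regressor standing in for HGB so the example needs only NumPy, the data is random, and the feature-level variant concatenates the two descriptor sets, which is a guess, since the 13-column and 5-column tables cannot be summed element-wise.

```python
import numpy as np


class RidgeStandIn:
    """Closed-form ridge regressor; a stand-in for HGB/ETR (assumption)."""

    def __init__(self, alpha=1e-3):
        self.alpha = alpha

    def fit(self, X, y):
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
        A = Xb.T @ Xb + self.alpha * np.eye(Xb.shape[1])
        self.coef_ = np.linalg.solve(A, Xb.T @ y)
        return self

    def predict(self, X):
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        return Xb @ self.coef_


def prediction_level(make_model, Xs, Xa, y, Xs_new, Xa_new, w_acs=0.8):
    """Separate model per feature set, then blend the predictions."""
    spange_pred = make_model().fit(Xs, y).predict(Xs_new)
    acs_pred = make_model().fit(Xa, y).predict(Xa_new)
    return w_acs * acs_pred + (1.0 - w_acs) * spange_pred


def feature_level(make_model, Xs, Xa, y, Xs_new, Xa_new):
    """Single model on the concatenated feature sets (assumed combination)."""
    X = np.hstack([Xs, Xa])
    X_new = np.hstack([Xs_new, Xa_new])
    return make_model().fit(X, y).predict(X_new)


rng = np.random.default_rng(0)
Xs, Xa = rng.normal(size=(40, 13)), rng.normal(size=(40, 5))  # synthetic features
y = rng.normal(size=40)                                        # synthetic target
Xs_new, Xa_new = rng.normal(size=(10, 13)), rng.normal(size=(10, 5))

pred_blend = prediction_level(RidgeStandIn, Xs, Xa, y, Xs_new, Xa_new)
pred_concat = feature_level(RidgeStandIn, Xs, Xa, y, Xs_new, Xa_new)
print(pred_blend.shape, pred_concat.shape)
```

Any object with `fit` and `predict` can be passed as `make_model`, so the same two functions could wrap the actual estimators when comparing both architectures on identical folds.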


In [4]:
# Calculate the expected improvement
print('='*80)
print('EXPECTED IMPROVEMENT FROM FIXING ARCHITECTURE')
print('='*80)

# exp_016 results
single_016 = 0.0647
full_016 = 0.0928

# exp_004 results (target)
single_004 = 0.0659
full_004 = 0.0603

print(f'\nSingle Solvent:')
print(f'  exp_016: {single_016:.4f}')
print(f'  exp_004: {single_004:.4f}')
print(f'  exp_016 is {(single_004-single_016)/single_004*100:.1f}% better (already good)')

print(f'\nFull Data:')
print(f'  exp_016: {full_016:.4f}')
print(f'  exp_004: {full_004:.4f}')
print(f'  exp_016 is {(full_016-full_004)/full_004*100:.1f}% WORSE (need to fix!)')

# If we fix full data to match exp_004
expected_combined = single_016 * 0.35 + full_004 * 0.65
print(f'\nExpected Combined if we fix full data:')
print(f'  {single_016:.4f} * 0.35 + {full_004:.4f} * 0.65 = {expected_combined:.4f}')
print(f'  This would be {(0.0623-expected_combined)/0.0623*100:.1f}% better than exp_004!')

EXPECTED IMPROVEMENT FROM FIXING ARCHITECTURE

Single Solvent:
  exp_016: 0.0647
  exp_004: 0.0659
  exp_016 is 1.8% better (already good)

Full Data:
  exp_016: 0.0928
  exp_004: 0.0603
  exp_016 is 53.9% WORSE (need to fix!)

Expected Combined if we fix full data:
  0.0647 * 0.35 + 0.0603 * 0.65 = 0.0618
  This would be 0.7% better than exp_004!
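
As a consistency check on the arithmetic above, the 0.35/0.65 weighting can be wrapped in a small helper and applied to both experiments. The function names are hypothetical; the weights and per-task scores are the ones used in the cell above. Reading 0.35 as the single-solvent weight and 0.65 as the full-data weight is taken from that cell, not from a documented metric definition.

```python
def combined_score(single, full, w_single=0.35):
    """Weighted combination of single-solvent and full-data CV scores."""
    return w_single * single + (1.0 - w_single) * full


def relative_improvement(new, reference):
    """Percent by which `new` is lower (better) than `reference`."""
    return (reference - new) / reference * 100.0


exp_004 = combined_score(single=0.0659, full=0.0603)
exp_016 = combined_score(single=0.0647, full=0.0928)
fixed = combined_score(single=0.0647, full=0.0603)  # exp_016 single + exp_004 full

print(f'{exp_004:.4f} {exp_016:.4f} {fixed:.4f}')
# → 0.0623 0.0830 0.0618
```

Both reconstructed values match the headline CV scores of exp_004 and exp_016, which supports the reading that the overall CV is this weighted sum of the two task scores.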


In [5]:
# Analyze CV-LB gap
print('='*80)
print('CV-LB GAP ANALYSIS')
print('='*80)

submissions = [
    ('exp_004', 0.0623, 0.0956),
    ('exp_006', 0.0688, 0.0991),
]

for exp, cv, lb in submissions:
    gap = (lb - cv) / cv * 100
    print(f'{exp}: CV={cv:.4f}, LB={lb:.4f}, Gap={gap:.1f}%')

print(f'\nThe CV-LB gap is 44-54% across the two submissions.')
print(f'This suggests the LOO validation is optimistic.')
print(f'\nHowever, the RELATIVE ranking is preserved:')
print(f'  exp_004 (CV 0.0623) -> LB 0.0956')
print(f'  exp_006 (CV 0.0688) -> LB 0.0991')
print(f'  Better CV = Better LB (correlation holds)')

CV-LB GAP ANALYSIS
exp_004: CV=0.0623, LB=0.0956, Gap=53.5%
exp_006: CV=0.0688, LB=0.0991, Gap=44.0%

The CV-LB gap is 44-54% across the two submissions.
This suggests the LOO validation is optimistic.

However, the RELATIVE ranking is preserved:
  exp_004 (CV 0.0623) -> LB 0.0956
  exp_006 (CV 0.0688) -> LB 0.0991
  Better CV = Better LB (correlation holds)
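
One rough way to use the observed gaps is to bracket the leaderboard score a new CV value might translate to. The sketch below assumes the relative gap of a future submission falls between the two gaps seen so far, which is a strong assumption with only two data points; the helper names are hypothetical, and 0.0618 is the expected combined CV computed in the earlier cell.

```python
def relative_gap(cv, lb):
    """Relative CV-to-LB gap, e.g. 0.5 means LB is 50% above CV."""
    return (lb - cv) / cv


def project_lb_range(cv, observed):
    """Bracket the LB score for `cv` using the smallest and largest observed gap."""
    gaps = [relative_gap(c, l) for c, l in observed]
    return cv * (1.0 + min(gaps)), cv * (1.0 + max(gaps))


def ranking_preserved(observed):
    """True when sorting by CV and sorting by LB give the same order."""
    by_cv = sorted(observed, key=lambda pair: pair[0])
    by_lb = sorted(observed, key=lambda pair: pair[1])
    return by_cv == by_lb


submitted = [(0.0623, 0.0956), (0.0688, 0.0991)]  # (CV, LB) for exp_004, exp_006
low, high = project_lb_range(0.0618, submitted)
print(f'projected LB range: {low:.4f} to {high:.4f}')
print(f'ranking preserved: {ranking_preserved(submitted)}')
```

The range is only as trustworthy as the two submissions behind it, so it is better read as a sanity bound than as a forecast.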


In [6]:
# Next steps
print('='*80)
print('RECOMMENDED NEXT STEPS')
print('='*80)
print()
print('1. REPLICATE exp_004 EXACTLY for full data:')
print('   - Train SEPARATE models on spange and acs_pca')
print('   - Combine PREDICTIONS: 0.8 * acs_pred + 0.2 * spange_pred')
print('   - Use same hyperparameters: HGB(depth=7, iter=700, lr=0.04)')
print('   - Use same hyperparameters: ETR(n_estimators=500, depth=10)')
print()
print('2. Keep exp_015 approach for single solvent:')
print('   - Deep models + MLP work better for single solvent')
print('   - CV 0.0638 vs exp_004 0.0659')
print()
print('3. Create TRUE hybrid:')
print('   - Single: exp_015 approach (deep + MLP)')
print('   - Full: exp_004 approach (PREDICTION combination)')
print('   - Expected: 0.0638 * 0.35 + 0.0603 * 0.65 = 0.0615')

RECOMMENDED NEXT STEPS

1. REPLICATE exp_004 EXACTLY for full data:
   - Train SEPARATE models on spange and acs_pca
   - Combine PREDICTIONS: 0.8 * acs_pred + 0.2 * spange_pred
   - Use same hyperparameters: HGB(depth=7, iter=700, lr=0.04)
   - Use same hyperparameters: ETR(n_estimators=500, depth=10)

2. Keep exp_015 approach for single solvent:
   - Deep models + MLP work better for single solvent
   - CV 0.0638 vs exp_004 0.0659

3. Create TRUE hybrid:
   - Single: exp_015 approach (deep + MLP)
   - Full: exp_004 approach (PREDICTION combination)
   - Expected: 0.0638 * 0.35 + 0.0603 * 0.65 = 0.0615
