# Loop 22 Analysis: Remaining Approaches

## Current State
- **Best LB**: 0.0913 (exp_012 - MLP[32,16] + LightGBM ensemble)
- **Best CV**: 0.008785 (exp_011)
- **Target**: 0.0333
- **Gap**: 2.74x to target

## Recent Experiments (Both Negative)
1. **Attention Model (exp_021)**: CV 0.023357 - 159% WORSE than baseline (CV ≈ 0.0090, matching exp_012)
2. **Fragprints (exp_020)**: CV 0.009749 - 8.28% WORSE than baseline

## Remaining Unexplored Approaches
1. **ACS PCA features** (5 features) - NOT YET TRIED
2. **Per-target models** - Separate models for Product 2, Product 3, SM
3. **Stacking** - Meta-learner on out-of-fold predictions
4. **LightGBM hyperparameter tuning**
5. **Polynomial features** - Interaction terms
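
Item 5 can be sketched in a few lines. The example below builds pairwise interaction terms from a small descriptor table. The column names `alpha`, `n`, and `SB` are Spange columns that appear later in this notebook; the values are placeholders and `add_interactions` is a hypothetical helper, not part of any existing experiment.

```python
from itertools import combinations

import pandas as pd


def add_interactions(df: pd.DataFrame) -> pd.DataFrame:
    """Return df plus one product column for every pair of original columns."""
    out = df.copy()
    for a, b in combinations(df.columns, 2):
        out[f"{a} x {b}"] = df[a] * df[b]
    return out


# Placeholder descriptor values, indexed by solvent name like the lookup tables.
toy = pd.DataFrame(
    {"alpha": [1.0, 0.0, 0.5], "n": [2.0, 3.0, 4.0], "SB": [0.5, 0.1, 0.2]},
    index=["Methanol", "Cyclohexane", "Acetonitrile"],
)
expanded = add_interactions(toy)
print(expanded.columns.tolist())
```

With all 13 Spange columns this would add 78 pairwise terms, which is a lot for 24 solvents, so restricting the pairs or regularizing strongly is probably necessary.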

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data
DATA_PATH = '/home/data'

# Load all feature files
spange_df = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv', index_col=0)
drfp_df = pd.read_csv(f'{DATA_PATH}/drfps_catechol_lookup.csv', index_col=0)
acs_pca_df = pd.read_csv(f'{DATA_PATH}/acs_pca_descriptors_lookup.csv', index_col=0)
fragprints_df = pd.read_csv(f'{DATA_PATH}/fragprints_lookup.csv', index_col=0)

print('Feature dimensions:')
print(f'  Spange: {spange_df.shape}')
print(f'  DRFP: {drfp_df.shape}')
print(f'  ACS PCA: {acs_pca_df.shape}')
print(f'  Fragprints: {fragprints_df.shape}')

Feature dimensions:
  Spange: (26, 13)
  DRFP: (24, 2048)
  ACS PCA: (24, 5)
  Fragprints: (24, 2133)


In [2]:
# Analyze ACS PCA features
print('\n=== ACS PCA DESCRIPTORS ===')
print(acs_pca_df.head(10))
print(f'\nSolvents in ACS PCA: {len(acs_pca_df)}')
print(f'Solvents in Spange: {len(spange_df)}')
print(f'Solvents in DRFP: {len(drfp_df)}')

# Check overlap
acs_solvents = set(acs_pca_df.index)
spange_solvents = set(spange_df.index)
drfp_solvents = set(drfp_df.index)

print(f'\nOverlap ACS PCA & Spange: {len(acs_solvents & spange_solvents)}')
print(f'Overlap ACS PCA & DRFP: {len(acs_solvents & drfp_solvents)}')
print(f'Overlap all three: {len(acs_solvents & spange_solvents & drfp_solvents)}')


=== ACS PCA DESCRIPTORS ===
                                         PC1       PC2        PC3       PC4  \
SOLVENT NAME                                                                  
Methanol                           -8.726510 -5.312650   1.251120 -1.859310   
Ethylene Glycol [1,2-Ethanediol]  -10.710100 -1.886060   4.373270  1.451920   
1,1,1,3,3,3-Hexafluoropropan-2-ol  -7.267620 -8.207680  11.964500 -1.998060   
2-Methyltetrahydrofuran [2-MeTHF]   2.050090  2.489030  -4.419740  2.447560   
Cyclohexane                         5.478720 -5.682020  -0.565812 -2.177570   
IPA [Propan-2-ol]                  -5.264440 -2.310880   0.993320 -2.398620   
Water.Acetonitrile                -12.129124 -7.183092   2.023802 -0.045987   
Acetonitrile                       -4.651060 -4.748640  -3.281650 -0.304043   
Acetonitrile.Acetic Acid           -6.284980 -4.785435   0.664375 -0.393719   
Diethyl Ether [Ether]               0.741130 -2.096670  -2.271440 -3.986440   

                      

In [3]:
# Check which solvents are in single solvent data
single_df = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
full_df = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

single_solvents = set(single_df['SOLVENT NAME'].unique())
full_solvents_a = set(full_df['SOLVENT A NAME'].unique())
full_solvents_b = set(full_df['SOLVENT B NAME'].unique())
all_data_solvents = single_solvents | full_solvents_a | full_solvents_b

print(f'\nSolvents in single data: {len(single_solvents)}')
print(f'Solvents in full data (A): {len(full_solvents_a)}')
print(f'Solvents in full data (B): {len(full_solvents_b)}')
print(f'All unique solvents in data: {len(all_data_solvents)}')

# Check if ACS PCA covers all data solvents
missing_in_acs = all_data_solvents - acs_solvents
print(f'\nSolvents missing in ACS PCA: {missing_in_acs}')
if missing_in_acs:
    print('WARNING: ACS PCA does not cover all solvents!')


Solvents in single data: 24
Solvents in full data (A): 13
Solvents in full data (B): 13
All unique solvents in data: 24

Solvents missing in ACS PCA: set()
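
Because the coverage check comes back empty, the ACS PCA lookup can be joined onto the experiment rows without imputation. The following is a minimal sketch on toy frames that mimic `single_df` and `acs_pca_df`: the PC values are rounded from the output above, the `Product 2` values are made up, and the `acs_` prefix is an arbitrary naming choice.

```python
import pandas as pd

# Toy stand-ins for single_df and acs_pca_df.
single_toy = pd.DataFrame(
    {"SOLVENT NAME": ["Methanol", "Cyclohexane", "Methanol"],
     "Product 2": [0.10, 0.02, 0.12]}  # made-up yields
)
acs_toy = pd.DataFrame(
    {"PC1": [-8.7, 5.5], "PC2": [-5.3, -5.7]},
    index=pd.Index(["Methanol", "Cyclohexane"], name="SOLVENT NAME"),
)

# Prefix the columns so they cannot collide with other descriptor sets.
acs_feats = acs_toy.add_prefix("acs_")

# Left join keeps every experiment row; validate guards against duplicate lookup keys.
merged = single_toy.merge(
    acs_feats, left_on="SOLVENT NAME", right_index=True,
    how="left", validate="many_to_one",
)

# Any NaN here would mean a solvent without ACS PCA coverage.
assert not merged[acs_feats.columns].isna().any().any()
print(merged)
```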


In [4]:
# Analyze correlation between ACS PCA and Spange
common_solvents = list(acs_solvents & spange_solvents)
print(f'\n=== CORRELATION ANALYSIS ===')
print(f'Common solvents: {len(common_solvents)}')

# Get aligned features
acs_aligned = acs_pca_df.loc[common_solvents]
spange_aligned = spange_df.loc[common_solvents]

# Compute correlation matrix between ACS PCA and Spange
from scipy.stats import pearsonr

print('\nCorrelation between ACS PCA and Spange features:')
for acs_col in acs_pca_df.columns:
    max_corr = 0
    max_spange_col = None
    for spange_col in spange_df.columns:
        corr, _ = pearsonr(acs_aligned[acs_col], spange_aligned[spange_col])
        if abs(corr) > abs(max_corr):
            max_corr = corr
            max_spange_col = spange_col
    print(f'  {acs_col}: max corr {max_corr:.3f} with {max_spange_col}')


=== CORRELATION ANALYSIS ===
Common solvents: 24



Correlation between ACS PCA and Spange features:
  PC1: max corr -0.798 with delta
  PC2: max corr 0.750 with n
  PC3: max corr 0.944 with alpha
  PC4: max corr 0.605 with pi*
  PC5: max corr -0.305 with SB
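
The loop above reports only the single strongest Spange partner for each PC. A full cross-correlation table makes the overlap easier to judge. The sketch below runs on synthetic data; `cross_corr` is a hypothetical helper and the column names `x` and `y` are placeholders, not real Spange descriptors.

```python
import numpy as np
import pandas as pd


def cross_corr(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation of every column of a with every column of b."""
    b = b.loc[a.index]                       # align rows by label
    za = (a - a.mean()) / a.std(ddof=0)      # z-score each column
    zb = (b - b.mean()) / b.std(ddof=0)
    return za.T @ zb / len(a)                # rows: a's columns, cols: b's columns


rng = np.random.default_rng(0)
idx = [f"solvent_{i}" for i in range(24)]
acs_toy = pd.DataFrame(rng.normal(size=(24, 2)), index=idx, columns=["PC1", "PC2"])
spange_toy = pd.DataFrame(
    {"x": -2.0 * acs_toy["PC1"] + 1.0,       # exact linear function of PC1
     "y": rng.normal(size=24)},              # unrelated noise
    index=idx,
)

table = cross_corr(acs_toy, spange_toy)
print(table.round(3))
print(table.abs().idxmax(axis=1))            # strongest partner per PC
```

Applied to `acs_aligned` and `spange_aligned`, the row-wise `idxmax` should reproduce the partners printed above, while the full table also shows how many other descriptors each PC overlaps with.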


In [5]:
# Per-target analysis: target distributions and correlations only
# (per-target scores from exp_012 are not loaded in this cell)
print('\n=== PER-TARGET ANALYSIS ===')
print('Looking at exp_012 results to understand per-target performance...')

# Load single solvent data to analyze target distributions
Y = single_df[['Product 2', 'Product 3', 'SM']]
print(f'\nTarget statistics (single solvent):')
print(Y.describe())

# Check correlation between targets
print(f'\nTarget correlations:')
print(Y.corr())


=== PER-TARGET ANALYSIS ===
Looking at exp_012 results to understand per-target performance...

Target statistics (single solvent):
        Product 2   Product 3          SM
count  656.000000  656.000000  656.000000
mean     0.149932    0.123380    0.522192
std      0.143136    0.131528    0.360229
min      0.000000    0.000000    0.000000
25%      0.012976    0.009445    0.145001
50%      0.102813    0.078298    0.656558
75%      0.281654    0.193353    0.857019
max      0.463632    0.533768    1.000000

Target correlations:
           Product 2  Product 3        SM
Product 2   1.000000   0.923287 -0.890121
Product 3   0.923287   1.000000 -0.767935
SM         -0.890121  -0.767935  1.000000
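
The statistics above show SM with a much larger spread (std 0.36) than either product (0.14 and 0.13), which is one argument for giving each target its own model and hyperparameters. The sketch below illustrates only the structure of that idea: closed-form ridge regression stands in for the MLP + LightGBM pair, the data are synthetic, and the per-target `alphas` are arbitrary.

```python
import numpy as np


def fit_ridge(X: np.ndarray, y: np.ndarray, alpha: float) -> np.ndarray:
    """Closed-form ridge weights; the last entry is an unpenalized intercept."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0                    # do not shrink the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def predict(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                # synthetic features
Y = {                                        # synthetic targets, named as in the data
    "Product 2": X @ np.array([0.3, 0.0, 0.1, 0.0]) + 0.01 * rng.normal(size=200),
    "Product 3": X @ np.array([0.2, 0.1, 0.0, 0.0]) + 0.01 * rng.normal(size=200),
    "SM": X @ np.array([-0.5, -0.1, -0.1, 0.2]) + 0.01 * rng.normal(size=200),
}
alphas = {"Product 2": 0.1, "Product 3": 0.1, "SM": 1.0}   # arbitrary per-target settings

# One independent model per target, each with its own hyperparameter.
models = {t: fit_ridge(X, y, alphas[t]) for t, y in Y.items()}
preds = np.column_stack([predict(X, models[t]) for t in Y])
mse = {t: float(np.mean((predict(X, models[t]) - Y[t]) ** 2)) for t in Y}
print(mse)
```

Given the strong target correlations, independent models give up whatever a shared representation provides, so the comparison against the joint model should use the same CV splits.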


In [6]:
# Analyze CV-LB relationship
print('\n=== CV-LB RELATIONSHIP ===')
submissions = [
    ('exp_000', 0.0111, 0.0982),
    ('exp_001', 0.0123, 0.1065),
    ('exp_003', 0.0105, 0.0972),
    ('exp_005', 0.0104, 0.0969),
    ('exp_006', 0.0097, 0.0946),
    ('exp_007', 0.0093, 0.0932),
    ('exp_009', 0.0092, 0.0936),
    ('exp_012', 0.0090, 0.0913),
]

cv_scores = [s[1] for s in submissions]
lb_scores = [s[2] for s in submissions]

# Linear fit
from scipy.stats import linregress
slope, intercept, r_value, p_value, std_err = linregress(cv_scores, lb_scores)
print(f'Linear fit: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'\nTo reach target LB 0.0333:')
required_cv = (0.0333 - intercept) / slope
print(f'  Required CV: {required_cv:.6f}')
if required_cv < 0:
    print('  WARNING: the linear fit implies a negative CV, so the target is unreachable unless the CV-LB relationship changes')


=== CV-LB RELATIONSHIP ===
Linear fit: LB = 4.05 * CV + 0.0551
R² = 0.9477

To reach target LB 0.0333:
  Required CV: -0.005386
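
The negative required CV follows directly from the intercept: under this fit, even a CV of exactly 0 maps to an LB of about 0.0551, which is still above the 0.0333 target. The sketch below recomputes the fit with `numpy.polyfit` from the same eight submissions and makes that explicit. It assumes the linear relationship continues to hold far outside the observed CV range (0.0090 to 0.0123), which eight points cannot establish.

```python
import numpy as np

# (CV, LB) pairs copied from the submissions list above.
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093, 0.0092, 0.0090])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932, 0.0936, 0.0913])
target = 0.0333

slope, intercept = np.polyfit(cv, lb, deg=1)   # coefficients, highest power first
lb_at_zero_cv = intercept                      # fitted LB if CV were exactly 0
required_cv = (target - intercept) / slope     # CV the fit says is needed

print(f"slope={slope:.2f}, intercept={intercept:.4f}")
print(f"fitted LB at CV=0: {lb_at_zero_cv:.4f} (target {target})")
print(f"required CV: {required_cv:.6f}")
```

Read this way, the result says that lowering CV alone cannot close the gap under the current CV-LB relationship; an approach that lowers the intercept (that is, one that generalizes differently to the leaderboard data) would be needed.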


In [7]:
# Summary of remaining approaches
print('\n=== REMAINING APPROACHES TO TRY ===')
print()
print('1. ACS PCA FEATURES (5 features)')
print('   - Status: NOT YET TRIED')
print('   - Covers 24 solvents (same as DRFP)')
print('   - PC1-PC3 track single Spange descriptors (max |r| 0.75-0.94); PC4 and PC5 (max |r| 0.61, 0.31) are the likeliest to add new information')
print('   - Quick experiment: add to existing feature set')
print()
print('2. PER-TARGET MODELS')
print('   - Status: NOT YET TRIED')
print('   - Targets differ in scale (SM std 0.36 vs 0.14/0.13); SM is anti-correlated with both products (r = -0.89, -0.77)')
print('   - Competition rules explicitly allow different hyperparams per target')
print('   - Could train separate MLP+LGBM for each of Product 2, Product 3, SM')
print()
print('3. STACKING')
print('   - Status: NOT YET TRIED')
print('   - Current: fixed weights (0.6 MLP, 0.4 LGBM)')
print('   - Could train meta-learner on out-of-fold predictions')
print()
print('4. LIGHTGBM HYPERPARAMETER TUNING')
print('   - Status: NOT YET TRIED')
print('   - Current params may not be optimal')
print('   - Quick grid search could help')
print()
print('5. POLYNOMIAL FEATURES')
print('   - Status: NOT YET TRIED')
print('   - Interaction terms between Spange descriptors')
print('   - Could capture non-linear relationships')


=== REMAINING APPROACHES TO TRY ===

1. ACS PCA FEATURES (5 features)
   - Status: NOT YET TRIED
   - Covers 24 solvents (same as DRFP)
   - PC1-PC3 track single Spange descriptors (max |r| 0.75-0.94); PC4 and PC5 (max |r| 0.61, 0.31) are the likeliest to add new information
   - Quick experiment: add to existing feature set

2. PER-TARGET MODELS
   - Status: NOT YET TRIED
   - Targets differ in scale (SM std 0.36 vs 0.14/0.13); SM is anti-correlated with both products (r = -0.89, -0.77)
   - Competition rules explicitly allow different hyperparams per target
   - Could train separate MLP+LGBM for each of Product 2, Product 3, SM

3. STACKING
   - Status: NOT YET TRIED
   - Current: fixed weights (0.6 MLP, 0.4 LGBM)
   - Could train meta-learner on out-of-fold predictions

4. LIGHTGBM HYPERPARAMETER TUNING
   - Status: NOT YET TRIED
   - Current params may not be optimal
   - Quick grid search could help

5. POLYNOMIAL FEATURES
   - Status: NOT YET TRIED
   - Interaction terms between Spange descriptors
   - Could capture non-linear relationships
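
For the stacking item, the simplest meta-learner is a linear blend whose weights are fit on out-of-fold predictions instead of being fixed at 0.6 and 0.4. The sketch below uses synthetic data throughout: `oof_mlp` and `oof_lgbm` are hypothetical arrays standing in for real out-of-fold predictions, and the noise levels are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.uniform(0.0, 1.0, size=500)

# Hypothetical out-of-fold predictions: the truth plus independent noise of different size.
oof_mlp = y_true + rng.normal(scale=0.05, size=500)
oof_lgbm = y_true + rng.normal(scale=0.10, size=500)
P = np.column_stack([oof_mlp, oof_lgbm])

# Linear meta-learner: least-squares blend weights, no intercept.
w, *_ = np.linalg.lstsq(P, y_true, rcond=None)


def mse(pred: np.ndarray) -> float:
    return float(np.mean((pred - y_true) ** 2))


mse_fixed = mse(P @ np.array([0.6, 0.4]))    # the current fixed blend
mse_learned = mse(P @ w)                     # blend with fitted weights
print("weights:", w.round(3))
print("fixed blend MSE:", mse_fixed, "learned blend MSE:", mse_learned)
```

The learned blend can never be worse than the fixed one on the data it was fit to, so the honest comparison is on folds the meta-learner has not seen. With only two base models the gain may be small.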
