# Loop 1 Analysis: Understanding the Gap to Target

**Current Status:**
- Best CV: 0.081393 (exp_000)
- Target: 0.017270
- Gap: ~4.7x

**Goals:**
1. Analyze per-fold errors to identify which solvents are hardest
2. Understand the relationship between targets
3. Identify potential improvements

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data
DATA_PATH = '/home/data'
full_data = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')
single_data = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
spange = pd.read_csv(f'{DATA_PATH}/spange_descriptors_lookup.csv', index_col=0)

print(f"Single solvent: {single_data.shape}")
print(f"Full data: {full_data.shape}")
print(f"Unique solvents (single): {single_data['SOLVENT NAME'].nunique()}")
print(f"Unique ramps (full): {full_data[['SOLVENT A NAME', 'SOLVENT B NAME']].drop_duplicates().shape[0]}")

Single solvent: (656, 13)
Full data: (1227, 19)
Unique solvents (single): 24
Unique ramps (full): 13


In [2]:
# Analyze target correlations
print("Target correlations (single solvent):")
print(single_data[['SM', 'Product 2', 'Product 3']].corr())

print("\nTarget correlations (full data):")
print(full_data[['SM', 'Product 2', 'Product 3']].corr())

# Check whether the three targets sum to 1 (i.e. whether mass balance closes)
print("\nTarget sum statistics:")
single_data['target_sum'] = single_data['SM'] + single_data['Product 2'] + single_data['Product 3']
print(single_data['target_sum'].describe())

Target correlations (single solvent):
                 SM  Product 2  Product 3
SM         1.000000  -0.890121  -0.767935
Product 2 -0.890121   1.000000   0.923287
Product 3 -0.767935   0.923287   1.000000

Target correlations (full data):
                 SM  Product 2  Product 3
SM         1.000000  -0.870110  -0.772187
Product 2 -0.870110   1.000000   0.932859
Product 3 -0.772187   0.932859   1.000000

Target sum statistics:
count    656.000000
mean       0.795504
std        0.194306
min        0.028752
25%        0.708417
50%        0.849648
75%        0.927955
max        1.000000
Name: target_sum, dtype: float64
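
The sum statistics above show that the three yields never exceed 1.0 in total and average about 0.80, so the targets are bounded but do not form a closed composition. One way to use this is to post-process model predictions so they respect the same bounds. The sketch below is a minimal illustration, not part of the original pipeline: the function name `enforce_yield_constraints` and the example predictions are hypothetical, and it assumes the `sum <= 1` bound seen in the single-solvent data also holds for held-out solvents.

```python
import numpy as np

def enforce_yield_constraints(preds: np.ndarray) -> np.ndarray:
    """Clip each predicted yield to [0, 1], then shrink rows whose total exceeds 1.

    Hypothetical helper: assumes columns are fractional yields (SM, Product 2, Product 3).
    """
    clipped = np.clip(preds, 0.0, 1.0)
    totals = clipped.sum(axis=1, keepdims=True)
    # Dividing by max(total, 1) rescales only rows that violate the bound;
    # rows that already sum to <= 1 are divided by 1 and left unchanged.
    return clipped / np.maximum(totals, 1.0)

# Made-up predictions: the first row sums to 1.2, the second has a negative yield.
raw = np.array([[0.9, 0.2, 0.1],
                [0.5, -0.05, 0.1]])
fixed = enforce_yield_constraints(raw)
print(fixed)
print(fixed.sum(axis=1))
```

Rows are deliberately not renormalized up to 1, because the observed sums are often well below 1.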


In [3]:
# Analyze per-solvent statistics to understand which solvents might be harder
solvent_stats = single_data.groupby('SOLVENT NAME').agg({
    'SM': ['mean', 'std'],
    'Product 2': ['mean', 'std'],
    'Product 3': ['mean', 'std'],
    'Temperature': 'count'
}).round(4)
solvent_stats.columns = ['SM_mean', 'SM_std', 'P2_mean', 'P2_std', 'P3_mean', 'P3_std', 'count']
solvent_stats = solvent_stats.sort_values('SM_mean', ascending=False)
print("Per-solvent statistics (sorted by SM mean):")
print(solvent_stats)

Per-solvent statistics (sorted by SM mean):
                                    SM_mean  SM_std  P2_mean  P2_std  P3_mean  \
SOLVENT NAME                                                                    
MTBE [tert-Butylmethylether]         0.8793  0.0403   0.0441  0.0174   0.0357   
Dimethyl Carbonate                   0.8718  0.0454   0.0581  0.0197   0.0314   
Diethyl Ether [Ether]                0.8040  0.2165   0.0811  0.0966   0.0631   
Methyl Propionate                    0.7189  0.0377   0.0261  0.0110   0.0193   
Butanone [MEK]                       0.7169  0.0537   0.0472  0.0184   0.0430   
tert-Butanol [2-Methylpropan-2-ol]   0.6988  0.0678   0.0723  0.0285   0.0653   
Ethyl Acetate                        0.6934  0.0557   0.0430  0.0142   0.0397   
Ethyl Lactate                        0.6590  0.0863   0.1275  0.0428   0.1367   
Dihydrolevoglucosenone (Cyrene)      0.6233  0.1153   0.1690  0.0687   0.1408   
Acetonitrile                         0.5805  0.3994   0.1564  0.1

In [4]:
# Rank solvents by the summed per-target standard deviation
# (larger spread in yields may make a solvent harder to predict)
solvent_stats['total_std'] = solvent_stats['SM_std'] + solvent_stats['P2_std'] + solvent_stats['P3_std']
print("\nSolvents with highest variance (potentially harder to predict):")
print(solvent_stats.sort_values('total_std', ascending=False)[['SM_std', 'P2_std', 'P3_std', 'total_std']].head(10))


Solvents with highest variance (potentially harder to predict):
                                   SM_std  P2_std  P3_std  total_std
SOLVENT NAME                                                        
IPA [Propan-2-ol]                  0.4526  0.2195  0.2280     0.9001
Decanol                            0.4313  0.1747  0.1844     0.7904
Ethylene Glycol [1,2-Ethanediol]   0.3897  0.1796  0.2141     0.7834
Water.Acetonitrile                 0.3772  0.1708  0.1531     0.7011
Ethanol                            0.3788  0.1515  0.1660     0.6963
Methanol                           0.3900  0.1368  0.1206     0.6474
Acetonitrile                       0.3994  0.1476  0.0902     0.6372
DMA [N,N-Dimethylacetamide]        0.3964  0.1265  0.1026     0.6255
Water.2,2,2-Trifluoroethanol       0.3539  0.1278  0.0835     0.5652
2-Methyltetrahydrofuran [2-MeTHF]  0.3257  0.1459  0.0884     0.5600


In [5]:
# Check the Spange descriptors to understand solvent diversity
print("Spange descriptor statistics:")
print(spange.describe())

# Check which solvents are most different from others
from sklearn.preprocessing import StandardScaler
spange_scaled = StandardScaler().fit_transform(spange.values)
spange_df = pd.DataFrame(spange_scaled, index=spange.index, columns=spange.columns)

# Euclidean distance of each solvent from the centroid in standardized descriptor space
centroid = spange_df.mean(axis=0)
distances = np.sqrt(((spange_df - centroid) ** 2).sum(axis=1))
print("\nSolvents furthest from centroid (most unusual):")
print(distances.sort_values(ascending=False).head(10))

Spange descriptor statistics:
       dielectric constant     ET(30)      alpha       beta        pi*  \
count            26.000000  26.000000  26.000000  26.000000  26.000000   
mean             20.550462  46.531923   0.528423   0.481115   0.615192   
std              20.176418   9.496767   0.571158   0.235834   0.235445   
min               2.020000  30.900000   0.000000   0.000000   0.000000   
25%               6.170000  38.450000   0.000000   0.432500   0.502500   
50%              14.100000  44.450000   0.395000   0.460000   0.590000   
75%              30.650000  54.525000   0.890000   0.545000   0.730000   
max              80.100000  63.100000   1.960000   0.930000   1.090000   

              SA         SB         SP        SdP          N          n  \
count  26.000000  26.000000  26.000000  26.000000  26.000000  26.000000   
mean    0.325642   0.456815   0.660900   0.749569   0.016262   1.374077   
std     0.372702   0.261823   0.062314   0.279132   0.011700   0.044894   
min


Solvents furthest from centroid (most unusual):
SOLVENT NAME
Water                                7.024456
1,1,1,3,3,3-Hexafluoropropan-2-ol    6.698806
Cyclohexane                          5.527855
Water.2,2,2-Trifluoroethanol         5.227873
2,2,2-Trifluoroethanol               4.921644
Water.Acetonitrile                   4.596691
Dihydrolevoglucosenone (Cyrene)      3.768638
Decanol                              3.750947
DMA [N,N-Dimethylacetamide]          3.707034
Ethylene Glycol [1,2-Ethanediol]     3.653196
dtype: float64
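
Distance from the centroid measures how atypical a solvent is overall. Under leave-one-solvent-out validation, a complementary question is how far each solvent sits from its nearest neighbour, since that neighbour is the closest training example when the solvent is held out. The sketch below illustrates the calculation on a small made-up descriptor table; the solvent labels `A` to `D` and their values are placeholders, not rows from `spange_descriptors_lookup.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical two-descriptor table standing in for the Spange lookup.
desc = pd.DataFrame(
    {"alpha": [0.0, 0.1, 1.2, 1.1], "beta": [0.40, 0.45, 0.50, 0.00]},
    index=["A", "B", "C", "D"],
)

scaled = StandardScaler().fit_transform(desc.values)

# Pairwise Euclidean distances via broadcasting: shape (n, n).
diff = scaled[:, None, :] - scaled[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Ignore each solvent's zero distance to itself.
np.fill_diagonal(dist, np.inf)

nearest = pd.Series(dist.min(axis=1), index=desc.index,
                    name="nearest_neighbour_distance")
print(nearest.sort_values(ascending=False))
```

With the real table, replacing `desc` with `spange` should give one value per solvent. A large value suggests the solvent has no close analogue among the others.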


In [6]:
# Analyze the relationship between temperature, time, and yields
print("Correlation with Temperature and Residence Time:")
for col in ['SM', 'Product 2', 'Product 3']:
    temp_corr = single_data['Temperature'].corr(single_data[col])
    time_corr = single_data['Residence Time'].corr(single_data[col])
    print(f"{col}: Temp={temp_corr:.3f}, Time={time_corr:.3f}")

# Candidate non-linear features to try (suggestions only; not evaluated in this cell)
print("\nNon-linear features might help:")
print("- 1/Temperature (Arrhenius)")
print("- log(Time)")
print("- Temperature * Time interaction")

Correlation with Temperature and Residence Time:
SM: Temp=-0.817, Time=-0.275
Product 2: Temp=0.723, Time=0.252
Product 3: Temp=0.573, Time=0.210

Non-linear features might help:
- 1/Temperature (Arrhenius)
- log(Time)
- Temperature * Time interaction
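
The three candidate features listed above could be built along the following lines. This is a sketch under stated assumptions: the helper name `add_kinetic_features` is hypothetical, `Temperature` is assumed to be in degrees Celsius (hence the Kelvin conversion for the Arrhenius term), `Residence Time` is assumed to be strictly positive so the logarithm is defined, and the demo values are invented.

```python
import numpy as np
import pandas as pd

def add_kinetic_features(df, temp_col="Temperature", time_col="Residence Time",
                         temp_in_celsius=True):
    """Return a copy of df with 1/T, log(time) and T*time columns appended."""
    out = df.copy()
    # Arrhenius behaviour is linear in 1/T with T in Kelvin (unit is an assumption).
    temp_k = out[temp_col] + 273.15 if temp_in_celsius else out[temp_col]
    out["inv_temp"] = 1.0 / temp_k
    # log(time) requires strictly positive residence times.
    out["log_time"] = np.log(out[time_col])
    # Simple multiplicative interaction on the raw columns.
    out["temp_x_time"] = out[temp_col] * out[time_col]
    return out

# Made-up rows for illustration only.
demo = pd.DataFrame({"Temperature": [175.0, 225.0], "Residence Time": [2.0, 15.0]})
features = add_kinetic_features(demo)
print(features)
```

Applied to the notebook's data, this would be called as `add_kinetic_features(single_data)`, provided the unit assumption holds.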


In [7]:
# Check the full data ramps
ramp_stats = full_data.groupby(['SOLVENT A NAME', 'SOLVENT B NAME']).agg({
    'SM': ['mean', 'std'],
    'Product 2': ['mean', 'std'],
    'Product 3': ['mean', 'std'],
    'Temperature': 'count'
}).round(4)
ramp_stats.columns = ['SM_mean', 'SM_std', 'P2_mean', 'P2_std', 'P3_mean', 'P3_std', 'count']
print("Per-ramp statistics:")
print(ramp_stats)

Per-ramp statistics:
                                                                      SM_mean  \
SOLVENT A NAME                     SOLVENT B NAME                               
1,1,1,3,3,3-Hexafluoropropan-2-ol  2-Methyltetrahydrofuran [2-MeTHF]   0.3607   
2,2,2-Trifluoroethanol             Water.2,2,2-Trifluoroethanol        0.3193   
2-Methyltetrahydrofuran [2-MeTHF]  Diethyl Ether [Ether]               0.6450   
Acetonitrile                       Acetonitrile.Acetic Acid            0.5040   
Cyclohexane                        IPA [Propan-2-ol]                   0.5627   
DMA [N,N-Dimethylacetamide]        Decanol                             0.4246   
Dihydrolevoglucosenone (Cyrene)    Ethyl Acetate                       0.6584   
Ethanol                            THF [Tetrahydrofuran]               0.5166   
MTBE [tert-Butylmethylether]       Butanone [MEK]                      0.7933   
Methanol                           Ethylene Glycol [1,2-Ethanediol]    0.4100   
Methyl 

In [8]:
# Key insight: the target (0.017270) is ~4.7x lower than our best CV (0.081393),
# which suggests incremental tuning alone is unlikely to close the gap.
CURRENT_CV = 0.081393
TARGET_CV = 0.017270

print("=" * 60)
print("KEY INSIGHTS FOR IMPROVEMENT")
print("=" * 60)

print("\n1. CURRENT GAP ANALYSIS:")
print(f"   - Current CV: {CURRENT_CV:.6f}")
print(f"   - Target: {TARGET_CV:.6f}")
print(f"   - Gap: {CURRENT_CV / TARGET_CV:.1f}x")

print("\n2. POTENTIAL IMPROVEMENTS:")
print("   a) Better feature engineering:")
print("      - Try DRFP (2048-dim) or fragprints (2133-dim)")
print("      - Add more physics-based features")
print("   b) Different model architectures:")
print("      - Gaussian Process with chemistry kernels")
print("      - Per-target models with different hyperparameters")
print("   c) Multi-task learning:")
print("      - Exploit correlations between SM, P2, P3")
print("   d) Regularization:")
print("      - Stronger dropout, weight decay")
print("      - Early stopping with validation")

print("\n3. VALIDATION STRATEGY:")
print("   - Leave-one-solvent-out for single (24 folds)")
print("   - Leave-one-ramp-out for full (13 folds)")
print("   - This is correct and matches competition requirements")

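============================================================
KEY INSIGHTS FOR IMPROVEMENT
============================================================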

1. CURRENT GAP ANALYSIS:
   - Current CV: 0.081393
   - Target: 0.017270
   - Gap: 4.7x

2. POTENTIAL IMPROVEMENTS:
   a) Better feature engineering:
      - Try DRFP (2048-dim) or fragprints (2133-dim)
      - Add more physics-based features
   b) Different model architectures:
      - Gaussian Process with chemistry kernels
      - Per-target models with different hyperparameters
   c) Multi-task learning:
      - Exploit correlations between SM, P2, P3
   d) Regularization:
      - Stronger dropout, weight decay
      - Early stopping with validation

3. VALIDATION STRATEGY:
   - Leave-one-solvent-out for single (24 folds)
   - Leave-one-ramp-out for full (13 folds)
   - This is correct and matches competition requirements
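
The leave-one-solvent-out scheme described above, together with the per-fold error breakdown named in the goals, can be sketched with scikit-learn's `LeaveOneGroupOut`. Everything in this example is illustrative: the data are synthetic, `Ridge` is a stand-in model rather than the model behind exp_000, and mean squared error is an assumed metric since the competition metric is not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic stand-in: 6 hypothetical solvents with 20 rows each.
rng = np.random.default_rng(0)
n_solvents, n_per = 6, 20
groups = np.repeat([f"solvent_{i}" for i in range(n_solvents)], n_per)
X = rng.normal(size=(n_solvents * n_per, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.0]) + rng.normal(scale=0.05, size=len(X))

fold_errors = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    # Fit on all other solvents, score on the single held-out solvent.
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    resid = y[test_idx] - model.predict(X[test_idx])
    fold_errors[groups[test_idx][0]] = float(np.mean(resid ** 2))

# Sort so the hardest held-out solvents appear first.
per_fold = pd.Series(fold_errors, name="mse").sort_values(ascending=False)
print(per_fold)
print("Mean CV error:", per_fold.mean())
```

For the single-solvent data, `groups` would be `single_data['SOLVENT NAME']`. For the full data, a group label could be formed by joining `SOLVENT A NAME` and `SOLVENT B NAME` so that each ramp is held out as a unit.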
