# Loop 1 LB Feedback Analysis

**Critical issue:** local CV score 0.0111 vs. LB score 0.0982, roughly a 9x gap (0.0982 / 0.0111 ≈ 8.8).

Possible causes:
1. Notebook structure non-compliance (evaluator's concern)
2. CV methodology mismatch with LB evaluation
3. Data leakage in local CV
4. Different evaluation metric on LB
5. Distribution shift between train/test

In [1]:
import pandas as pd
import numpy as np

# Load our submission
submission = pd.read_csv('/home/submission/submission.csv')
print('Submission shape:', submission.shape)
print('\nColumns:', submission.columns.tolist())
print('\nFirst 10 rows:')
print(submission.head(10))

Submission shape: (1883, 8)

Columns: ['id', 'index', 'task', 'fold', 'row', 'target_1', 'target_2', 'target_3']

First 10 rows:
   id  index  task  fold  row  target_1  target_2  target_3
0   0      0     0     0    0  0.011080  0.010218  0.897487
1   1      1     0     0    1  0.028685  0.025615  0.821114
2   2      2     0     0    2  0.067066  0.059342  0.693974
3   3      3     0     0    3  0.104709  0.081861  0.584283
4   4      4     0     0    4  0.125673  0.089247  0.525820
5   5      5     0     0    5  0.132799  0.090096  0.504345
6   6      6     0     0    6  0.132781  0.090095  0.504405
7   7      7     0     0    7  0.132757  0.090094  0.504482
8   8      8     0     0    8  0.132820  0.090097  0.504278
9   9      9     0     0    9  0.132820  0.090097  0.504278


In [2]:
# Check the submission format
print('Task distribution:')
print(submission['task'].value_counts())

print('\nFold distribution for task 0 (single solvent):')
print(submission[submission['task']==0]['fold'].value_counts().sort_index())

print('\nFold distribution for task 1 (full data):')
print(submission[submission['task']==1]['fold'].value_counts().sort_index())

Task distribution:
task
1    1227
0     656
Name: count, dtype: int64

Fold distribution for task 0 (single solvent):
fold
0     37
1     37
2     58
3     59
4     22
5     18
6     34
7     41
8     20
9     22
10    18
11    18
12    42
13    18
14    17
15    22
16     5
17    16
18    36
19    18
20    21
21    22
22    37
23    18
Name: count, dtype: int64

Fold distribution for task 1 (full data):
fold
0     122
1     124
2     104
3     125
4     125
5     124
6     125
7     110
8     127
9      36
10     34
11     36
12     35
Name: count, dtype: int64


In [3]:
# Check prediction ranges
print('Prediction statistics:')
for col in ['target_1', 'target_2', 'target_3']:
    print(f'\n{col}:')
    print(f'  Min: {submission[col].min():.4f}')
    print(f'  Max: {submission[col].max():.4f}')
    print(f'  Mean: {submission[col].mean():.4f}')
    print(f'  Std: {submission[col].std():.4f}')

Prediction statistics:

target_1:
  Min: 0.0001
  Max: 0.4517
  Mean: 0.1678
  Std: 0.1435

target_2:
  Min: 0.0001
  Max: 0.4423
  Mean: 0.1434
  Std: 0.1325

target_3:
  Min: 0.0000
  Max: 0.9970
  Mean: 0.5138
  Std: 0.3497


In [4]:
# Load actual data to compare
DATA_PATH = '/home/data'

single_data = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
full_data = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print('Single solvent data shape:', single_data.shape)
print('Full data shape:', full_data.shape)

# Target columns
print('\nTarget columns in single data:', [c for c in single_data.columns if c in ['SM', 'Product 2', 'Product 3']])
print('Target columns in full data:', [c for c in full_data.columns if c in ['SM', 'Product 2', 'Product 3']])

Single solvent data shape: (656, 13)
Full data shape: (1227, 19)

Target columns in single data: ['SM', 'Product 2', 'Product 3']
Target columns in full data: ['SM', 'Product 2', 'Product 3']


In [5]:
# Check target order - this is critical!
# Our submission has target_1, target_2, target_3
# The template uses Y = df[["Product 2", "Product 3", "SM"]]

print('Target order in our submission:')
print('target_1 = Product 2')
print('target_2 = Product 3')
print('target_3 = SM')

# Let's verify by checking the actual values
print('\nActual target statistics from single solvent data:')
for col in ['Product 2', 'Product 3', 'SM']:
    print(f'{col}: mean={single_data[col].mean():.4f}, std={single_data[col].std():.4f}')

Target order in our submission:
target_1 = Product 2
target_2 = Product 3
target_3 = SM

Actual target statistics from single solvent data:
Product 2: mean=0.1499, std=0.1431
Product 3: mean=0.1234, std=0.1315
SM: mean=0.5222, std=0.3602


In [6]:
# CRITICAL: Check if our target order matches the expected order
# The template loads Y = df[["Product 2", "Product 3", "SM"]]
# So target_1 = Product 2, target_2 = Product 3, target_3 = SM

# But wait - let me check what order we used in our model
# In our baseline, we used: Y = df[["Product 2", "Product 3", "SM"]]
# This should be correct!

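# Rough sanity check: compare prediction means with actual means.
# Note: the submission means cover both tasks (1883 rows), while the actual
# means below come from the single solvent data only (656 rows).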
print('Checking if our predictions make sense...')
print('\nOur predictions (target_3 = SM):')
print(f'  Mean: {submission["target_3"].mean():.4f}')
print(f'  Actual SM mean: {single_data["SM"].mean():.4f}')

print('\nOur predictions (target_1 = Product 2):')
print(f'  Mean: {submission["target_1"].mean():.4f}')
print(f'  Actual Product 2 mean: {single_data["Product 2"].mean():.4f}')

Checking if our predictions make sense...

Our predictions (target_3 = SM):
  Mean: 0.5138
  Actual SM mean: 0.5222

Our predictions (target_1 = Product 2):
  Mean: 0.1678
  Actual Product 2 mean: 0.1499


In [7]:
# Let me check the template's expected submission format more carefully
# The template saves predictions in order: target_1, target_2, target_3
# And the model predicts in the order of Y columns

# In utils.py, TARGET_LABELS is defined - let me check
import sys
sys.path.insert(0, '/home/data')

# Read utils.py to find TARGET_LABELS
with open('/home/data/utils.py', 'r') as f:
    content = f.read()
    print('Looking for TARGET_LABELS in utils.py...')
    for line in content.split('\n'):
        if 'TARGET' in line or 'LABEL' in line:
            print(line)

Looking for TARGET_LABELS in utils.py...
        X = df[INPUT_LABELS_FULL_SOLVENT]
        X = df[INPUT_LABELS_SINGLE_SOLVENT]
    Y = df[TARGET_LABELS]
INPUT_LABELS_FULL_SOLVENT = [
INPUT_LABELS_SINGLE_SOLVENT = [
INPUT_LABELS_NUMERIC = [
INPUT_LABELS_SINGLE_FEATURES = [
INPUT_LABELS_FULL_FEATURES = [
TARGET_LABELS = [
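
The line filter above prints only the opening line of `TARGET_LABELS`, so the actual target order stays hidden. A minimal sketch of one way to recover the full list is to parse the file with the standard `ast` module. The `sample_source` string is a made-up stand-in for `utils.py` (its contents are an assumption, not the real file), and `extract_list_assignment` is a hypothetical helper name.

```python
import ast

def extract_list_assignment(source, name):
    """Return the literal value assigned to `name` in `source`, or None if absent."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    # literal_eval accepts an AST node and only evaluates literals
                    return ast.literal_eval(node.value)
    return None

# Assumed stand-in for utils.py; the real file may list the targets differently.
sample_source = '''
TARGET_LABELS = [
    "Product 2",
    "Product 3",
    "SM",
]
'''

labels = extract_list_assignment(sample_source, "TARGET_LABELS")
print(labels)
```

On the real file, the same helper could be called with `open('/home/data/utils.py').read()` in place of `sample_source`.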


In [8]:
# The key insight: The LB evaluation likely uses a different CV procedure
# or the submission format is different from what we're producing

# Let me check if the issue is with how we're computing MSE locally
# vs how the LB computes it

# Our local MSE calculation:
# MSE = mean((actuals - predictions)^2)

# But the LB might be computing it differently
# For example, it might be computing MSE per fold and then averaging

# Let's recalculate our MSE in different ways
print('Recalculating MSE in different ways...')

# Method 1: Overall MSE (what we did)
print('\nMethod 1: Overall MSE (our method)')
print('Single: 0.010429, Full: 0.011429, Overall: 0.011081')

# Method 2: Average MSE per fold
print('\nMethod 2: Would need to recalculate per-fold MSE and average')

Recalculating MSE in different ways...

Method 1: Overall MSE (our method)
Single: 0.010429, Full: 0.011429, Overall: 0.011081

Method 2: Would need to recalculate per-fold MSE and average
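
Method 2 is left uncomputed above. As a hedged sketch of the difference between the two aggregations, the function below returns both the pooled MSE and the mean of per-fold MSEs. The function name and the toy arrays are illustrative assumptions; whether the LB uses either aggregation is not known from this notebook.

```python
import numpy as np

def pooled_and_per_fold_mse(actuals, preds, folds):
    """Return (pooled MSE over all rows, unweighted mean of per-fold MSEs)."""
    actuals = np.asarray(actuals, dtype=float)
    preds = np.asarray(preds, dtype=float)
    folds = np.asarray(folds)
    sq_err = (actuals - preds) ** 2
    pooled = sq_err.mean()
    # Each fold contributes equally here, regardless of how many rows it has
    per_fold = [sq_err[folds == f].mean() for f in np.unique(folds)]
    return float(pooled), float(np.mean(per_fold))

# Toy data (made up): fold 0 has three perfect rows, fold 1 has one row off by 2.
actuals = [[0.0], [0.0], [0.0], [0.0]]
preds = [[0.0], [0.0], [0.0], [2.0]]
folds = [0, 0, 0, 1]

pooled, fold_avg = pooled_and_per_fold_mse(actuals, preds, folds)
print(pooled, fold_avg)  # 1.0 2.0
```

With task 0 fold sizes ranging from 5 to 59 rows, the two aggregations can diverge noticeably, since small folds carry far more weight under per-fold averaging.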


In [9]:
# HYPOTHESIS: The LB score of 0.0982 is suspiciously close to the reference kernel's score of 0.09831
# This suggests our submission might have been evaluated correctly, but our LOCAL CV is wrong!

# The reference kernel (arrhenius-kinetics-tta) achieved LB 0.09831
# Our submission got LB 0.0982 - almost identical!

# If this hypothesis holds, it would mean:
# 1. Our model runs as intended on the LB
# 2. Our LOCAL CV calculation is WRONG - it's too optimistic

# In that case we may be computing MSE on the wrong data or in the wrong way

print('CRITICAL INSIGHT:')
print('LB score 0.0982 ≈ Reference kernel score 0.09831')
print('This suggests our model is working correctly!')
print('The issue is our LOCAL CV calculation is too optimistic.')
print('')
print('Possible causes:')
print('1. We might be computing MSE on training data instead of test data')
print('2. We might have data leakage in our CV')
print('3. The CV methodology might be different')

CRITICAL INSIGHT:
LB score 0.0982 ≈ Reference kernel score 0.09831
This suggests our model is working correctly!
The issue is our LOCAL CV calculation is too optimistic.

Possible causes:
1. We might be computing MSE on training data instead of test data
2. We might have data leakage in our CV
3. The CV methodology might be different


In [10]:
# Let me check our baseline notebook to see if there's a bug
# Looking at the code, I see we store actuals and predictions correctly
# But let me verify the MSE calculation

# The issue might be that we're computing MSE differently
# Let's check if the LB uses a different metric

# Actually, looking at the competition description, it says:
# "Submissions will be evaluated according to a cross-validation procedure"
# This means the LB runs the ENTIRE CV procedure, not just evaluates predictions

print('KEY INSIGHT:')
print('The competition evaluates by RUNNING the CV procedure on Kaggle!')
print('This means our local CV predictions are not directly comparable.')
print('')
print('The LB score of 0.0982 is the ACTUAL CV score from running our model.')
print('Our local CV score of 0.0111 might be computed incorrectly.')

KEY INSIGHT:
The competition evaluates by RUNNING the CV procedure on Kaggle!
This means our local CV predictions are not directly comparable.

The LB score of 0.0982 is the ACTUAL CV score from running our model.
Our local CV score of 0.0111 might be computed incorrectly.


In [11]:
# Let me verify by checking if our predictions are reasonable
# If our model is predicting well, the predictions should be close to actuals

# Load actuals for single solvent
single_actuals = single_data[['Product 2', 'Product 3', 'SM']].values
print('Single solvent actuals shape:', single_actuals.shape)

# Our predictions for single solvent (task 0)
single_preds = submission[submission['task']==0][['target_1', 'target_2', 'target_3']].values
print('Single solvent predictions shape:', single_preds.shape)

# Calculate MSE row by row; this assumes the task 0 submission rows
# are in the same order as the rows of the CSV
mse = np.mean((single_actuals - single_preds) ** 2)
print(f'\nRecalculated Single Solvent MSE: {mse:.6f}')

Single solvent actuals shape: (656, 3)
Single solvent predictions shape: (656, 3)

Recalculated Single Solvent MSE: 0.109481


In [12]:
# Same for full data
full_actuals = full_data[['Product 2', 'Product 3', 'SM']].values
print('Full data actuals shape:', full_actuals.shape)

# Our predictions for full data (task 1)
full_preds = submission[submission['task']==1][['target_1', 'target_2', 'target_3']].values
print('Full data predictions shape:', full_preds.shape)

# Calculate MSE
mse_full = np.mean((full_actuals - full_preds) ** 2)
print(f'\nRecalculated Full Data MSE: {mse_full:.6f}')

# Overall
n_single = len(single_actuals)
n_full = len(full_actuals)
overall_mse = (mse * n_single + mse_full * n_full) / (n_single + n_full)
print(f'\nOverall MSE: {overall_mse:.6f}')

Full data actuals shape: (1227, 3)
Full data predictions shape: (1227, 3)

Recalculated Full Data MSE: 0.011429

Overall MSE: 0.045588


In [13]:
# The recalculated MSE should match our original calculation
# If it doesn't, either our original code has a bug or the rows compared here are misaligned

# Let me also check if the predictions are in the right order
# The submission should have predictions for each fold in order

print('Checking prediction order...')
print('\nTask 0 (single solvent):')
for fold in range(24):
    fold_data = submission[(submission['task']==0) & (submission['fold']==fold)]
    print(f'  Fold {fold}: {len(fold_data)} rows')

Checking prediction order...

Task 0 (single solvent):
  Fold 0: 37 rows
  Fold 1: 37 rows
  Fold 2: 58 rows
  Fold 3: 59 rows
  Fold 4: 22 rows
  Fold 5: 18 rows
  Fold 6: 34 rows
  Fold 7: 41 rows
  Fold 8: 20 rows
  Fold 9: 22 rows
  Fold 10: 18 rows
  Fold 11: 18 rows
  Fold 12: 42 rows
  Fold 13: 18 rows
  Fold 14: 17 rows
  Fold 15: 22 rows
  Fold 16: 5 rows
  Fold 17: 16 rows
  Fold 18: 36 rows
  Fold 19: 18 rows
  Fold 20: 21 rows
  Fold 21: 22 rows
  Fold 22: 37 rows
  Fold 23: 18 rows


In [14]:
# Now I understand the issue!
# The submission file contains predictions for EACH FOLD of the CV
# But the actuals are the FULL dataset

# The correct way to compute MSE is to match predictions to actuals BY FOLD
# Each fold's predictions correspond to the test set for that fold

# For single solvent CV (leave-one-solvent-out):
# - Fold 0 predictions are for solvent 0's data
# - Fold 1 predictions are for solvent 1's data
# etc.

# Let me verify this by checking the number of rows per fold
print('Verifying fold structure...')
print('\nSingle solvent data by solvent:')
solvent_counts = single_data['SOLVENT NAME'].value_counts().sort_index()
print(solvent_counts)

print('\nSubmission rows per fold (task 0):')
for fold in range(24):
    fold_data = submission[(submission['task']==0) & (submission['fold']==fold)]
    print(f'  Fold {fold}: {len(fold_data)} rows')

Verifying fold structure...

Single solvent data by solvent:
SOLVENT NAME
1,1,1,3,3,3-Hexafluoropropan-2-ol     37
2,2,2-Trifluoroethanol                37
2-Methyltetrahydrofuran [2-MeTHF]     58
Acetonitrile                          59
Acetonitrile.Acetic Acid              22
Butanone [MEK]                        18
Cyclohexane                           34
DMA [N,N-Dimethylacetamide]           41
Decanol                               20
Diethyl Ether [Ether]                 22
Dihydrolevoglucosenone (Cyrene)       18
Dimethyl Carbonate                    18
Ethanol                               42
Ethyl Acetate                         18
Ethyl Lactate                         17
Ethylene Glycol [1,2-Ethanediol]      22
IPA [Propan-2-ol]                      5
MTBE [tert-Butylmethylether]          16
Methanol                              36
Methyl Propionate                     18
THF [Tetrahydrofuran]                 21
Water.2,2,2-Trifluoroethanol          22
Water.Acetonitrile      

In [15]:
# The fold sizes should match the solvent counts!
# Let me verify this more carefully

solvents = sorted(single_data['SOLVENT NAME'].unique())
print('Solvents in order:', solvents[:5], '...')

for i, solvent in enumerate(solvents):
    solvent_count = len(single_data[single_data['SOLVENT NAME'] == solvent])
    fold_count = len(submission[(submission['task']==0) & (submission['fold']==i)])
    match = '✓' if solvent_count == fold_count else '✗'
    print(f'Fold {i} ({solvent}): solvent={solvent_count}, submission={fold_count} {match}')

Solvents in order: ['1,1,1,3,3,3-Hexafluoropropan-2-ol', '2,2,2-Trifluoroethanol', '2-Methyltetrahydrofuran [2-MeTHF]', 'Acetonitrile', 'Acetonitrile.Acetic Acid'] ...
Fold 0 (1,1,1,3,3,3-Hexafluoropropan-2-ol): solvent=37, submission=37 ✓
Fold 1 (2,2,2-Trifluoroethanol): solvent=37, submission=37 ✓
Fold 2 (2-Methyltetrahydrofuran [2-MeTHF]): solvent=58, submission=58 ✓
Fold 3 (Acetonitrile): solvent=59, submission=59 ✓
Fold 4 (Acetonitrile.Acetic Acid): solvent=22, submission=22 ✓
Fold 5 (Butanone [MEK]): solvent=18, submission=18 ✓
Fold 6 (Cyclohexane): solvent=34, submission=34 ✓
Fold 7 (DMA [N,N-Dimethylacetamide]): solvent=41, submission=41 ✓
Fold 8 (Decanol): solvent=20, submission=20 ✓
Fold 9 (Diethyl Ether [Ether]): solvent=22, submission=22 ✓
Fold 10 (Dihydrolevoglucosenone (Cyrene)): solvent=18, submission=18 ✓
Fold 11 (Dimethyl Carbonate): solvent=18, submission=18 ✓
Fold 12 (Ethanol): solvent=42, submission=42 ✓
Fold 13 (Ethyl Acetate): solvent=18, submission=18 ✓
Fold 14 (
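
Matching row counts is a necessary check but not a sufficient one, because several folds share the same size. A small sketch using the task 0 fold sizes printed earlier in this notebook shows how many folds a count alone cannot tell apart:

```python
from collections import Counter

# Task 0 fold sizes, copied from the fold distribution printed earlier.
fold_sizes = [37, 37, 58, 59, 22, 18, 34, 41, 20, 22, 18, 18,
              42, 18, 17, 22, 5, 16, 36, 18, 21, 22, 37, 18]

size_counts = Counter(fold_sizes)
# A size shared by more than one fold cannot identify a solvent on its own.
ambiguous = {size: n for size, n in size_counts.items() if n > 1}
n_ambiguous_folds = sum(ambiguous.values())

print(ambiguous)
print(f'{n_ambiguous_folds} of {len(fold_sizes)} folds share a size with another fold')
```

For those folds, a stronger check would compare an identifying column (for example the solvent name) rather than the row count, assuming such a column is available alongside the predictions.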

In [16]:
# Now let me properly compute the MSE by matching predictions to actuals
# For each fold, the predictions correspond to the test set (one solvent)

def compute_cv_mse_single():
    solvents = sorted(single_data['SOLVENT NAME'].unique())
    all_preds = []
    all_actuals = []
    
    for fold_idx, solvent in enumerate(solvents):
        # Get actuals for this solvent
        mask = single_data['SOLVENT NAME'] == solvent
        actuals = single_data[mask][['Product 2', 'Product 3', 'SM']].values
        
        # Get predictions for this fold
        fold_preds = submission[(submission['task']==0) & (submission['fold']==fold_idx)]
        preds = fold_preds[['target_1', 'target_2', 'target_3']].values
        
        all_preds.append(preds)
        all_actuals.append(actuals)
    
    all_preds = np.vstack(all_preds)
    all_actuals = np.vstack(all_actuals)
    
    mse = np.mean((all_actuals - all_preds) ** 2)
    return mse, all_preds, all_actuals

mse_single, preds_single, actuals_single = compute_cv_mse_single()
print(f'Properly computed Single Solvent MSE: {mse_single:.6f}')

Properly computed Single Solvent MSE: 0.010429


In [17]:
# Now for full data (leave-one-ramp-out)
def compute_cv_mse_full():
    # Get unique ramps
    ramps = full_data[['SOLVENT A NAME', 'SOLVENT B NAME']].drop_duplicates()
    ramps = ramps.sort_values(['SOLVENT A NAME', 'SOLVENT B NAME']).reset_index(drop=True)
    
    all_preds = []
    all_actuals = []
    
    for fold_idx, (_, row) in enumerate(ramps.iterrows()):
        # Get actuals for this ramp
        mask = (full_data['SOLVENT A NAME'] == row['SOLVENT A NAME']) & \
               (full_data['SOLVENT B NAME'] == row['SOLVENT B NAME'])
        actuals = full_data[mask][['Product 2', 'Product 3', 'SM']].values
        
        # Get predictions for this fold
        fold_preds = submission[(submission['task']==1) & (submission['fold']==fold_idx)]
        preds = fold_preds[['target_1', 'target_2', 'target_3']].values
        
        if len(actuals) != len(preds):
            print(f'WARNING: Fold {fold_idx} mismatch: actuals={len(actuals)}, preds={len(preds)}')
        
        all_preds.append(preds)
        all_actuals.append(actuals)
    
    all_preds = np.vstack(all_preds)
    all_actuals = np.vstack(all_actuals)
    
    mse = np.mean((all_actuals - all_preds) ** 2)
    return mse, all_preds, all_actuals

mse_full, preds_full, actuals_full = compute_cv_mse_full()
print(f'Properly computed Full Data MSE: {mse_full:.6f}')

Properly computed Full Data MSE: 0.089905


In [18]:
# Overall MSE
n_single = len(actuals_single)
n_full = len(actuals_full)
overall_mse = (mse_single * n_single + mse_full * n_full) / (n_single + n_full)

print(f'\n=== FINAL VERIFICATION ===')
print(f'Single Solvent MSE: {mse_single:.6f} (n={n_single})')
print(f'Full Data MSE: {mse_full:.6f} (n={n_full})')
print(f'Overall MSE: {overall_mse:.6f}')
print(f'\nLB Score: 0.0982')
print(f'Gap: {abs(overall_mse - 0.0982):.6f}')


=== FINAL VERIFICATION ===
Single Solvent MSE: 0.010429 (n=656)
Full Data MSE: 0.089905 (n=1227)
Overall MSE: 0.062217

LB Score: 0.0982
Gap: 0.035983
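
The overall figure is a row-weighted average of the two task MSEs. As a worked check that uses only numbers already reported in this notebook, the sketch below reproduces both overall values; `weighted_mse` is a hypothetical helper name.

```python
def weighted_mse(parts):
    """Combine per-task MSEs, weighting each by its number of rows."""
    total_rows = sum(n for _, n in parts)
    return sum(mse * n for mse, n in parts) / total_rows

# (MSE, rows) pairs reported in this notebook.
original = weighted_mse([(0.010429, 656), (0.011429, 1227)])
sorted_ramps = weighted_mse([(0.010429, 656), (0.089905, 1227)])

print(f'{original:.6f} {sorted_ramps:.6f}')  # 0.011081 0.062217
```

Because the full data task has almost twice as many rows as the single solvent task, a change in the full data MSE dominates the overall score.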


In [19]:
# CONCLUSION:
# If the properly computed MSE matches our original calculation (~0.011),
# then the issue is that the LB evaluates differently.

# If the properly computed MSE is closer to 0.0982,
# then our original calculation was wrong.

print('\nCONCLUSION:')
if abs(overall_mse - 0.0982) < 0.01:
    print('Our properly computed MSE matches LB - original calculation was wrong!')
elif abs(overall_mse - 0.011) < 0.001:  # tight tolerance: 0.01 would accept nearly any small MSE
    print('Our MSE calculation is correct - LB evaluates differently!')
    print('This could be due to:')
    print('1. Different random seeds on Kaggle')
    print('2. Different PyTorch/NumPy versions')
    print('3. Different GPU behavior')
else:
    print(f'MSE is {overall_mse:.6f} - need to investigate further')


CONCLUSION:
MSE is 0.062217 - need to investigate further


In [20]:
# The issue is clear: fold ordering mismatch in full data!
# Let me check how the ramps are ordered in our submission vs the data

print('Ramps in our submission (by fold size):')
for fold in range(13):
    fold_size = len(submission[(submission['task']==1) & (submission['fold']==fold)])
    print(f'  Fold {fold}: {fold_size} rows')

print('\nRamps in data (sorted by SOLVENT A NAME, SOLVENT B NAME):')
ramps = full_data[['SOLVENT A NAME', 'SOLVENT B NAME']].drop_duplicates()
ramps = ramps.sort_values(['SOLVENT A NAME', 'SOLVENT B NAME']).reset_index(drop=True)
for i, (_, row) in enumerate(ramps.iterrows()):
    mask = (full_data['SOLVENT A NAME'] == row['SOLVENT A NAME']) & \
           (full_data['SOLVENT B NAME'] == row['SOLVENT B NAME'])
    count = mask.sum()
    print(f'  Ramp {i} ({row["SOLVENT A NAME"]} + {row["SOLVENT B NAME"]}): {count} rows')

Ramps in our submission (by fold size):
  Fold 0: 122 rows
  Fold 1: 124 rows
  Fold 2: 104 rows
  Fold 3: 125 rows
  Fold 4: 125 rows
  Fold 5: 124 rows
  Fold 6: 125 rows
  Fold 7: 110 rows
  Fold 8: 127 rows
  Fold 9: 36 rows
  Fold 10: 34 rows
  Fold 11: 36 rows
  Fold 12: 35 rows

Ramps in data (sorted by SOLVENT A NAME, SOLVENT B NAME):
  Ramp 0 (1,1,1,3,3,3-Hexafluoropropan-2-ol + 2-Methyltetrahydrofuran [2-MeTHF]): 124 rows
  Ramp 1 (2,2,2-Trifluoroethanol + Water.2,2,2-Trifluoroethanol): 125 rows
  Ramp 2 (2-Methyltetrahydrofuran [2-MeTHF] + Diethyl Ether [Ether]): 124 rows
  Ramp 3 (Acetonitrile + Acetonitrile.Acetic Acid): 125 rows
  Ramp 4 (Cyclohexane + IPA [Propan-2-ol]): 104 rows
  Ramp 5 (DMA [N,N-Dimethylacetamide] + Decanol): 110 rows
  Ramp 6 (Dihydrolevoglucosenone (Cyrene) + Ethyl Acetate): 36 rows
  Ramp 7 (Ethanol + THF [Tetrahydrofuran]): 127 rows
  Ramp 8 (MTBE [tert-Butylmethylether] + Butanone [MEK]): 34 rows
  Ramp 9 (Methanol + Ethylene Glycol [1,2-Ethanedi

In [21]:
# The key insight: Our generate_leave_one_ramp_out_splits function might be ordering differently
# Let me check how the competition's utils.py orders the ramps

# Looking at our baseline code, we used:
# ramps = X[["SOLVENT A NAME", "SOLVENT B NAME"]].drop_duplicates()
# for _, row in ramps.iterrows():  # This iterates in the ORDER they appear in the dataframe!

# But the competition's utils.py might sort them differently
# Let me check the order in which ramps appear in the data

print('Ramps in order of appearance in full_data:')
seen_ramps = []
for i, row in full_data.iterrows():
    ramp = (row['SOLVENT A NAME'], row['SOLVENT B NAME'])
    if ramp not in seen_ramps:
        seen_ramps.append(ramp)
        mask = (full_data['SOLVENT A NAME'] == ramp[0]) & (full_data['SOLVENT B NAME'] == ramp[1])
        count = mask.sum()
        print(f'  Ramp {len(seen_ramps)-1} ({ramp[0]} + {ramp[1]}): {count} rows')

Ramps in order of appearance in full_data:
  Ramp 0 (Methanol + Ethylene Glycol [1,2-Ethanediol]): 122 rows
  Ramp 1 (1,1,1,3,3,3-Hexafluoropropan-2-ol + 2-Methyltetrahydrofuran [2-MeTHF]): 124 rows
  Ramp 2 (Cyclohexane + IPA [Propan-2-ol]): 104 rows
  Ramp 3 (Water.Acetonitrile + Acetonitrile): 125 rows
  Ramp 4 (Acetonitrile + Acetonitrile.Acetic Acid): 125 rows
  Ramp 5 (2-Methyltetrahydrofuran [2-MeTHF] + Diethyl Ether [Ether]): 124 rows
  Ramp 6 (2,2,2-Trifluoroethanol + Water.2,2,2-Trifluoroethanol): 125 rows
  Ramp 7 (DMA [N,N-Dimethylacetamide] + Decanol): 110 rows
  Ramp 8 (Ethanol + THF [Tetrahydrofuran]): 127 rows
  Ramp 9 (Dihydrolevoglucosenone (Cyrene) + Ethyl Acetate): 36 rows
  Ramp 10 (MTBE [tert-Butylmethylether] + Butanone [MEK]): 34 rows
  Ramp 11 (tert-Butanol [2-Methylpropan-2-ol] + Dimethyl Carbonate): 36 rows
  Ramp 12 (Methyl Propionate + Ethyl Lactate): 35 rows
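
To make the ordering mismatch concrete, the sketch below compares the order of appearance with a plain lexicographic sort, using the ramp names and row counts printed above. It uses Python's built-in `sorted`, which reproduces the sorted listing shown earlier as far as that listing is visible; the variable names are illustrative.

```python
# (SOLVENT A NAME, SOLVENT B NAME, rows) in order of appearance, copied from the output above.
ramps_by_appearance = [
    ('Methanol', 'Ethylene Glycol [1,2-Ethanediol]', 122),
    ('1,1,1,3,3,3-Hexafluoropropan-2-ol', '2-Methyltetrahydrofuran [2-MeTHF]', 124),
    ('Cyclohexane', 'IPA [Propan-2-ol]', 104),
    ('Water.Acetonitrile', 'Acetonitrile', 125),
    ('Acetonitrile', 'Acetonitrile.Acetic Acid', 125),
    ('2-Methyltetrahydrofuran [2-MeTHF]', 'Diethyl Ether [Ether]', 124),
    ('2,2,2-Trifluoroethanol', 'Water.2,2,2-Trifluoroethanol', 125),
    ('DMA [N,N-Dimethylacetamide]', 'Decanol', 110),
    ('Ethanol', 'THF [Tetrahydrofuran]', 127),
    ('Dihydrolevoglucosenone (Cyrene)', 'Ethyl Acetate', 36),
    ('MTBE [tert-Butylmethylether]', 'Butanone [MEK]', 34),
    ('tert-Butanol [2-Methylpropan-2-ol]', 'Dimethyl Carbonate', 36),
    ('Methyl Propionate', 'Ethyl Lactate', 35),
]
ramps_sorted = sorted(ramps_by_appearance, key=lambda r: (r[0], r[1]))

pairs = list(zip(ramps_by_appearance, ramps_sorted))
same_size = [i for i, (a, s) in enumerate(pairs) if a[2] == s[2]]
same_ramp = [i for i, (a, s) in enumerate(pairs) if a[:2] == s[:2]]

print('Folds with equal row counts under both orderings:', same_size)  # [3]
print('Folds that are actually the same ramp:', same_ramp)             # []
```

No fold index refers to the same ramp under both orderings, and the single fold whose row counts agree (fold 3) still pairs two different ramps, which is another reason not to rely on row counts alone.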


In [22]:
# Now let me properly compute the MSE using the ORDER OF APPEARANCE
def compute_cv_mse_full_correct():
    # Get ramps in order of appearance (drop_duplicates keeps the first occurrence of each pair)
    ramp_frame = full_data[['SOLVENT A NAME', 'SOLVENT B NAME']].drop_duplicates()
    seen_ramps = list(ramp_frame.itertuples(index=False, name=None))
    
    all_preds = []
    all_actuals = []
    
    for fold_idx, ramp in enumerate(seen_ramps):
        # Get actuals for this ramp
        mask = (full_data['SOLVENT A NAME'] == ramp[0]) & (full_data['SOLVENT B NAME'] == ramp[1])
        actuals = full_data[mask][['Product 2', 'Product 3', 'SM']].values
        
        # Get predictions for this fold
        fold_preds = submission[(submission['task']==1) & (submission['fold']==fold_idx)]
        preds = fold_preds[['target_1', 'target_2', 'target_3']].values
        
        # Fail loudly on a size mismatch: silently skipping the fold would bias the MSE
        if len(actuals) != len(preds):
            raise ValueError(f'Fold {fold_idx} mismatch: actuals={len(actuals)}, preds={len(preds)}')

        all_preds.append(preds)
        all_actuals.append(actuals)
    
    all_preds = np.vstack(all_preds)
    all_actuals = np.vstack(all_actuals)
    
    mse = np.mean((all_actuals - all_preds) ** 2)
    return mse, all_preds, all_actuals

mse_full_correct, preds_full_correct, actuals_full_correct = compute_cv_mse_full_correct()
print(f'Correctly computed Full Data MSE: {mse_full_correct:.6f}')

Correctly computed Full Data MSE: 0.011429


In [23]:
# FINAL SUMMARY
print('=== FINAL SUMMARY ===')
print(f'Single Solvent MSE: {mse_single:.6f}')
print(f'Full Data MSE: {mse_full_correct:.6f}')

n_single = len(actuals_single)
n_full = len(actuals_full_correct)
overall_mse = (mse_single * n_single + mse_full_correct * n_full) / (n_single + n_full)
print(f'Overall MSE: {overall_mse:.6f}')

print(f'\nLB Score: 0.0982')
print(f'CV-LB Gap: {0.0982 - overall_mse:.6f}')

print('\n=== DIAGNOSIS ===')
print('Our local CV score (0.0111) is REPRODUCIBLE from the submission file')
print('once full-data folds are matched to ramps in order of appearance.')
print('The LB score (0.0982) is MUCH WORSE.')
print('')
print('This suggests the LB is evaluating our model differently!')
print('Possible causes:')
print('1. The LB runs the model from scratch with different random seeds')
print('2. The LB uses different PyTorch/NumPy versions')
print('3. The LB has different GPU behavior')
print('4. Our model has high variance between runs')
print('')
print('The reference kernel (arrhenius-kinetics-tta) also got ~0.098 on LB.')
print('This suggests the LB evaluation is consistent, but our model')
print('performs differently when run on Kaggle vs locally.')

=== FINAL SUMMARY ===
Single Solvent MSE: 0.010429
Full Data MSE: 0.011429
Overall MSE: 0.011081

LB Score: 0.0982
CV-LB Gap: 0.087119

=== DIAGNOSIS ===
Our local CV score (0.0111) is REPRODUCIBLE from the submission file
once full-data folds are matched to ramps in order of appearance.
The LB score (0.0982) is MUCH WORSE.

This suggests the LB is evaluating our model differently!
Possible causes:
1. The LB runs the model from scratch with different random seeds
2. The LB uses different PyTorch/NumPy versions
3. The LB has different GPU behavior
4. Our model has high variance between runs

The reference kernel (arrhenius-kinetics-tta) also got ~0.098 on LB.
This suggests the LB evaluation is consistent, but our model
performs differently when run on Kaggle vs locally.
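
One way to probe cause 4 (high variance between runs) would be to repeat the local CV under several seeds and compare the spread with the 0.087 gap. The sketch below is only a harness: `summarize_runs` is a hypothetical helper, and `fake_run_cv` is a placeholder that returns made-up numbers where a real CV run would go.

```python
import random
import statistics

def summarize_runs(run_cv, seeds):
    """Call run_cv(seed) for each seed and return (mean, stdev, scores)."""
    scores = [run_cv(seed) for seed in seeds]
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return statistics.mean(scores), spread, scores

def fake_run_cv(seed):
    # Placeholder only: stands in for a full CV run that returns an overall MSE.
    rng = random.Random(seed)
    return 0.011 + rng.uniform(-0.001, 0.001)

mean_mse, std_mse, scores = summarize_runs(fake_run_cv, seeds=[0, 1, 2, 3, 4])
print(f'mean={mean_mse:.4f} std={std_mse:.4f}')
```

If the spread from real runs turned out to be small relative to the gap, seed variance alone would be an unlikely explanation, and the remaining causes would deserve more attention.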
