# Loop 2 LB Feedback Analysis

## Submission Results
- **exp_000 (MLP)**: CV 0.0111 → LB 0.0982 (gap, CV - LB: -0.0871)
- **exp_001 (LightGBM)**: CV 0.0123 → LB 0.1065 (gap, CV - LB: -0.0942)
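
The absolute gaps hide how similar the two experiments are in relative terms. A quick arithmetic check, using only the four scores listed above (the `results` and `summary` names are just for illustration):

```python
# Scores copied from the results list above.
results = {
    "exp_000 (MLP)": {"cv": 0.0111, "lb": 0.0982},
    "exp_001 (LightGBM)": {"cv": 0.0123, "lb": 0.1065},
}

summary = {}
for name, s in results.items():
    gap = s["cv"] - s["lb"]      # sign convention: CV minus LB
    ratio = s["lb"] / s["cv"]    # how many times larger the LB score is
    summary[name] = (round(gap, 4), round(ratio, 2))
    print(f"{name}: gap {gap:+.4f}, LB/CV ratio {ratio:.2f}x")
```

Both experiments land at an LB/CV ratio between 8.6 and 8.9, which looks more like a systematic difference in how the score is computed than like run-to-run noise.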

## Key Insight
LightGBM performed WORSE on the LB despite being deterministic, and its CV-LB gap is about the same size as the MLP's. This is strong evidence against the hypothesis that model variance causes the CV-LB gap.

## What's Really Happening?
The competition reruns each notebook from scratch on Kaggle, so the LB score IS the CV score as computed on Kaggle. Local CV and LB should therefore agree, yet the LB is nearly nine times larger. Why?

Possible causes:
1. **Different random seeds** - Kaggle environment may use different seeds
2. **Different data splits** - The CV splits might be computed differently
3. **Notebook structure non-compliance** - Our notebooks may not follow the exact template
4. **Environment differences** - Library versions, GPU vs CPU, etc.
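
Cause 4 is the cheapest to rule out: run the same report locally and on Kaggle, then compare the two outputs. The sketch below uses only the standard library; the `environment_report` helper and the package list are assumptions, not part of the competition template.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Return Python/platform info plus installed versions of `packages`."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Package list is an assumption; adjust it to whatever the notebook imports.
report = environment_report(["numpy", "pandas", "lightgbm", "torch"])
for key, value in report.items():
    print(f"{key}: {value}")
```

A version mismatch alone would be unlikely to produce a gap this large, so this is a sanity check rather than a probable explanation.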

In [1]:
# Let's examine the competition template structure
import pandas as pd
import numpy as np

# Load the template notebook to understand the exact structure required
print("Analyzing competition requirements...")
print("\nThe competition states:")
print("- Last 3 cells must be EXACTLY as in template")
print("- Only model definition line can be changed")
print("- Same hyperparameters must be used across all folds")

Analyzing competition requirements...

The competition states:
- Last 3 cells must be EXACTLY as in template
- Only model definition line can be changed
- Same hyperparameters must be used across all folds


In [2]:
# Let's check what the reference kernel achieved
print("Reference kernel (arrhenius-kinetics-tta) LB: 0.09831")
print("Our MLP LB: 0.0982")
print("Our LightGBM LB: 0.1065")
print("\nOur MLP matches the reference kernel almost exactly!")
print("This suggests our MLP implementation is correct.")
print("\nLightGBM is worse - tree models may not generalize as well to unseen solvents.")

Reference kernel (arrhenius-kinetics-tta) LB: 0.09831
Our MLP LB: 0.0982
Our LightGBM LB: 0.1065

Our MLP matches the reference kernel almost exactly!
This suggests our MLP implementation is correct.

LightGBM is worse - tree models may not generalize as well to unseen solvents.


In [3]:
# Key realization: the LB score IS the CV score on Kaggle.
# Working hypothesis: the gap is not due to variance. More likely the
# competition computes the score differently (metric or aggregation),
# or our local CV is flawed.

print("CRITICAL REALIZATION:")
print("="*50)
print("The LB score ~0.098 is the ACTUAL CV score when run on Kaggle.")
print("Our local CV of 0.011 is computed differently.")
print("\nPossible explanations:")
print("1. The competition may weight folds differently")
print("2. The competition may use a different metric")
print("3. Our local CV may have a bug (data leakage?)")
print("4. The competition may evaluate on held-out test data, not CV")

CRITICAL REALIZATION:
==================================================
The LB score ~0.098 is the ACTUAL CV score when run on Kaggle.
Our local CV of 0.011 is computed differently.

Possible explanations:
1. The competition may weight folds differently
2. The competition may use a different metric
3. Our local CV may have a bug (data leakage?)
4. The competition may evaluate on held-out test data, not CV
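
Explanation 1 is easy to illustrate. Averaging per-fold MSEs and pooling all rows into one MSE give different numbers whenever folds differ in size and difficulty. The sketch below uses synthetic values only; it does not say which aggregation the competition actually uses.

```python
def mse(pairs):
    """Mean squared error over (prediction, actual) pairs."""
    return sum((p - a) ** 2 for p, a in pairs) / len(pairs)


# Synthetic folds of (prediction, actual) pairs.
# Fold "b" is small but badly predicted, like a hard held-out solvent.
folds = {
    "a": [(0.50, 0.52), (0.40, 0.41), (0.30, 0.28), (0.20, 0.21)],
    "b": [(0.10, 0.50)],
}

# Option 1: every fold counts equally, regardless of its size.
fold_mean = sum(mse(pairs) for pairs in folds.values()) / len(folds)
# Option 2: every row counts equally, so large folds dominate.
pooled = mse([pair for pairs in folds.values() for pair in pairs])

print(f"mean of per-fold MSEs: {fold_mean:.4f}")
print(f"pooled MSE over rows:  {pooled:.4f}")
```

With leave-one-solvent-out folds of unequal size, a single poorly predicted solvent can move the fold-averaged score far more than the pooled one, so it is worth confirming that local CV aggregates the same way the template does.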


In [4]:
# Let's re-examine our CV methodology
DATA_PATH = '/home/data'

# Load data
df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
df_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print(f"Single solvent data: {len(df_single)} samples, {df_single['SOLVENT NAME'].nunique()} solvents")
print(f"Full data: {len(df_full)} samples")

# Check unique ramps in full data
ramps = df_full[['SOLVENT A NAME', 'SOLVENT B NAME']].drop_duplicates()
print(f"Unique ramps in full data: {len(ramps)}")

Single solvent data: 656 samples, 24 solvents
Full data: 1227 samples
Unique ramps in full data: 13


In [5]:
# The target is 0.0333. Our local CV (0.011) is already below it, so the
# target would be met if the LB score matched our local CV.
# Current best LB: 0.0982 (MLP)
# Target: 0.0333
# Gap to close: 0.0649

print("Target Analysis:")
print(f"Target: 0.0333")
print(f"Best LB: 0.0982")
print(f"Gap to close: {0.0982 - 0.0333:.4f}")
print("\nTo beat the target, we need ~66% improvement from current best.")
print("\nStrategies to explore:")
print("1. DRFP features (2048-dim) - reported MSE 0.0039 in GNN benchmarks")
print("2. Better model architecture (GNN with attention)")
print("3. Ensemble MLP + other models")
print("4. Feature engineering improvements")

Target Analysis:
Target: 0.0333
Best LB: 0.0982
Gap to close: 0.0649

To beat the target, we need ~66% improvement from current best.

Strategies to explore:
1. DRFP features (2048-dim) - reported MSE 0.0039 in GNN benchmarks
2. Better model architecture (GNN with attention)
3. Ensemble MLP + other models
4. Feature engineering improvements


In [6]:
# Let's check if there's something wrong with our CV calculation
# by examining the actual predictions vs actuals

# Load our submission
submission = pd.read_csv('/home/submission/submission.csv')
print(f"Submission shape: {submission.shape}")
print(submission.head())

# The submission has predictions but not actuals
# We need to verify our CV is computed correctly

Submission shape: (1883, 8)
   id  index  task  fold  row  target_1  target_2  target_3
0   0      0     0     0    0  0.006372  0.031107  0.860582
1   1      1     0     0    1  0.022224  0.035327  0.823031
2   2      2     0     0    2  0.027948  0.040687  0.816167
3   3      3     0     0    3  0.051174  0.062100  0.720941
4   4      4     0     0    4  0.064178  0.090988  0.613215
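
Because the submission holds predictions only, one way to verify the local CV is to join the predictions to the ground truth and recompute the error per fold. The sketch below assumes a hypothetical `actual` table keyed by the same `id` and carrying the same three target columns. The `score_by_fold` helper is illustrative, and the data is synthetic.

```python
import pandas as pd

TARGETS = ["target_1", "target_2", "target_3"]


def score_by_fold(pred, actual):
    """Join predictions to actuals on `id`; return MSE per (task, fold)."""
    merged = pred.merge(
        actual, on="id", suffixes=("_pred", "_true"), validate="one_to_one"
    )
    # Squared error averaged over the three targets, one value per row.
    merged["sq_err"] = sum(
        (merged[f"{t}_pred"] - merged[f"{t}_true"]) ** 2 for t in TARGETS
    ) / len(TARGETS)
    return merged.groupby(["task", "fold"])["sq_err"].mean()


# Tiny synthetic stand-in for submission.csv and the hypothetical actuals.
pred = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "task": [0, 0, 0, 1],
    "fold": [0, 0, 1, 0],
    "target_1": [0.1, 0.2, 0.3, 0.4],
    "target_2": [0.1, 0.2, 0.3, 0.4],
    "target_3": [0.8, 0.6, 0.4, 0.2],
})
actual = pred[["id"] + TARGETS].copy()
actual.loc[actual["id"] == 2, TARGETS] = [0.4, 0.4, 0.5]  # perturb one row

per_fold = score_by_fold(pred, actual)
print(per_fold)
print(f"mean over folds: {per_fold.mean():.6f}")
```

If a score recomputed this way from the real submission disagrees with the CV value logged during training, the discrepancy is in the local CV code rather than on the Kaggle side.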


In [7]:
# IMPORTANT: The competition template shows the exact evaluation structure
# Let's understand what the competition actually evaluates

print("Competition Evaluation Structure:")
print("="*50)
print("1. Task 0: Single solvent - 24 folds (leave-one-solvent-out)")
print("2. Task 1: Full data - 13 folds (leave-one-ramp-out)")
print("\nThe submission.csv contains predictions for all folds.")
print("The competition computes MSE by comparing to actual values.")
print("\nOur local CV computes MSE during training.")
print("If these match, the scores should be identical.")

Competition Evaluation Structure:
==================================================
1. Task 0: Single solvent - 24 folds (leave-one-solvent-out)
2. Task 1: Full data - 13 folds (leave-one-ramp-out)

The submission.csv contains predictions for all folds.
The competition computes MSE by comparing to actual values.

Our local CV computes MSE during training.
If these match, the scores should be identical.
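
The leave-one-solvent-out structure described above can be sketched in a few lines, together with a check that the held-out solvent never appears in the training rows. This is a minimal illustration with toy labels. The competition template's own split code is not reproduced here, so the `leave_one_group_out` helper is hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx) for each distinct group label."""
    for held_out in sorted(set(groups)):
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


# Toy labels; in the notebook these would come from df_single['SOLVENT NAME'].
solvents = ["Methanol", "Methanol", "Ethanol", "Water.Acetonitrile", "Ethanol"]

n_folds = 0
for held_out, train_idx, test_idx in leave_one_group_out(solvents):
    train_groups = {solvents[i] for i in train_idx}
    # Leakage check: the held-out solvent must not be in the training rows.
    assert held_out not in train_groups
    # Every row belongs to exactly one side of the split.
    assert len(train_idx) + len(test_idx) == len(solvents)
    n_folds += 1

print(f"{n_folds} folds, one per solvent")
```

For the full data the group label would be the (SOLVENT A NAME, SOLVENT B NAME) pair instead of a single solvent name. A local CV that splits rows at random rather than by group would score far too optimistically, which is one way a leakage bug could produce the observed gap.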


In [8]:
# Key insight: The reference kernel achieves 0.098 on LB
# This is the baseline performance for this problem
# 
# The target of 0.0333 is ~3x better than baseline
# This suggests significant improvements are possible
#
# The GNN benchmark reports 0.0039 - that's ~25x better than baseline!
# If that figure is comparable to the LB metric, better approaches exist

print("Performance Comparison:")
print(f"Reference baseline LB: 0.098")
print(f"Target: 0.0333 (3x better than baseline)")
print(f"GNN benchmark: 0.0039 (25x better than baseline)")
print("\nThe GNN benchmark used:")
print("- DRFP features (2048-dim)")
print("- Graph Attention Networks")
print("- Mixture-aware encodings")
print("\nWe should try DRFP features with our models.")

Performance Comparison:
Reference baseline LB: 0.098
Target: 0.0333 (3x better than baseline)
GNN benchmark: 0.0039 (25x better than baseline)

The GNN benchmark used:
- DRFP features (2048-dim)
- Graph Attention Networks
- Mixture-aware encodings

We should try DRFP features with our models.


In [9]:
# Load DRFP features to understand them
drfp = pd.read_csv(f'{DATA_PATH}/drfps_catechol_lookup.csv', index_col=0)
print(f"DRFP shape: {drfp.shape}")
print(f"Sparsity: {(drfp.values == 0).mean()*100:.2f}%")
print(f"\nSolvents with DRFP: {list(drfp.index)}")

DRFP shape: (24, 2048)
Sparsity: 97.43%

Solvents with DRFP: ['Methanol', 'Ethylene Glycol [1,2-Ethanediol]', '1,1,1,3,3,3-Hexafluoropropan-2-ol', '2-Methyltetrahydrofuran [2-MeTHF]', 'Cyclohexane', 'IPA [Propan-2-ol]', 'Water.Acetonitrile', 'Acetonitrile', 'Acetonitrile.Acetic Acid', 'Diethyl Ether [Ether]', '2,2,2-Trifluoroethanol', 'Water.2,2,2-Trifluoroethanol', 'DMA [N,N-Dimethylacetamide]', 'Decanol', 'Ethanol', 'THF [Tetrahydrofuran]', 'Dihydrolevoglucosenone (Cyrene)', 'Ethyl Acetate', 'MTBE [tert-Butylmethylether]', 'Butanone [MEK]', 'tert-Butanol [2-Methylpropan-2-ol]', 'Dimethyl Carbonate', 'Methyl Propionate', 'Ethyl Lactate']


In [10]:
# Check which solvents are in our data vs DRFP lookup
single_solvents = set(df_single['SOLVENT NAME'].unique())
full_solvents_a = set(df_full['SOLVENT A NAME'].unique())
full_solvents_b = set(df_full['SOLVENT B NAME'].unique())
all_data_solvents = single_solvents | full_solvents_a | full_solvents_b

drfp_solvents = set(drfp.index)

print(f"Solvents in data: {len(all_data_solvents)}")
print(f"Solvents in DRFP: {len(drfp_solvents)}")
print(f"\nMissing from DRFP: {all_data_solvents - drfp_solvents}")
print(f"Extra in DRFP: {drfp_solvents - all_data_solvents}")

Solvents in data: 24
Solvents in DRFP: 24

Missing from DRFP: set()
Extra in DRFP: set()


## Conclusions

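1. **Variance hypothesis rejected**: Determinism didn't close the gap. LightGBM (LB 0.1065) performed worse than the MLP (LB 0.0982), so model variance does not explain the CV-LB gap.

2. **MLP matches reference**: Our MLP (LB 0.0982) matches the reference kernel (LB 0.09831) almost exactly, which suggests the implementation is correct.

3. **Target looks achievable**: The GNN benchmark reports an MSE of 0.0039, far below the target of 0.0333, assuming that figure is comparable to the LB metric.

4. **DRFP features are the most promising lead**: The GNN benchmark used DRFP (2048-dim) features, and the DRFP lookup covers all 24 solvents in our data. We should try these.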

5. **Next steps**:
   - Try DRFP features with MLP
   - Try DRFP features with LightGBM
   - Consider dimensionality reduction (PCA) for DRFP
   - Explore ensemble approaches
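
For the PCA step, note that the DRFP lookup has only 24 rows, so after centering it can have at most 23 non-zero principal components however many of the 2048 columns are kept. The sketch below shows this on a synthetic binary matrix with similar shape and sparsity. It does not use the real DRFP file, and the choice of 10 components is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the DRFP lookup: 24 solvents x 2048 bits, ~97% zeros.
X = (rng.random((24, 2048)) < 0.03).astype(float)

# Drop columns that never vary across solvents; they carry no information.
varying = X.std(axis=0) > 0
X_var = X[:, varying]

# PCA via SVD on the column-centered matrix.
X_centered = X_var - X_var.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained = S**2 / np.sum(S**2)

n_components = 10                      # arbitrary choice for illustration
Z = X_centered @ Vt[:n_components].T   # reduced features, one row per solvent

rank = int(np.sum(S > 1e-10))
print(f"varying columns: {X_var.shape[1]} of {X.shape[1]}")
print(f"non-zero components: {rank} (at most n_solvents - 1)")
print(f"variance kept by {n_components} components: {explained[:n_components].sum():.2%}")
```

Under leave-one-solvent-out evaluation, fitting the projection on all 24 solvents lets the held-out solvent's features shape the components. Fitting it inside each fold on the training solvents only is the stricter option.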