# Loop 4 Analysis: Understanding CV-LB Gap and Target

Key questions:
1. Why is there a roughly 10x gap between CV (0.010) and LB (0.0998)?
2. How can we reach the target of 0.017270 when best public LB is ~0.098?
3. What's the actual evaluation metric?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data
DATA_PATH = '/home/data'
full_df = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')
single_df = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')

print(f"Full data: {len(full_df)} samples")
print(f"Single solvent: {len(single_df)} samples")
print(f"Total: {len(full_df) + len(single_df)} samples")

Full data: 1227 samples
Single solvent: 656 samples
Total: 1883 samples


In [2]:
# Analyze the submission format and scoring
# The competition says: "Submissions will be evaluated according to a cross-validation procedure"
# The metric is 'catechol_hackathon_metric'

# Let's look at our submission
submission = pd.read_csv('/home/submission/submission.csv')
print("Submission shape:", submission.shape)
print("\nSubmission columns:", submission.columns.tolist())
print("\nSubmission head:")
print(submission.head(10))

Submission shape: (1883, 8)

Submission columns: ['id', 'index', 'task', 'fold', 'row', 'target_1', 'target_2', 'target_3']

Submission head:
   id  index  task  fold  row  target_1  target_2  target_3
0   0      0     0     0    0  0.007449  0.007649  0.864090
1   1      1     0     0    1  0.015693  0.015412  0.855874
2   2      2     0     0    2  0.044828  0.041933  0.766220
3   3      3     0     0    3  0.081143  0.070521  0.651426
4   4      4     0     0    4  0.109361  0.087495  0.561279
5   5      5     0     0    5  0.119963  0.090349  0.541460
6   6      6     0     0    6  0.120002  0.093689  0.539822
7   7      7     0     0    7  0.119996  0.093778  0.540977
8   8      8     0     0    8  0.120203  0.084236  0.523479
9   9      9     0     0    9  0.120203  0.084236  0.523479


In [3]:
# Check task distribution
print("Task distribution:")
print(submission['task'].value_counts())

# Task 0 = single solvent (656 samples)
# Task 1 = full data (1227 samples)
print(f"\nTask 0 (single): {len(submission[submission['task']==0])} rows")
print(f"Task 1 (full): {len(submission[submission['task']==1])} rows")

Task distribution:
task
1    1227
0     656
Name: count, dtype: int64

Task 0 (single): 656 rows
Task 1 (full): 1227 rows


In [4]:
# Key insight: The LB score of 0.0998 is close to 0.1
# Our CV MSE is 0.010298
# If LB is MSE, then 0.0998 vs 0.010 is a 10x gap
# If LB is RMSE, then sqrt(0.010) = 0.1 which matches!

print("CV MSE:", 0.010298)
print("CV RMSE:", np.sqrt(0.010298))
print("LB score:", 0.0998)
print("\nIf LB is RMSE, CV RMSE matches LB!")
print(f"CV RMSE = {np.sqrt(0.010298):.4f} vs LB = 0.0998")

CV MSE: 0.010298
CV RMSE: 0.10147906187977893
LB score: 0.0998

If LB is RMSE, CV RMSE matches LB!
CV RMSE = 0.1015 vs LB = 0.0998
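
One caveat on the match above: it assumes the leaderboard pools errors the same way our CV does, which we have not confirmed. The sketch below uses made-up per-fold MSE values (hypothetical, not taken from our experiments) to show that the square root of the mean fold MSE and the mean of the per-fold RMSEs are different numbers. A gap of a few percent between CV RMSE and LB is therefore still compatible with the RMSE hypothesis.

```python
import math

# Hypothetical per-fold MSE values, for illustration only.
fold_mse = [0.002, 0.006, 0.010, 0.030]

# Option 1: average the fold MSEs, then take the square root.
rmse_of_mean_mse = math.sqrt(sum(fold_mse) / len(fold_mse))

# Option 2: take each fold's RMSE, then average.
mean_of_fold_rmse = sum(math.sqrt(m) for m in fold_mse) / len(fold_mse)

print(f"sqrt(mean fold MSE): {rmse_of_mean_mse:.4f}")
print(f"mean(fold RMSE):     {mean_of_fold_rmse:.4f}")
```

By Jensen's inequality, option 2 is never larger than option 1, so an LB score slightly below our CV RMSE is at least consistent with a per-fold RMSE average. Which aggregation the leaderboard actually uses remains an open question.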


In [5]:
# CRITICAL INSIGHT: The LB metric is likely RMSE, not MSE!
# This would explain the ~10x gap between CV and LB:
# - Our CV MSE = 0.010298
# - Our CV RMSE = sqrt(0.010298) = 0.1015
# - LB score = 0.0998 (very close to our RMSE!)

# So the target of 0.017270 is likely RMSE, not MSE!
# Target RMSE = 0.017270 means Target MSE = 0.017270^2 = 0.000298

target_rmse = 0.017270
target_mse = target_rmse ** 2  # MSE implied by an RMSE target of 0.017270

print(f"Target (if RMSE): {target_rmse}")
print(f"Target (if MSE): {target_mse:.6f}")
print(f"\nOur current CV MSE: 0.010298")
print(f"Our current CV RMSE: {np.sqrt(0.010298):.4f}")
print(f"\nTo reach target RMSE of {target_rmse}:")
print(f"  We need MSE of {target_mse:.6f}")
print(f"  Current MSE is 0.010298")
print(f"  Need to reduce MSE by {0.010298 / target_mse:.1f}x")

Target (if RMSE): 0.01727
Target (if MSE): 0.000298

Our current CV MSE: 0.010298
Our current CV RMSE: 0.1015

To reach target RMSE of 0.01727:
  We need MSE of 0.000298
  Current MSE is 0.010298
  Need to reduce MSE by 34.5x


In [6]:
# Wait - let's reconsider. The target of 0.017270 is MUCH lower than our LB of 0.0998
# If LB is RMSE and target is RMSE, then target is 5.8x better than current best
# This is a HUGE gap

# Let's check whether the target might be MSE instead
# If target is MSE = 0.017270, then target RMSE = sqrt(0.017270) = 0.1314
# But our LB is 0.0998, which is already BETTER than 0.1314
# So if LB is RMSE, an MSE target would mean we've already beaten it, which seems unlikely

print("Scenario 1: Both LB and target are RMSE")
print(f"  LB RMSE: 0.0998")
print(f"  Target RMSE: 0.017270")
print(f"  Gap: {0.0998 / 0.017270:.1f}x")
print()
print("Scenario 2: LB is RMSE, target is MSE")
print(f"  LB RMSE: 0.0998 -> LB MSE: {0.0998**2:.6f}")
print(f"  Target MSE: 0.017270")
print(f"  Our LB MSE ({0.0998**2:.6f}) is BETTER than target MSE (0.017270)!")
print("  This means we've ALREADY beaten the target!")
print()
print("Scenario 3: Both LB and target are MSE")
print(f"  LB MSE: 0.0998")
print(f"  Target MSE: 0.017270")
print(f"  Gap: {0.0998 / 0.017270:.1f}x (target is better)")

Scenario 1: Both LB and target are RMSE
  LB RMSE: 0.0998
  Target RMSE: 0.017270
  Gap: 5.8x

Scenario 2: LB is RMSE, target is MSE
  LB RMSE: 0.0998 -> LB MSE: 0.009960
  Target MSE: 0.017270
  Our LB MSE (0.009960) is BETTER than target MSE (0.017270)!
  This means we've ALREADY beaten the target!

Scenario 3: Both LB and target are MSE
  LB MSE: 0.0998
  Target MSE: 0.017270
  Gap: 5.8x (target is better)


In [7]:
# Let's verify by checking the actual predictions vs actuals
# Load our predictions and compare to ground truth

# Single solvent data
X_single = single_df[['Residence Time', 'Temperature', 'SOLVENT NAME']]
Y_single = single_df[['Product 2', 'Product 3', 'SM']]

# Full data
X_full = full_df[['Residence Time', 'Temperature', 'SOLVENT A NAME', 'SOLVENT B NAME', 'SolventB%']]
Y_full = full_df[['Product 2', 'Product 3', 'SM']]

print("Single solvent targets:")
print(Y_single.describe())
print("\nFull data targets:")
print(Y_full.describe())

Single solvent targets:
        Product 2   Product 3          SM
count  656.000000  656.000000  656.000000
mean     0.149932    0.123380    0.522192
std      0.143136    0.131528    0.360229
min      0.000000    0.000000    0.000000
25%      0.012976    0.009445    0.145001
50%      0.102813    0.078298    0.656558
75%      0.281654    0.193353    0.857019
max      0.463632    0.533768    1.000000

Full data targets:
         Product 2    Product 3           SM
count  1227.000000  1227.000000  1227.000000
mean      0.164626     0.143668     0.495178
std       0.153467     0.145779     0.379425
min       0.000000     0.000000     0.000000
25%       0.012723     0.012260     0.068573
50%       0.117330     0.094413     0.606454
75%       0.308649     0.254630     0.877448
max       0.463632     0.533768     1.083254
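
To put the ~0.10 scores in context, a rough baseline can be derived from the standard deviations above. This is a minimal sketch, assuming the metric averages squared error equally over the three targets (an assumption, since the metric definition is not shown in the template). A constant mean prediction has an MSE equal to the target variance, so its RMSE follows directly from the `describe()` output.

```python
import math

# Standard deviations copied from the describe() output above.
target_std = {
    "single": {"Product 2": 0.143136, "Product 3": 0.131528, "SM": 0.360229},
    "full": {"Product 2": 0.153467, "Product 3": 0.145779, "SM": 0.379425},
}

baseline_rmse = {}
sm_share = {}
for task, stds in target_std.items():
    variances = {name: s ** 2 for name, s in stds.items()}
    # A constant mean prediction has MSE equal to the variance (ignoring the
    # small n/(n-1) factor in pandas' sample standard deviation).
    mean_variance = sum(variances.values()) / len(variances)
    baseline_rmse[task] = math.sqrt(mean_variance)
    # Fraction of the total variance that comes from SM.
    sm_share[task] = variances["SM"] / sum(variances.values())
    print(f"{task}: mean-predictor RMSE ~ {baseline_rmse[task]:.3f}, "
          f"SM share of variance ~ {sm_share[task]:.0%}")
```

This is an in-sample baseline, not a leave-one-solvent-out one, so it is only an order-of-magnitude reference point for the scores discussed in this notebook.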


In [8]:
# The key question: What is the actual evaluation metric?
# Looking at the competition description:
# "Submissions will be evaluated according to a cross-validation procedure"
# "catechol_hackathon_metric"

# The template notebook shows the CV procedure but doesn't show the metric calculation
# Let's check if there's a pattern in the LB scores

# Our submissions:
# exp_000: CV MSE 0.0113, LB 0.0998
# exp_001: CV MSE 0.0110, LB 0.0999

# The LB scores are nearly identical (and move slightly the wrong way) even though CV MSE improved by about 3%
# With only two submissions this is weak evidence, but it hints that the LB is not on the same scale as our CV MSE

print("Submission history:")
print("exp_000: CV MSE 0.0113 -> LB 0.0998")
print("exp_001: CV MSE 0.0110 -> LB 0.0999")
print("exp_003: CV MSE 0.0103 -> LB ???")
print()
print("CV RMSE values:")
print(f"exp_000: {np.sqrt(0.0113):.4f}")
print(f"exp_001: {np.sqrt(0.0110):.4f}")
print(f"exp_003: {np.sqrt(0.0103):.4f}")
print()
print("The CV RMSE values are close to LB scores!")
print("This strongly suggests LB metric is RMSE.")

Submission history:
exp_000: CV MSE 0.0113 -> LB 0.0998
exp_001: CV MSE 0.0110 -> LB 0.0999
exp_003: CV MSE 0.0103 -> LB ???

CV RMSE values:
exp_000: 0.1063
exp_001: 0.1049
exp_003: 0.1015

The CV RMSE values are close to LB scores!
This strongly suggests LB metric is RMSE.
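
The two hypotheses can be compared side by side for each submission. The sketch below uses only the (CV MSE, LB) pairs listed above and reports the relative gap if the LB is RMSE and the ratio if the LB is MSE. With just two data points it is a sanity check, not a proof.

```python
import math

# (CV MSE, LB score) pairs copied from the submission history above.
history = {"exp_000": (0.0113, 0.0998), "exp_001": (0.0110, 0.0999)}

rel_gap_rmse = {}
ratio_mse = {}
for name, (cv_mse, lb) in history.items():
    cv_rmse = math.sqrt(cv_mse)
    rel_gap_rmse[name] = (cv_rmse - lb) / lb  # relative gap if LB is RMSE
    ratio_mse[name] = lb / cv_mse             # LB / CV ratio if LB is MSE
    print(f"{name}: RMSE hypothesis gap = {rel_gap_rmse[name]:+.1%}, "
          f"MSE hypothesis ratio = {ratio_mse[name]:.1f}x")
```

Under the RMSE reading, CV and LB differ by a few percent. Under the MSE reading, they differ by roughly 9x, which is much harder to explain.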


In [9]:
# CRITICAL REALIZATION:
# If LB is RMSE and target is 0.017270, then:
# - Current best LB RMSE: ~0.0998
# - Target RMSE: 0.017270
# - We need to improve by 5.8x

# This is a MASSIVE improvement needed.
# The best public kernels achieve ~0.098 LB
# The target of 0.017270 is 5.7x better

# This suggests either:
# 1. The target is achievable with a fundamentally different approach
# 2. The target represents a different metric/evaluation
# 3. There's information we're not using

# Let's think about what could give such a large improvement:
# - Better features (drfps, fragprints are high-dimensional)
# - Better model architecture
# - Post-processing / calibration
# - Understanding the chemistry better

print("To reach target RMSE of 0.017270:")
print(f"  Current RMSE: ~0.10")
print(f"  Target RMSE: 0.017270")
print(f"  Improvement needed: {0.10 / 0.017270:.1f}x")
print()
print("This is equivalent to:")
print(f"  Current MSE: ~0.01")
print(f"  Target MSE: {0.017270**2:.6f}")
print(f"  Improvement needed: {0.01 / (0.017270**2):.1f}x")

To reach target RMSE of 0.017270:
  Current RMSE: ~0.10
  Target RMSE: 0.017270
  Improvement needed: 5.8x

This is equivalent to:
  Current MSE: ~0.01
  Target MSE: 0.000298
  Improvement needed: 33.5x
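
The two improvement factors above are the same statement in different units: an improvement by a factor of k in RMSE is a factor of k squared in MSE. A short check, using the rounded current RMSE of 0.10 from the cell above:

```python
current_rmse = 0.10      # approximate current RMSE, as used above
target_rmse = 0.017270   # target, read as RMSE

rmse_factor = current_rmse / target_rmse
mse_factor = current_rmse ** 2 / target_rmse ** 2

print(f"RMSE improvement needed: {rmse_factor:.1f}x")
print(f"MSE improvement needed:  {mse_factor:.1f}x")
print(f"RMSE factor squared:     {rmse_factor ** 2:.1f}x")
```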


In [10]:
# Let's analyze per-target errors to understand where the error comes from
# Load our best submission predictions

# Read the stacking ensemble predictions
submission = pd.read_csv('/home/code/experiments/004_stacking_ensemble/submission.csv')
print("Submission shape:", submission.shape)
print(submission.head())

Submission shape: (1883, 8)
   id  index  task  fold  row  target_1  target_2  target_3
0   0      0     0     0    0  0.007449  0.007649  0.864090
1   1      1     0     0    1  0.015693  0.015412  0.855874
2   2      2     0     0    2  0.044828  0.041933  0.766220
3   3      3     0     0    3  0.081143  0.070521  0.651426
4   4      4     0     0    4  0.109361  0.087495  0.561279


In [11]:
# The submission format is:
# id, index, task, fold, row, target_1, target_2, target_3
# where target_1=Product 2, target_2=Product 3, target_3=SM

# We need to match predictions to actuals by fold and row
# This is complex because the fold/row indices depend on the CV split

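# Instead, let's look at the distribution of predictions
# Note: the prediction statistics cover both tasks (1883 rows), while the actuals printed below are single-solvent only (656 rows)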
print("Prediction statistics:")
print(f"target_1 (Product 2): mean={submission['target_1'].mean():.4f}, std={submission['target_1'].std():.4f}")
print(f"target_2 (Product 3): mean={submission['target_2'].mean():.4f}, std={submission['target_2'].std():.4f}")
print(f"target_3 (SM): mean={submission['target_3'].mean():.4f}, std={submission['target_3'].std():.4f}")
print()
print("Actual statistics (single solvent):")
print(f"Product 2: mean={Y_single['Product 2'].mean():.4f}, std={Y_single['Product 2'].std():.4f}")
print(f"Product 3: mean={Y_single['Product 3'].mean():.4f}, std={Y_single['Product 3'].std():.4f}")
print(f"SM: mean={Y_single['SM'].mean():.4f}, std={Y_single['SM'].std():.4f}")

Prediction statistics:
target_1 (Product 2): mean=0.1607, std=0.1345
target_2 (Product 3): mean=0.1414, std=0.1223
target_3 (SM): mean=0.5096, std=0.3519

Actual statistics (single solvent):
Product 2: mean=0.1499, std=0.1431
Product 3: mean=0.1234, std=0.1315
SM: mean=0.5222, std=0.3602
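
The prediction statistics above cover both tasks, so a fairer reference is the pooled mean of the actuals over both datasets. The sketch below computes it from the row counts and means already printed in this notebook. It assumes the 1883 submission rows correspond to exactly the 656 single-solvent and 1227 full-data samples, which the task counts suggest but which has not been checked row by row.

```python
# Row counts and target means copied from the outputs above.
n_single, n_full = 656, 1227
single_mean = {"Product 2": 0.149932, "Product 3": 0.123380, "SM": 0.522192}
full_mean = {"Product 2": 0.164626, "Product 3": 0.143668, "SM": 0.495178}
pred_mean = {"Product 2": 0.1607, "Product 3": 0.1414, "SM": 0.5096}

pooled_mean = {}
for name in pred_mean:
    # Sample-size-weighted mean of the actuals across both tasks.
    pooled_mean[name] = (
        n_single * single_mean[name] + n_full * full_mean[name]
    ) / (n_single + n_full)
    print(f"{name}: pooled actual mean = {pooled_mean[name]:.4f}, "
          f"prediction mean = {pred_mean[name]:.4f}")
```

Against the pooled means, the predictions look less biased than the single-solvent comparison suggests. This says nothing about per-row accuracy, only about the overall level of the predictions.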


In [12]:
# Key observation: SM has a much higher spread (std ~0.35-0.36) than the Products (std ~0.12-0.15)
# So SM likely contributes the most to an unnormalized squared-error metric

# Let's think about what could dramatically improve predictions:
# 1. Better SM prediction (highest variance target)
# 2. Using high-dimensional features (drfps, fragprints)
# 3. Regressor chains (predict SM first, use as input for Products)
# 4. Per-solvent models (if some solvents are easier to predict)

print("\n=== STRATEGY RECOMMENDATIONS ===")
print("\n1. SUBMIT exp_003 (stacking ensemble) to verify CV-LB correlation")
print("   - CV MSE: 0.010298 (best so far)")
print("   - Expected LB RMSE: ~0.101")
print()
print("2. If LB improves, continue with ensemble optimization:")
print("   - Try different ensemble weights (not just 50/50)")
print("   - Add more diverse models (LightGBM, XGBoost)")
print("   - Try regressor chains for correlated targets")
print()
print("3. If LB doesn't improve, investigate:")
print("   - High-dimensional features (drfps, fragprints)")
print("   - Per-fold error analysis")
print("   - Different CV strategy")


=== STRATEGY RECOMMENDATIONS ===

1. SUBMIT exp_003 (stacking ensemble) to verify CV-LB correlation
   - CV MSE: 0.010298 (best so far)
   - Expected LB RMSE: ~0.101

2. If LB improves, continue with ensemble optimization:
   - Try different ensemble weights (not just 50/50)
   - Add more diverse models (LightGBM, XGBoost)
   - Try regressor chains for correlated targets

3. If LB doesn't improve, investigate:
   - High-dimensional features (drfps, fragprints)
   - Per-fold error analysis
   - Different CV strategy
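
The "different ensemble weights" idea in recommendation 2 can be sketched as a simple grid sweep over the blend weight. Everything below is synthetic: `y_true`, `pred_a` and `pred_b` are made-up stand-ins for out-of-fold targets and two models' out-of-fold predictions, not data from our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for out-of-fold targets and two models' predictions.
# All values are made up for illustration.
y_true = rng.uniform(0.0, 1.0, size=(200, 3))
pred_a = y_true + rng.normal(0.0, 0.10, size=y_true.shape)
pred_b = y_true + rng.normal(0.0, 0.12, size=y_true.shape)

# Sweep the weight on model A from 0 to 1 in steps of 0.1.
weights = np.linspace(0.0, 1.0, 11)
mse_by_weight = np.array([
    np.mean((w * pred_a + (1.0 - w) * pred_b - y_true) ** 2) for w in weights
])

best_idx = int(np.argmin(mse_by_weight))
best_w = float(weights[best_idx])
print(f"best weight on model A: {best_w:.1f}, MSE: {mse_by_weight[best_idx]:.5f}")
print(f"50/50 blend MSE:        {mse_by_weight[5]:.5f}")
```

In a real run the weight should be chosen on out-of-fold predictions only; otherwise the tuned blend will look better in CV than it is.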
