# Loop 1 LB Feedback Analysis

## Critical Issue: CV 0.0113 vs LB 0.0998 (9x gap!)

The LB score is roughly nine times worse than the CV score (0.0998 / 0.0113 ≈ 8.8). A gap this large needs investigation before the CV score can be trusted.

### Possible Causes:
1. **Submission format issue** - The notebook may not follow the template structure
2. **Data leakage in CV** - Our CV might be overly optimistic
3. **Distribution shift** - Test data differs significantly from training
4. **Overfitting** - Model memorized training patterns that don't generalize
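
Cause 2 can be tested directly when the folds are meant to hold out whole groups: no group should end up in more than one held-out fold. The following is a minimal sketch, assuming hypothetical `group` and `fold` column names (they are not taken from the competition data), where `fold` is the held-out fold each row is assigned to.

```python
import pandas as pd


def groups_split_across_folds(df, group_col, fold_col):
    """Return the groups that appear in more than one held-out fold."""
    folds_per_group = df.groupby(group_col)[fold_col].nunique()
    return folds_per_group[folds_per_group > 1].index.tolist()


# Toy data with hypothetical column names: group 'B' is spread over two folds,
# so part of 'B' would sit in the training set while the rest is held out.
toy = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C'],
    'fold':  [0,   0,   1,   2,   2],
})
print(groups_split_across_folds(toy, 'group', 'fold'))
```

An empty list means every group is confined to a single fold, which rules out this particular form of leakage but not others (for example, preprocessing fitted on the full dataset).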

In [1]:
import pandas as pd
import numpy as np

# Load our submission
submission = pd.read_csv('/home/code/experiments/001_baseline/submission.csv')
print('Our submission shape:', submission.shape)
print('\nColumns:', submission.columns.tolist())
print('\nFirst 5 rows:')
print(submission.head())

Our submission shape: (1883, 8)

Columns: ['id', 'index', 'task', 'fold', 'row', 'target_1', 'target_2', 'target_3']

First 5 rows:
   id  index  task  fold  row  target_1  target_2  target_3
0   0      0     0     0    0  0.008308  0.011011  0.931947
1   1      1     0     0    1  0.018124  0.021161  0.886760
2   2      2     0     0    2  0.040722  0.039450  0.796149
3   3      3     0     0    3  0.070126  0.057600  0.702190
4   4      4     0     0    4  0.093846  0.070580  0.628814


In [2]:
# The template produces: id, index, task, fold, row, target_1, target_2, target_3
# Check how the rows are distributed across tasks and folds
print('\nTask distribution:')
print(submission['task'].value_counts())

print('\nFold distribution for task 0 (single solvent):')
print(submission[submission['task']==0]['fold'].value_counts().sort_index())

print('\nFold distribution for task 1 (full data):')
print(submission[submission['task']==1]['fold'].value_counts().sort_index())


Task distribution:
task
1    1227
0     656
Name: count, dtype: int64

Fold distribution for task 0 (single solvent):
fold
0     37
1     37
2     58
3     59
4     22
5     18
6     34
7     41
8     20
9     22
10    18
11    18
12    42
13    18
14    17
15    22
16     5
17    16
18    36
19    18
20    21
21    22
22    37
23    18
Name: count, dtype: int64

Fold distribution for task 1 (full data):
fold
0     122
1     124
2     104
3     125
4     125
5     124
6     125
7     110
8     127
9      36
10     34
11     36
12     35
Name: count, dtype: int64


In [3]:
# Check target ranges
print('\nTarget statistics:')
for col in ['target_1', 'target_2', 'target_3']:
    print(f'{col}: min={submission[col].min():.4f}, max={submission[col].max():.4f}, mean={submission[col].mean():.4f}')


Target statistics:
target_1: min=0.0001, max=0.4477, mean=0.1626
target_2: min=0.0001, max=0.4257, mean=0.1392
target_3: min=0.0000, max=0.9983, mean=0.5162
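
The checks in the first three cells can be bundled into one reusable function. This is a sketch, not the competition's own validator: the column list is copied from the submission printed above, while the `[0, 1]` bound and the unique-`id` requirement are assumptions.

```python
import pandas as pd

# Column layout copied from the submission printed above.
EXPECTED_COLUMNS = ['id', 'index', 'task', 'fold', 'row',
                    'target_1', 'target_2', 'target_3']
TARGET_COLUMNS = ['target_1', 'target_2', 'target_3']


def validate_submission(df):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if df.columns.tolist() != EXPECTED_COLUMNS:
        problems.append(f'unexpected columns: {df.columns.tolist()}')
        return problems  # the remaining checks need the expected columns
    if df[TARGET_COLUMNS].isna().any().any():
        problems.append('NaN values in target columns')
    # ASSUMPTION: targets are fractions and should stay within [0, 1].
    if ((df[TARGET_COLUMNS] < 0) | (df[TARGET_COLUMNS] > 1)).any().any():
        problems.append('targets outside [0, 1]')
    # ASSUMPTION: ids must be unique.
    if df['id'].duplicated().any():
        problems.append('duplicate ids')
    return problems
```

Calling `validate_submission(submission)` on the frame loaded in the first cell would be expected to return an empty list if the observations so far hold.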


In [4]:
# Load actual data to compare
DATA_PATH = '/home/data'

single_df = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
full_df = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

print('Single solvent data shape:', single_df.shape)
print('Full data shape:', full_df.shape)

print('\nActual target statistics (single solvent):')
for col in ['Product 2', 'Product 3', 'SM']:
    print(f'{col}: min={single_df[col].min():.4f}, max={single_df[col].max():.4f}, mean={single_df[col].mean():.4f}')

Single solvent data shape: (656, 13)
Full data shape: (1227, 19)

Actual target statistics (single solvent):
Product 2: min=0.0000, max=0.4636, mean=0.1499
Product 3: min=0.0000, max=0.5338, mean=0.1234
SM: min=0.0000, max=1.0000, mean=0.5222
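
The prediction statistics printed earlier cover both tasks, while the actual statistics here cover the single solvent data only, so the two are not strictly comparable. A like-for-like comparison could look like the sketch below. It assumes that `target_1`, `target_2`, and `target_3` correspond to `Product 2`, `Product 3`, and `SM` in that order; the mapping is inferred from the column order used in this notebook and is not confirmed by the template.

```python
import pandas as pd

# ASSUMPTION: target_1/2/3 map to these columns, in this order.
COLUMN_MAP = {'target_1': 'Product 2', 'target_2': 'Product 3', 'target_3': 'SM'}


def compare_single_solvent(submission, single_df, column_map=COLUMN_MAP):
    """Line up prediction and actual statistics for task 0 (single solvent) only."""
    preds = submission[submission['task'] == 0]
    rows = []
    for pred_col, actual_col in column_map.items():
        rows.append({
            'prediction': pred_col,
            'actual': actual_col,
            'pred_mean': preds[pred_col].mean(),
            'actual_mean': single_df[actual_col].mean(),
            'pred_max': preds[pred_col].max(),
            'actual_max': single_df[actual_col].max(),
        })
    return pd.DataFrame(rows)
```

With the frames loaded above, `compare_single_solvent(submission, single_df)` returns one row per target, which makes any systematic offset in mean or maximum easy to spot.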


In [5]:
# Check that the submission has the expected number of rows
# Single solvent: 656 samples, 24 folds (leave-one-out over groups, presumably one solvent per fold)
# Full data: 1227 samples, 13 folds (leave-one-ramp-out)

print('Expected rows:')
print(f'  Single solvent: 656')
print(f'  Full data: 1227')
print(f'  Total: 1883')

print('\nActual rows in submission:')
print(f'  Task 0 (single): {len(submission[submission["task"]==0])}')
print(f'  Task 1 (full): {len(submission[submission["task"]==1])}')
print(f'  Total: {len(submission)}')

Expected rows:
  Single solvent: 656
  Full data: 1227
  Total: 1883

Actual rows in submission:
  Task 0 (single): 656
  Task 1 (full): 1227
  Total: 1883


In [6]:
# The competition likely expects predictions in a specific order.
# This cell only checks per-fold sample counts; it does not verify row ordering.

# For single solvent, folds average ~27 samples (656/24), but sizes vary by fold
print('\nSamples per fold (single solvent):')
for fold in range(24):
    count = len(submission[(submission['task']==0) & (submission['fold']==fold)])
    print(f'  Fold {fold}: {count} samples')

print('\nSamples per fold (full data):')
for fold in range(13):
    count = len(submission[(submission['task']==1) & (submission['fold']==fold)])
    print(f'  Fold {fold}: {count} samples')


Samples per fold (single solvent):
  Fold 0: 37 samples
  Fold 1: 37 samples
  Fold 2: 58 samples
  Fold 3: 59 samples
  Fold 4: 22 samples
  Fold 5: 18 samples
  Fold 6: 34 samples
  Fold 7: 41 samples
  Fold 8: 20 samples
  Fold 9: 22 samples
  Fold 10: 18 samples
  Fold 11: 18 samples
  Fold 12: 42 samples
  Fold 13: 18 samples
  Fold 14: 17 samples
  Fold 15: 22 samples
  Fold 16: 5 samples
  Fold 17: 16 samples
  Fold 18: 36 samples
  Fold 19: 18 samples
  Fold 20: 21 samples
  Fold 21: 22 samples
  Fold 22: 37 samples
  Fold 23: 18 samples

Samples per fold (full data):
  Fold 0: 122 samples
  Fold 1: 124 samples
  Fold 2: 104 samples
  Fold 3: 125 samples
  Fold 4: 125 samples
  Fold 5: 124 samples
  Fold 6: 125 samples
  Fold 7: 110 samples
  Fold 8: 127 samples
  Fold 9: 36 samples
  Fold 10: 34 samples
  Fold 11: 36 samples
  Fold 12: 35 samples
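
The counts above confirm the fold sizes but say nothing about ordering within a fold. One way to check that is sketched below, assuming that `row` is expected to restart at 0 in each (task, fold) pair and increase by 1. That assumption matches the first five rows printed earlier, but the template does not confirm it.

```python
import pandas as pd


def find_row_gaps(df):
    """Return (task, fold) pairs whose `row` values are not 0, 1, ..., n-1 in order."""
    bad = []
    for (task, fold), group in df.groupby(['task', 'fold'], sort=True):
        expected = list(range(len(group)))
        if group['row'].tolist() != expected:
            bad.append((int(task), int(fold)))
    return bad


# Toy example: fold 1 has its two rows swapped.
toy = pd.DataFrame({
    'task': [0, 0, 0, 0],
    'fold': [0, 0, 1, 1],
    'row':  [0, 1, 1, 0],
})
print(find_row_gaps(toy))
```

An empty result on the real submission would show that rows are contiguous and in order within each fold, though it still would not prove that they line up with the rows the scorer expects.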


## Key Observations

1. **The submission format looks correct** - 1883 rows total (656 + 1227), with the eight columns the template produces
2. **The target ranges look reasonable** - all predictions lie between 0 and 1 (maximum 0.9983)
3. **The fold structure looks correct** - 24 folds for single solvent, 13 for full data

### The Real Issue: CV-LB Gap

The CV-LB gap (0.0113 vs 0.0998) points to one or more of the following:

1. **Our local CV may not be representative of the actual test set**
2. **The competition might use a different evaluation method**
3. **There might be a bug in how we compute CV vs how Kaggle computes it**
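
One concrete way points 2 and 3 can arise is through aggregation: with very unequal fold sizes, pooling all squared errors gives a different number than averaging per-fold MSEs. The sketch below illustrates the effect with made-up residuals. Only the fold sizes (59 and 5) come from the single solvent fold counts above, and nothing here is known about how the leaderboard actually aggregates.

```python
import numpy as np


def pooled_mse(errors_by_fold):
    """MSE over all residuals, so large folds dominate."""
    all_errors = np.concatenate(errors_by_fold)
    return float(np.mean(all_errors ** 2))


def mean_of_fold_mse(errors_by_fold):
    """Average of per-fold MSEs, so every fold counts equally."""
    return float(np.mean([np.mean(e ** 2) for e in errors_by_fold]))


# Toy residuals (invented): a large, easy fold and a small, hard fold.
errors = [np.full(59, 0.05), np.full(5, 0.40)]
print('pooled MSE:      ', pooled_mse(errors))
print('mean of fold MSE:', mean_of_fold_mse(errors))
```

In this toy case the per-fold average is several times larger than the pooled value, because the small, hard fold carries as much weight as the large, easy one. Computing both on the real out-of-fold predictions would show whether aggregation alone can account for part of the gap.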

### Looking at Public Kernels

The best public kernel (Arrhenius Kinetics + TTA) achieves **0.09831** on LB, within about 1.5% of our LB score of 0.0998. This suggests:

1. Our model performs on par with the best public kernel
2. The CV score of 0.0113 was likely computed incorrectly or is not comparable to LB
3. The target of 0.017270 is far below the best public LB score, so reaching it will likely require a different approach rather than incremental tuning

In [7]:
# Let's understand what the competition is actually measuring
# An LB score of ~0.098-0.10 seems typical for good public models
# Our CV of 0.0113 was computed as MSE across all predictions

# Put the numbers side by side (nothing is recomputed here)
print('Understanding the scoring:')
print('\nOur CV MSE: 0.011303')
print('Our LB MSE: 0.0998')
print('Best public kernel LB: 0.09831')
print('\nTarget to beat: 0.017270')

print('\n\nThe gap suggests our CV calculation might be different from LB calculation.')
print('Possible issues:')
print('1. LB might weight tasks differently')
print('2. LB might use different folds')
print('3. Our CV might have data leakage')

Understanding the scoring:

Our CV MSE: 0.011303
Our LB MSE: 0.0998
Best public kernel LB: 0.09831

Target to beat: 0.017270


The gap suggests our CV calculation might be different from LB calculation.
Possible issues:
1. LB might weight tasks differently
2. LB might use different folds
3. Our CV might have data leakage
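
A quick arithmetic check on the numbers above can help narrow down the hypotheses. The sketch below computes the LB/CV ratio and the square root of the CV MSE. Whether the leaderboard reports a root-type metric is not known from anything in this notebook; it is only one hypothesis to rule in or out.

```python
import math

cv_mse = 0.011303   # CV score reported above
lb_score = 0.0998   # leaderboard score reported above

gap_ratio = lb_score / cv_mse
# HYPOTHESIS ONLY: what the CV score would be if expressed on a root scale.
cv_rmse = math.sqrt(cv_mse)

print(f'LB / CV ratio: {gap_ratio:.2f}')
print(f'sqrt(CV MSE):  {cv_rmse:.4f}')
```

If the root-scale CV value lands near the LB score, the metric definition deserves a closer look before data leakage or fold weighting is blamed.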
