# Loop 36 Analysis: Critical Assessment

**Situation:**
- 36 experiments completed
- 11 submissions made, 5 remaining
- Best LB: 0.0877 (exp_030)
- Target: 0.0347
- Gap: 2.53x (153% worse)

**Latest Experiment (exp_038):**
- Minimal features (8 vs 145): 19.91% WORSE (CV 0.009825 vs baseline 0.008194)
- CONFIRMS that DRFP and other features ARE valuable

**Key Question:** What approaches haven't we tried that could change the CV-LB relationship?

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# All submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
]

df = pd.DataFrame(submissions)
print('All submissions:')
print(df.to_string(index=False))

All submissions:
    exp     cv     lb
exp_000 0.0111 0.0982
exp_001 0.0123 0.1065
exp_003 0.0105 0.0972
exp_005 0.0104 0.0969
exp_006 0.0097 0.0946
exp_007 0.0093 0.0932
exp_009 0.0092 0.0936
exp_012 0.0090 0.0913
exp_024 0.0087 0.0893
exp_026 0.0085 0.0887
exp_030 0.0083 0.0877


In [5]:
# Linear regression on CV-LB relationship
cv_vals = df['cv'].values
lb_vals = df['lb'].values

slope, intercept, r_value, p_value, std_err = stats.linregress(cv_vals, lb_vals)

print('\n=== CV-LB Linear Fit ===')
print(f'LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'Slope: {slope:.2f}')
print(f'Intercept: {intercept:.4f}')

# What CV would we need to reach target?
target = 0.0347
required_cv = (target - intercept) / slope
print('\n=== Target Analysis ===')
print(f'Target LB: {target}')
print(f'Required CV to reach target: {required_cv:.4f}')

# A negative required CV is impossible (MSE >= 0), so the fitted line cannot reach the target
if required_cv < 0:
    print('\n⚠️ With current CV-LB relationship, target appears unreachable!')
    print(f'Intercept ({intercept:.4f}) > Target ({target})')
    print(f'Even with CV=0, predicted LB would be {intercept:.4f}')
    print('\nBUT: The target IS reachable - we need to CHANGE the relationship!')


=== CV-LB Linear Fit ===
LB = 4.30 * CV + 0.0524
R² = 0.9675
Slope: 4.30
Intercept: 0.0524

=== Target Analysis ===
Target LB: 0.0347
Required CV to reach target: -0.0041

⚠️ With current CV-LB relationship, target appears unreachable!
Intercept (0.0524) > Target (0.0347)
Even with CV=0, predicted LB would be 0.0524

BUT: The target IS reachable - we need to CHANGE the relationship!
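
The fit above extrapolates to CV = 0, well outside the observed CV range (0.0083 to 0.0123), so it helps to see how uncertain the intercept is before treating it as a hard floor. A minimal sketch, using only the standard library and the 11 submissions listed earlier. The 95% t critical value for 9 degrees of freedom (2.262) is hard-coded, and the usual OLS assumptions (independent, homoscedastic residuals) are assumed rather than verified:

```python
import math

# (cv, lb) pairs copied from the submission history above
pairs = [
    (0.0111, 0.0982), (0.0123, 0.1065), (0.0105, 0.0972), (0.0104, 0.0969),
    (0.0097, 0.0946), (0.0093, 0.0932), (0.0092, 0.0936), (0.0090, 0.0913),
    (0.0087, 0.0893), (0.0085, 0.0887), (0.0083, 0.0877),
]

def ols_with_intercept_ci(points, t_crit):
    """Ordinary least squares y = a*x + b plus a confidence interval on b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    resid_var = sse / (n - 2)
    # Standard error of the intercept
    se_intercept = math.sqrt(resid_var * (1.0 / n + mean_x ** 2 / sxx))
    half_width = t_crit * se_intercept
    return slope, intercept, (intercept - half_width, intercept + half_width)

T_CRIT_95_DF9 = 2.262  # two-sided 95% critical value of Student's t, 9 d.o.f.
slope, intercept, (lo, hi) = ols_with_intercept_ci(pairs, T_CRIT_95_DF9)
print(f"LB = {slope:.2f} * CV + {intercept:.4f}")
print(f"95% CI for intercept: [{lo:.4f}, {hi:.4f}]")
```

If the target (0.0347) lies below the lower bound of this interval, the "unreachable under the current relationship" reading holds even after allowing for fit uncertainty.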


In [6]:
# Summary of all approaches tried
print('\n=== SUMMARY OF ALL APPROACHES TRIED ===')
approaches = {
    'MLP architectures': [
        ('exp_000', 'baseline [128,128,64]', 0.0111, 0.0982),
        ('exp_004', 'deep residual - FAILED', 0.0519, 'N/A'),
        ('exp_006', 'simpler [64,32]', 0.0097, 0.0946),
        ('exp_007', 'even simpler [32,16]', 0.0093, 0.0932),
    ],
    'Feature engineering': [
        ('exp_002', 'DRFP with PCA - WORSE', 0.0169, 'N/A'),
        ('exp_003', 'Spange+DRFP combined', 0.0105, 0.0972),
        ('exp_027', 'simple features (23) - WORSE', 0.0091, 'N/A'),
        ('exp_038', 'minimal features (8) - WORSE', 0.0098, 'N/A'),
    ],
    'Ensemble methods': [
        ('exp_005', '15 models', 0.0104, 0.0969),
        ('exp_011', 'diverse ensemble', 0.0090, 'N/A'),
        ('exp_012', 'simple ensemble', 0.0090, 0.0913),
        ('exp_028', '4-model ensemble - WORSE', 0.0087, 'N/A'),
    ],
    'GP models': [
        ('exp_030', 'GP+MLP+LGBM (0.15/0.55/0.3)', 0.0083, 0.0877),
        ('exp_031', 'higher GP weight', 0.0082, 'N/A'),
        ('exp_032', 'pure GP - WORSE', 0.0394, 'N/A'),
        ('exp_035', 'lower GP weight', 0.0082, 'N/A'),
    ],
    'Regularization': [
        ('exp_033', 'Ridge regression - WORSE', 0.0225, 'N/A'),
        ('exp_034', 'Kernel Ridge - WORSE', 0.0172, 'N/A'),
    ],
    'Distribution shift': [
        ('exp_037', 'similarity weighting - FAILED', 0.0263, 'N/A'),
    ],
}

for category, exps in approaches.items():
    print(f'\n{category}:')
    for exp, desc, cv, lb in exps:
        lb_str = f'LB {lb}' if lb != 'N/A' else 'not submitted'
        print(f'  {exp}: CV {cv:.4f} ({lb_str}) - {desc}')


=== SUMMARY OF ALL APPROACHES TRIED ===

MLP architectures:
  exp_000: CV 0.0111 (LB 0.0982) - baseline [128,128,64]
  exp_004: CV 0.0519 (not submitted) - deep residual - FAILED
  exp_006: CV 0.0097 (LB 0.0946) - simpler [64,32]
  exp_007: CV 0.0093 (LB 0.0932) - even simpler [32,16]

Feature engineering:
  exp_002: CV 0.0169 (not submitted) - DRFP with PCA - WORSE
  exp_003: CV 0.0105 (LB 0.0972) - Spange+DRFP combined
  exp_027: CV 0.0091 (not submitted) - simple features (23) - WORSE
  exp_038: CV 0.0098 (not submitted) - minimal features (8) - WORSE

Ensemble methods:
  exp_005: CV 0.0104 (LB 0.0969) - 15 models
  exp_011: CV 0.0090 (not submitted) - diverse ensemble
  exp_012: CV 0.0090 (LB 0.0913) - simple ensemble
  exp_028: CV 0.0087 (not submitted) - 4-model ensemble - WORSE

GP models:
  exp_030: CV 0.0083 (LB 0.0877) - GP+MLP+LGBM (0.15/0.55/0.3)
  exp_031: CV 0.0082 (not submitted) - higher GP weight
  exp_032: CV 0.0394 (not submitted) - pure GP - WORSE
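  exp_035: CV 0.0082 (not submitted) - lower GP weight

Regularization:
  exp_033: CV 0.0225 (not submitted) - Ridge regression - WORSE
  exp_034: CV 0.0172 (not submitted) - Kernel Ridge - WORSE

Distribution shift:
  exp_037: CV 0.0263 (not submitted) - similarity weighting - FAILED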

In [7]:
# What approaches HAVEN'T we tried?
print('\n=== APPROACHES NOT YET TRIED ===')
print('''
1. **Different CV scheme** (GroupKFold instead of Leave-One-Out)
   - The "mixall" kernel uses GroupKFold(5) and claims "good CV/LB"
   - BUT: Competition rules require the template's CV scheme
   - VERDICT: Cannot change CV scheme

2. **Per-target optimization**
   - Train separate models for SM, Product 2, Product 3
   - Each target might have different optimal features/hyperparameters
   - VERDICT: Worth trying

3. **Prediction post-processing**
   - Clip predictions to [0, 1]
   - Normalize so SM + Product 2 + Product 3 = 1?
   - VERDICT: Already clipping, but normalization might help

4. **Different base models**
   - CatBoost (handles categorical features natively)
   - TabNet (attention-based tabular model)
   - VERDICT: Worth trying

5. **Feature interactions**
   - Polynomial features
   - Interaction terms between kinetics and solvent properties
   - VERDICT: Worth trying

6. **Uncertainty-aware predictions**
   - Use GP uncertainty to weight predictions
   - Bayesian Neural Network
   - VERDICT: Partially tried with GP, could explore more

7. **Data augmentation beyond TTA**
   - Add noise to training data
   - Mixup augmentation
   - VERDICT: Worth trying
''')


=== APPROACHES NOT YET TRIED ===

1. **Different CV scheme** (GroupKFold instead of Leave-One-Out)
   - The "mixall" kernel uses GroupKFold(5) and claims "good CV/LB"
   - BUT: Competition rules require the template's CV scheme
   - VERDICT: Cannot change CV scheme

2. **Per-target optimization**
   - Train separate models for SM, Product 2, Product 3
   - Each target might have different optimal features/hyperparameters
   - VERDICT: Worth trying

3. **Prediction post-processing**
   - Clip predictions to [0, 1]
   - Normalize so SM + Product 2 + Product 3 = 1?
   - VERDICT: Already clipping, but normalization might help

4. **Different base models**
   - CatBoost (handles categorical features natively)
   - TabNet (attention-based tabular model)
   - VERDICT: Worth trying

5. **Feature interactions**
   - Polynomial features
   - Interaction terms between kinetics and solvent properties
   - VERDICT: Worth trying

6. **Uncertainty-aware predictions**
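   - Use GP uncertainty to weight predictions
   - Bayesian Neural Network
   - VERDICT: Partially tried with GP, could explore more

7. **Data augmentation beyond TTA**
   - Add noise to training data
   - Mixup augmentation
   - VERDICT: Worth trying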
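
Item 5 (feature interactions) can be sketched as pairwise products between kinetic variables and solvent descriptors. The column names below (`residence_time`, `temperature`, `polarity`, `h_bond_donor`) are hypothetical placeholders, not the dataset's actual columns, and the helper is illustrative rather than part of any experiment above:

```python
from itertools import product

def add_interactions(row, kinetic_cols, solvent_cols):
    """Return a copy of `row` with one product feature per (kinetic, solvent) pair."""
    out = dict(row)
    for k, s in product(kinetic_cols, solvent_cols):
        out[f"{k}*{s}"] = row[k] * row[s]
    return out

# Hypothetical single sample; names and values are illustrative only
row = {"residence_time": 2.0, "temperature": 150.0, "polarity": 0.5, "h_bond_donor": 1.0}
expanded = add_interactions(
    row,
    kinetic_cols=["residence_time", "temperature"],
    solvent_cols=["polarity", "h_bond_donor"],
)
print(sorted(k for k in expanded if "*" in k))
```

Restricting the products to kinetic-by-solvent pairs keeps the number of new columns at the product of the two group sizes, rather than the quadratic growth of a full polynomial expansion over all 145 features.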

In [8]:
# Key insight: The CV-LB relationship
print('\n=== KEY INSIGHT: CV-LB RELATIONSHIP ===')
print(f'''
Current relationship: LB = {slope:.2f} * CV + {intercept:.4f}

This means:
- For every 0.001 improvement in CV, LB improves by {slope * 0.001:.4f}
- The intercept ({intercept:.4f}) represents the "base" LB error
- To reach target ({target}), we need to reduce the intercept

What could reduce the intercept?
1. Better generalization to unseen solvents
2. More robust features that don't overfit
3. Different model family with different bias-variance tradeoff

What we've learned:
- Simpler features (8 vs 145) made CV WORSE, not better
- Simpler models (Ridge, Kernel Ridge) made CV WORSE
- GP helps slightly but doesn't change the relationship
- Similarity weighting failed due to implementation bug

Conclusion:
The CV-LB gap is NOT due to overfitting to features or model complexity.
It's likely due to fundamental differences in how the competition evaluates.
''')


=== KEY INSIGHT: CV-LB RELATIONSHIP ===

Current relationship: LB = 4.30 * CV + 0.0524

This means:
- For every 0.001 improvement in CV, LB improves by 0.0043
- The intercept (0.0524) represents the "base" LB error
- To reach target (0.0347), we need to reduce the intercept

What could reduce the intercept?
1. Better generalization to unseen solvents
2. More robust features that don't overfit
3. Different model family with different bias-variance tradeoff

What we've learned:
- Simpler features (8 vs 145) made CV WORSE, not better
- Simpler models (Ridge, Kernel Ridge) made CV WORSE
- GP helps slightly but doesn't change the relationship
- Similarity weighting failed due to implementation bug

Conclusion:
The CV-LB gap is NOT due to overfitting to features or model complexity.
It's likely due to fundamental differences in how the competition evaluates.



In [9]:
# What's different about the competition evaluation?
print('\n=== HYPOTHESIS: COMPETITION EVALUATION DIFFERENCES ===')
print('''
Possible differences between our local CV and competition LB:

1. **Metric calculation**
   - We use MSE, competition might use different metric?
   - Check: Competition says MSE, so this is unlikely

2. **Fold construction**
   - We use Leave-One-Solvent-Out for single solvent
   - We use Leave-One-Ramp-Out for full data
   - Competition might use different folds?
   - Check: Template shows same CV scheme

3. **Data weighting**
   - We weight single solvent and full data equally by sample count
   - Competition might weight differently?
   - Check: Need to verify

4. **Prediction bounds**
   - We clip to [0, 1]
   - Competition might not clip, or might normalize?
   - Check: Need to verify

5. **Random seed effects**
   - Our models have some randomness
   - Competition might use different random state?
   - Check: Unlikely to cause 2.5x gap
''')


=== HYPOTHESIS: COMPETITION EVALUATION DIFFERENCES ===

Possible differences between our local CV and competition LB:

1. **Metric calculation**
   - We use MSE, competition might use different metric?
   - Check: Competition says MSE, so this is unlikely

2. **Fold construction**
   - We use Leave-One-Solvent-Out for single solvent
   - We use Leave-One-Ramp-Out for full data
   - Competition might use different folds?
   - Check: Template shows same CV scheme

3. **Data weighting**
   - We weight single solvent and full data equally by sample count
   - Competition might weight differently?
   - Check: Need to verify

4. **Prediction bounds**
   - We clip to [0, 1]
   - Competition might not clip, or might normalize?
   - Check: Need to verify

5. **Random seed effects**
   - Our models have some randomness
   - Competition might use different random state?
   - Check: Unlikely to cause 2.5x gap
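
Hypothesis 3 (data weighting) can be checked numerically once per-dataset errors are known. The sketch below contrasts a sample-count-weighted average of the two MSEs with an unweighted average. The MSE values and sample counts are made-up placeholders, and which rule the leaderboard actually uses is not established here:

```python
def pooled_mse(mse_by_set, n_by_set):
    """Average weighted by sample count (every row counts equally)."""
    total = sum(n_by_set.values())
    return sum(mse_by_set[k] * n_by_set[k] for k in mse_by_set) / total

def macro_mse(mse_by_set):
    """Unweighted average (every dataset counts equally)."""
    return sum(mse_by_set.values()) / len(mse_by_set)

# Placeholder numbers for illustration only, not measured results
mse_by_set = {"single_solvent": 0.010, "full_data": 0.006}
n_by_set = {"single_solvent": 500, "full_data": 1500}

pooled = pooled_mse(mse_by_set, n_by_set)
macro = macro_mse(mse_by_set)
print(f"pooled: {pooled:.4f}")
print(f"macro:  {macro:.4f}")
```

Any convex weighting of the two errors falls between the smaller and the larger per-dataset MSE. A weighting mismatch alone could therefore explain the CV-LB gap only if one dataset's local MSE were already as large as the LB score.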



In [10]:
# Final recommendation
print('\n' + '='*70)
print('FINAL RECOMMENDATION FOR LOOP 36')
print('='*70)
print(f'''
**SITUATION:**
- 5 submissions remaining
- Best LB: 0.0877 (exp_030)
- Target: 0.0347 (2.53x gap)
- CV-LB relationship: LB = {slope:.2f}*CV + {intercept:.4f}

**WHAT WE'VE LEARNED:**
1. Minimal features (8 vs 145) made CV 19.91% WORSE
   → DRFP and other features ARE valuable
2. All attempts to change CV-LB relationship have failed
3. The gap is structural, not due to overfitting

**PRIORITY 1: Per-Target Optimization**
- Train separate models for SM, Product 2, Product 3
- Each target might have different optimal features/hyperparameters
- This is a fundamentally different approach we haven't tried

**PRIORITY 2: CatBoost with Native Categorical Handling**
- CatBoost handles categorical features (solvent names) natively
- Might capture solvent-specific patterns better
- Could have different CV-LB relationship

**PRIORITY 3: Prediction Normalization**
- Normalize predictions so SM + Product 2 + Product 3 = 1
- Some kernels do this (mr0106/catechol)
- Might improve LB even if CV stays same

**DO NOT TRY:**
- More feature simplification (proven to hurt)
- More regularization (proven to hurt)
- Similarity weighting (implementation too complex)
''')


FINAL RECOMMENDATION FOR LOOP 36

**SITUATION:**
- 5 submissions remaining
- Best LB: 0.0877 (exp_030)
- Target: 0.0347 (2.53x gap)
- CV-LB relationship: LB = 4.30*CV + 0.0524

**WHAT WE'VE LEARNED:**
1. Minimal features (8 vs 145) made CV 19.91% WORSE
   → DRFP and other features ARE valuable
2. All attempts to change CV-LB relationship have failed
3. The gap is structural, not due to overfitting

**PRIORITY 1: Per-Target Optimization**
- Train separate models for SM, Product 2, Product 3
- Each target might have different optimal features/hyperparameters
- This is a fundamentally different approach we haven't tried

**PRIORITY 2: CatBoost with Native Categorical Handling**
- CatBoost handles categorical features (solvent names) natively
- Might capture solvent-specific patterns better
- Could have different CV-LB relationship

**PRIORITY 3: Prediction Normalization**
- Normalize predictions so SM + Product 2 + Product 3 = 1
- Some kernels do this (mr0106/catechol)
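- Might improve LB even if CV stays same

**DO NOT TRY:**
- More feature simplification (proven to hurt)
- More regularization (proven to hurt)
- Similarity weighting (implementation too complex)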
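
Priority 1 amounts to choosing the best configuration separately for each target instead of one configuration for all three. A minimal sketch of that selection step follows. The configuration names and CV scores are hypothetical placeholders, not results from these experiments:

```python
TARGETS = ["SM", "Product 2", "Product 3"]

# Hypothetical per-target CV MSE for each candidate configuration
cv_scores = {
    "mlp_small": {"SM": 0.012, "Product 2": 0.007, "Product 3": 0.008},
    "lgbm":      {"SM": 0.010, "Product 2": 0.008, "Product 3": 0.007},
    "gp_blend":  {"SM": 0.011, "Product 2": 0.006, "Product 3": 0.009},
}

def best_shared(scores):
    """Single configuration with the lowest mean CV across targets."""
    return min(scores, key=lambda cfg: sum(scores[cfg].values()) / len(scores[cfg]))

def best_per_target(scores, targets):
    """For each target, the configuration with the lowest CV on that target."""
    return {t: min(scores, key=lambda cfg: scores[cfg][t]) for t in targets}

shared = best_shared(cv_scores)
per_target = best_per_target(cv_scores, TARGETS)
shared_cv = sum(cv_scores[shared].values()) / len(TARGETS)
per_target_cv = sum(cv_scores[per_target[t]][t] for t in TARGETS) / len(TARGETS)
print(shared, round(shared_cv, 4))
print(per_target, round(per_target_cv, 4))
```

By construction the per-target CV can never be worse than the shared one, so part of any improvement is selection bias. Given the CV-LB relationship fitted earlier, a CV gain from this step is not guaranteed to carry over to the leaderboard.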

In [11]:
# Check if normalization makes sense
print('\n=== CHECKING IF NORMALIZATION MAKES SENSE ===')

# Load actual data
DATA_PATH = '/home/data'
df_single = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')
df_full = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')

# Check if targets sum to 1
df_single['sum'] = df_single['SM'] + df_single['Product 2'] + df_single['Product 3']
df_full['sum'] = df_full['SM'] + df_full['Product 2'] + df_full['Product 3']

print(f'Single Solvent target sums:')
print(f'  Mean: {df_single["sum"].mean():.4f}')
print(f'  Std: {df_single["sum"].std():.4f}')
print(f'  Min: {df_single["sum"].min():.4f}')
print(f'  Max: {df_single["sum"].max():.4f}')

print(f'\nFull Data target sums:')
print(f'  Mean: {df_full["sum"].mean():.4f}')
print(f'  Std: {df_full["sum"].std():.4f}')
print(f'  Min: {df_full["sum"].min():.4f}')
print(f'  Max: {df_full["sum"].max():.4f}')

print(f'\nConclusion: Targets do NOT sum to 1.0, so normalization is NOT appropriate.')


=== CHECKING IF NORMALIZATION MAKES SENSE ===
Single Solvent target sums:
  Mean: 0.7955
  Std: 0.1943
  Min: 0.0288
  Max: 1.0000

Full Data target sums:
  Mean: 0.8035
  Std: 0.2092
  Min: 0.0112
  Max: 1.1233

Conclusion: Targets do NOT sum to 1.0, so normalization is NOT appropriate.
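
To make the conclusion concrete: when the true yields of a sample sum to about 0.80 (the observed mean), rescaling even a perfect prediction to sum to 1 introduces error. A small sketch with a made-up sample whose components sum to 0.80. The individual values are illustrative and not taken from the dataset:

```python
def normalize_to_one(pred):
    """Rescale a prediction vector so its components sum to 1."""
    total = sum(pred)
    return [p / total for p in pred]

def mse(pred, truth):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

# Illustrative sample: SM, Product 2, Product 3 summing to 0.80
truth = [0.40, 0.25, 0.15]
perfect_pred = list(truth)

raw_error = mse(perfect_pred, truth)
normalized_error = mse(normalize_to_one(perfect_pred), truth)
print(f"MSE without normalization: {raw_error:.6f}")
print(f"MSE with normalization:    {normalized_error:.6f}")
```

The same helper can be applied to out-of-fold predictions to measure the effect on real data before spending a submission on it.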
