# Loop 36 Analysis: Critical Assessment

**Current State:**
- Best CV: 0.008194 (exp_032: GP 0.15 + MLP 0.55 + LGBM 0.3)
- Best LB: 0.0877 (exp_030)
- Target: 0.0347
- Gap to target: 2.53x
- Submissions remaining: 5

**Latest experiments (exp_037 & exp_038):**
- Both similarity weighting experiments failed with IDENTICAL CV (0.022076)
- This is 169% worse than the best CV (0.022076 vs. 0.008194 for exp_032)
- The identical results suggest a bug in the implementation

**Key Question:** What's causing the failure and what should we try next?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Complete submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.0111, 'lb': 0.0982},
    {'exp': 'exp_001', 'cv': 0.0123, 'lb': 0.1065},
    {'exp': 'exp_003', 'cv': 0.0105, 'lb': 0.0972},
    {'exp': 'exp_005', 'cv': 0.0104, 'lb': 0.0969},
    {'exp': 'exp_006', 'cv': 0.0097, 'lb': 0.0946},
    {'exp': 'exp_007', 'cv': 0.0093, 'lb': 0.0932},
    {'exp': 'exp_009', 'cv': 0.0092, 'lb': 0.0936},
    {'exp': 'exp_012', 'cv': 0.0090, 'lb': 0.0913},
    {'exp': 'exp_024', 'cv': 0.0087, 'lb': 0.0893},
    {'exp': 'exp_026', 'cv': 0.0085, 'lb': 0.0887},
    {'exp': 'exp_030', 'cv': 0.0083, 'lb': 0.0877},
]

df = pd.DataFrame(submissions)
print('=== Submission History ===')
print(df)
print(f'\nLB/CV ratio (mean LB / mean CV): {df["lb"].mean() / df["cv"].mean():.2f}x')

=== Submission History ===
        exp      cv      lb
0   exp_000  0.0111  0.0982
1   exp_001  0.0123  0.1065
2   exp_003  0.0105  0.0972
3   exp_005  0.0104  0.0969
4   exp_006  0.0097  0.0946
5   exp_007  0.0093  0.0932
6   exp_009  0.0092  0.0936
7   exp_012  0.0090  0.0913
8   exp_024  0.0087  0.0893
9   exp_026  0.0085  0.0887
10  exp_030  0.0083  0.0877

LB/CV ratio (mean LB / mean CV): 9.69x


In [2]:
# Linear regression to understand CV-LB relationship
target_lb = 0.0347
slope, intercept, r_value, p_value, std_err = stats.linregress(df['cv'], df['lb'])
print(f'Linear fit: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'R² = {r_value**2:.4f}')
print(f'\nIntercept: {intercept:.4f}')
print(f'Target LB: {target_lb}')

# Predict LB for our best-CV model
best_cv = 0.008194  # exp_032
predicted_lb = slope * best_cv + intercept
print(f'\nPredicted LB for exp_032 (CV={best_cv}): {predicted_lb:.4f}')

# What CV would we need to hit the target?
required_cv = (target_lb - intercept) / slope
print(f'\nTo reach target LB = {target_lb}:')
print(f'  Required CV = ({target_lb} - {intercept:.4f}) / {slope:.2f} = {required_cv:.6f}')

if required_cv < 0:
    print('\n⚠️ With current CV-LB relationship, target is unreachable!')
    print(f'The intercept alone ({intercept:.4f}) is {intercept/target_lb:.2f}x higher than target ({target_lb})')
    print('\nBUT: This means we need to CHANGE the relationship, not just improve CV!')

Linear fit: LB = 4.30 * CV + 0.0524
R² = 0.9675

Intercept: 0.0524
Target LB: 0.0347

Predicted LB for exp_032 (CV=0.008194): 0.0877

To reach target LB = 0.0347:
  Required CV = (0.0347 - 0.0524) / 4.30 = -0.004118

⚠️ With current CV-LB relationship, target is unreachable!
The intercept alone (0.0524) is 1.51x higher than target (0.0347)

BUT: This means we need to CHANGE the relationship, not just improve CV!
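
The intercept is an extrapolation to CV = 0, well outside the observed CV range (0.0083 to 0.0123), so it is worth checking how much it moves when a single submission is dropped. The sketch below is one way to do that, refitting the line with `numpy.polyfit` on leave-one-out subsets. The `cv` and `lb` arrays are copied from the submission history; the helper name `loo_intercepts` is illustrative only and not part of the original analysis.

```python
import numpy as np

# CV and LB scores copied from the submission history table.
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877])


def loo_intercepts(x, y):
    """Refit y = slope * x + intercept once per left-out point (hypothetical helper)."""
    intercepts = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        # polyfit returns coefficients highest power first: [slope, intercept].
        _, b = np.polyfit(x[keep], y[keep], deg=1)
        intercepts.append(b)
    return np.array(intercepts)


full_slope, full_intercept = np.polyfit(cv, lb, deg=1)
loo = loo_intercepts(cv, lb)

print(f'Full fit: LB = {full_slope:.2f} * CV + {full_intercept:.4f}')
print(f'Leave-one-out intercept range: {loo.min():.4f} to {loo.max():.4f}')
print(f'All leave-one-out intercepts above target 0.0347: {bool((loo > 0.0347).all())}')
```

If the leave-one-out range is narrow and stays above the target, the conclusion that the intercept itself must change does not hinge on any single submission.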


In [3]:
# Analyze the similarity weighting failure
print('=== SIMILARITY WEIGHTING FAILURE ANALYSIS ===')
print()
print('exp_037 (Similarity Weighting): CV 0.022076')
print('exp_038 (Inverse Similarity Weighting): CV 0.022076')
print()
print('IDENTICAL RESULTS! This is highly suspicious.')
print()
print('Evaluator identified the bug:')
print('  - Baseline uses WeightedHuberLoss with target weights [1.0, 1.0, 2.0]')
print('  - Similarity experiments dropped target weighting entirely!')
print('  - Loss computation was different:')
print('    Baseline: loss_fn(pred, y_batch) with target weights')
print('    Similarity: huber(pred, y_batch).mean(dim=1) * sample_weights')
print()
print('The missing target weighting caused the degradation, NOT the sample weighting!')

=== SIMILARITY WEIGHTING FAILURE ANALYSIS ===

exp_037 (Similarity Weighting): CV 0.022076
exp_038 (Inverse Similarity Weighting): CV 0.022076

IDENTICAL RESULTS! This is highly suspicious.

Evaluator identified the bug:
  - Baseline uses WeightedHuberLoss with target weights [1.0, 1.0, 2.0]
  - Similarity experiments dropped target weighting entirely!
  - Loss computation was different:
    Baseline: loss_fn(pred, y_batch) with target weights
    Similarity: huber(pred, y_batch).mean(dim=1) * sample_weights

The missing target weighting caused the degradation, NOT the sample weighting!
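
A minimal sketch of a loss that keeps the target weights [1.0, 1.0, 2.0] while also applying per-sample weights follows. It uses NumPy as a framework-neutral stand-in for the PyTorch training loss. The baseline `WeightedHuberLoss` is not shown in this notebook, so the normalization choices here (weighted mean over targets, sample weights normalized to sum to 1) and the function names are assumptions.

```python
import numpy as np


def huber(residual, delta=1.0):
    """Elementwise Huber loss: quadratic near zero, linear in the tails."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)


def weighted_huber_loss(pred, target, target_weights, sample_weights=None, delta=1.0):
    """Huber loss with per-target weights and optional per-sample weights (assumed form)."""
    per_element = huber(pred - target, delta)  # shape (n_samples, n_targets)
    # Weighted mean over targets, so the target weights are never dropped.
    per_sample = (per_element * target_weights).sum(axis=1) / target_weights.sum()
    if sample_weights is None:
        return float(per_sample.mean())
    w = sample_weights / sample_weights.sum()  # normalize so weights sum to 1
    return float((per_sample * w).sum())


pred = np.array([[0.1, 0.2, 0.3], [0.5, 0.5, 0.5]])
target = np.zeros_like(pred)
target_weights = np.array([1.0, 1.0, 2.0])

base = weighted_huber_loss(pred, target, target_weights)
sim = weighted_huber_loss(pred, target, target_weights, np.array([3.0, 1.0]))
inv = weighted_huber_loss(pred, target, target_weights, np.array([1.0, 3.0]))
print(base, sim, inv)
```

A useful sanity check on the real training loop is the same comparison: similarity and inverse-similarity weights should give different loss values on the same batch. Identical values would indicate that the sample weights are not reaching the loss.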


In [4]:
# What approaches haven't been tried?
print('=== UNEXPLORED APPROACHES ===')
print()
print('1. AGGRESSIVE FEATURE SELECTION')
print('   - Current: 145 features for 656 samples (24 solvents)')
print('   - Try: Top 20-30 features by importance')
print('   - Rationale: Fewer features = less overfitting to training solvents')
print()
print('2. SIMPLER MODEL ARCHITECTURE')
print('   - Current: [32, 16] MLP')
print('   - Try: Linear model or [16] single layer')
print('   - Rationale: Simpler models generalize better')
print()
print('3. STRONGER REGULARIZATION')
print('   - Current: weight_decay=1e-4')
print('   - Try: weight_decay=1e-2 or 1e-3')
print('   - Rationale: Prevent memorization of training solvents')
print()
print('4. FIX THE SIMILARITY WEIGHTING BUG')
print('   - Apply sample weights WHILE keeping target weights [1.0, 1.0, 2.0]')
print('   - Apply to LGBM (which handles sample weights natively)')
print()
print('5. DIFFERENT ENSEMBLE STRATEGY')
print('   - Try: Stacking with meta-learner instead of weighted average')
print('   - Try: Model selection per solvent type')

=== UNEXPLORED APPROACHES ===

1. AGGRESSIVE FEATURE SELECTION
   - Current: 145 features for 656 samples (24 solvents)
   - Try: Top 20-30 features by importance
   - Rationale: Fewer features = less overfitting to training solvents

2. SIMPLER MODEL ARCHITECTURE
   - Current: [32, 16] MLP
   - Try: Linear model or [16] single layer
   - Rationale: Simpler models generalize better

3. STRONGER REGULARIZATION
   - Current: weight_decay=1e-4
   - Try: weight_decay=1e-2 or 1e-3
   - Rationale: Prevent memorization of training solvents

4. FIX THE SIMILARITY WEIGHTING BUG
   - Apply sample weights WHILE keeping target weights [1.0, 1.0, 2.0]
   - Apply to LGBM (which handles sample weights natively)

5. DIFFERENT ENSEMBLE STRATEGY
   - Try: Stacking with meta-learner instead of weighted average
   - Try: Model selection per solvent type


In [5]:
# The key insight: intercept is the problem
print('=== KEY INSIGHT: THE INTERCEPT IS THE PROBLEM ===')
print()
print(f'Current CV-LB relationship: LB = {slope:.2f} * CV + {intercept:.4f}')
print(f'Intercept ({intercept:.4f}) > Target ({target_lb})')
print()
print('This means:')
print(f'  - Even with CV = 0, predicted LB would be {intercept:.4f}')
print('  - The intercept represents systematic generalization error')
print('  - We need approaches that REDUCE THE INTERCEPT')
print()
print('What could reduce the intercept?')
print('  1. Simpler models (fewer parameters to overfit)')
print('  2. Fewer features (less opportunity to memorize)')
print('  3. Stronger regularization (prevent memorization)')
print('  4. Domain adaptation (learn solvent-invariant features)')
print()
print(f'If we can reduce intercept from {intercept:.4f} to ~0.02:')
print(f'  With CV = {best_cv}, predicted LB = {slope:.2f} * {best_cv} + 0.02 = {slope * best_cv + 0.02:.4f}')
print(f'  This is still {(slope * best_cv + 0.02) / target_lb:.2f}x the target ({target_lb}), so CV must improve too')

=== KEY INSIGHT: THE INTERCEPT IS THE PROBLEM ===

Current CV-LB relationship: LB = 4.30 * CV + 0.0524
Intercept (0.0524) > Target (0.0347)

This means:
  - Even with CV = 0, predicted LB would be 0.0524
  - The intercept represents systematic generalization error
  - We need approaches that REDUCE THE INTERCEPT

What could reduce the intercept?
  1. Simpler models (fewer parameters to overfit)
  2. Fewer features (less opportunity to memorize)
  3. Stronger regularization (prevent memorization)
  4. Domain adaptation (learn solvent-invariant features)

If we can reduce intercept from 0.0524 to ~0.02:
  With CV = 0.008194, predicted LB = 4.30 * 0.008194 + 0.02 = 0.0553
  This is still 1.59x the target (0.0347), so CV must improve too
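
Rearranging LB = slope * CV + intercept gives the CV required for a given intercept, and the intercept required at a given CV. The sketch below works through that arithmetic with the rounded fit values (slope 4.30, target 0.0347). The helper names and the trial intercepts other than 0.0524 are illustrative assumptions.

```python
# Rounded values taken from the linear fit and the target.
slope = 4.30
target_lb = 0.0347
best_cv = 0.008194  # exp_032


def required_cv(intercept, slope=slope, target=target_lb):
    """CV needed to hit the target LB for a given intercept."""
    return (target - intercept) / slope


def required_intercept(cv, slope=slope, target=target_lb):
    """Intercept needed to hit the target LB at a given CV."""
    return target - slope * cv


for b in (0.0524, 0.03, 0.02, 0.01):
    print(f'intercept = {b:.4f} -> required CV = {required_cv(b):.6f}')

print(f'Intercept needed at CV = {best_cv}: {required_intercept(best_cv):.6f}')
```

Under these assumptions the intercept needed at the current best CV is negative, so lowering the intercept alone would not reach the target at the fitted slope. The slope or the CV would have to drop as well.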


In [6]:
# Strategic recommendation
print('=== STRATEGIC RECOMMENDATION ===')
print()
print('PRIORITY 1: Aggressive Feature Selection + Simpler Model')
print('  - Get feature importance from LightGBM')
print('  - Select top 25-30 features')
print('  - Train simpler model (Ridge or [16] MLP)')
print('  - Use strong regularization (weight_decay=1e-2)')
print('  - Rationale: Directly attacks the intercept problem')
print()
print('PRIORITY 2: Fix Similarity Weighting Bug')
print('  - Keep target weights [1.0, 1.0, 2.0]')
print('  - Add sample weights on top')
print('  - Apply to LGBM (native sample weight support)')
print('  - Rationale: The idea was sound, implementation was buggy')
print()
print('PRIORITY 3: Submit exp_032 (best CV)')
print('  - CV 0.008194 is our best')
print('  - Predicted LB: 0.0877')
print('  - Verify the CV-LB relationship holds')
print()
print('With 5 submissions remaining, we should:')
print('  1. Try aggressive simplification first (no submission cost)')
print('  2. If CV improves AND approach is fundamentally different, submit')
print('  3. Save submissions for approaches that could change the CV-LB relationship')

=== STRATEGIC RECOMMENDATION ===

PRIORITY 1: Aggressive Feature Selection + Simpler Model
  - Get feature importance from LightGBM
  - Select top 25-30 features
  - Train simpler model (Ridge or [16] MLP)
  - Use strong regularization (weight_decay=1e-2)
  - Rationale: Directly attacks the intercept problem

PRIORITY 2: Fix Similarity Weighting Bug
  - Keep target weights [1.0, 1.0, 2.0]
  - Add sample weights on top
  - Apply to LGBM (native sample weight support)
  - Rationale: The idea was sound, implementation was buggy

PRIORITY 3: Submit exp_032 (best CV)
  - CV 0.008194 is our best
  - Predicted LB: 0.0877
  - Verify the CV-LB relationship holds

With 5 submissions remaining, we should:
  1. Try aggressive simplification first (no submission cost)
  2. If CV improves AND approach is fundamentally different, submit
  3. Save submissions for approaches that could change the CV-LB relationship
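
One way to make "verify the CV-LB relationship holds" concrete is to test whether a new leaderboard score falls inside a band around the fitted line. The sketch below uses a band of k residual standard deviations. This is a rough check, narrower than a formal prediction interval, and both `k=2.0` and the helper name `is_consistent` are arbitrary choices. The LB values passed to the helper are hypothetical, since exp_032 has not been submitted.

```python
import numpy as np

# CV and LB scores copied from the submission history table.
cv = np.array([0.0111, 0.0123, 0.0105, 0.0104, 0.0097, 0.0093,
               0.0092, 0.0090, 0.0087, 0.0085, 0.0083])
lb = np.array([0.0982, 0.1065, 0.0972, 0.0969, 0.0946, 0.0932,
               0.0936, 0.0913, 0.0893, 0.0887, 0.0877])

slope, intercept = np.polyfit(cv, lb, deg=1)
residuals = lb - (slope * cv + intercept)
# Residual standard deviation, with 2 degrees of freedom used by the fit.
resid_std = np.sqrt((residuals ** 2).sum() / (len(cv) - 2))


def is_consistent(new_cv, new_lb, k=2.0):
    """True if new_lb lies within k residual standard deviations of the fitted line."""
    return bool(abs(new_lb - (slope * new_cv + intercept)) <= k * resid_std)


print(f'Residual std: {resid_std:.5f}')
# Hypothetical LB outcomes for exp_032 (CV = 0.008194).
print('LB 0.0877 consistent with fit:', is_consistent(0.008194, 0.0877))
print('LB 0.0700 consistent with fit:', is_consistent(0.008194, 0.0700))
```

A score that lands well below the band would be evidence that an approach has changed the CV-LB relationship, which is the outcome the remaining submissions are being saved for.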
