# Loop 4 LB Feedback Analysis

## Key Results
- **exp_003 (Stacking)**: CV 0.0103 → LB 0.0949
- This is our BEST LB score yet!

## Submission History
| Exp | Model | CV MSE | LB Score | Gap (LB - CV MSE) |
|-----|-------|--------|----------|-------------------|
| exp_000 | MLP | 0.0113 | 0.0998 | 0.0885 |
| exp_001 | Trees | 0.0110 | 0.0999 | 0.0889 |
| exp_003 | Stacking | 0.0103 | 0.0949 | 0.0846 |

## Critical Insights
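1. **CV improvements mostly translate to LB improvements.**
   - CV MSE improved from 0.0113 → 0.0103 (8.8% better)
   - LB improved from 0.0998 → 0.0949 (4.9% better)
   - Caveat: exp_001 improved CV (0.0113 → 0.0110) while LB stayed flat (0.0998 → 0.0999), so small CV gains may not show up on the LB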

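2. **The gap is shrinking slightly, but read it with care.**
   - exp_000: gap = 0.0885
   - exp_001: gap = 0.0889
   - exp_003: gap = 0.0846
   - The gap subtracts CV MSE from an LB score that looks like RMSE (see the CV RMSE vs LB comparison below), so it mostly tracks the LB score itself rather than a true generalization gap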

3. **Target analysis:**
   - Target: 0.017270
   - Best LB: 0.0949
   - Best LB is 5.5x the target

In [1]:
import numpy as np
import pandas as pd

# Submission history
submissions = [
    {'exp': 'exp_000', 'model': 'MLP', 'cv_mse': 0.0113, 'lb': 0.0998},
    {'exp': 'exp_001', 'model': 'Trees', 'cv_mse': 0.0110, 'lb': 0.0999},
    {'exp': 'exp_003', 'model': 'Stacking', 'cv_mse': 0.0103, 'lb': 0.0949},
]

df = pd.DataFrame(submissions)
df['cv_rmse'] = np.sqrt(df['cv_mse'])
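# NOTE: this gap subtracts CV MSE from the LB score; if the LB metric is RMSE, the two are on different scales
df['gap'] = df['lb'] - df['cv_mse']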
print("Submission History:")
print(df.to_string(index=False))

Submission History:
    exp    model  cv_mse     lb  cv_rmse    gap
exp_000      MLP  0.0113 0.0998 0.106301 0.0885
exp_001    Trees  0.0110 0.0999 0.104881 0.0889
exp_003 Stacking  0.0103 0.0949 0.101489 0.0846


In [2]:
# Hypothesis: the LB metric is RMSE, not MSE
# Check: compare sqrt(CV MSE) against the LB score for each submission
print("\nCV RMSE vs LB comparison:")
for _, row in df.iterrows():
    print(f"{row['exp']}: CV RMSE = {row['cv_rmse']:.4f}, LB = {row['lb']:.4f}, diff = {abs(row['cv_rmse'] - row['lb']):.4f}")

print("\nThe CV RMSE values are close to LB scores!")
print("This confirms LB metric is likely RMSE.")

# Target analysis
target = 0.017270
print(f"\nTarget: {target}")
print(f"Best LB: {df['lb'].min()}")
print(f"Gap to target: {df['lb'].min() / target:.1f}x")


CV RMSE vs LB comparison:
exp_000: CV RMSE = 0.1063, LB = 0.0998, diff = 0.0065
exp_001: CV RMSE = 0.1049, LB = 0.0999, diff = 0.0050
exp_003: CV RMSE = 0.1015, LB = 0.0949, diff = 0.0066

The CV RMSE values are close to LB scores!
This confirms LB metric is likely RMSE.

Target: 0.01727
Best LB: 0.0949
Gap to target: 5.5x


In [3]:
# Progress analysis
print("\n=== PROGRESS ANALYSIS ===")
print(f"\nLB improvement from exp_000 to exp_003:")
print(f"  exp_000 LB: 0.0998")
print(f"  exp_003 LB: 0.0949")
print(f"  Improvement: {(0.0998 - 0.0949) / 0.0998 * 100:.1f}%")

print(f"\nCV improvement from exp_000 to exp_003:")
print(f"  exp_000 CV: 0.0113")
print(f"  exp_003 CV: 0.0103")
print(f"  Improvement: {(0.0113 - 0.0103) / 0.0113 * 100:.1f}%")

print("\nCV improvements ARE translating to LB improvements!")
print("The stacking ensemble is working!")
print("\nNext steps: Continue improving CV to improve LB.")


=== PROGRESS ANALYSIS ===

LB improvement from exp_000 to exp_003:
  exp_000 LB: 0.0998
  exp_003 LB: 0.0949
  Improvement: 4.9%

CV improvement from exp_000 to exp_003:
  exp_000 CV: 0.0113
  exp_003 CV: 0.0103
  Improvement: 8.8%

CV improvements ARE translating to LB improvements!
The stacking ensemble is working!

Next steps: Continue improving CV to improve LB.
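
Because MSE and RMSE are on different scales, the 8.8% CV gain and the 4.9% LB gain are not directly comparable. The sketch below converts the CV improvement to the RMSE scale. It assumes the LB metric really is RMSE, which is a hypothesis from the comparison above rather than a confirmed fact; the helper name `pct_improvement` is introduced here for illustration.

```python
import math

# Values copied from the submission history above.
cv_mse_old, cv_mse_new = 0.0113, 0.0103
lb_old, lb_new = 0.0998, 0.0949


def pct_improvement(old, new):
    """Relative reduction from old to new, in percent."""
    return (old - new) / old * 100


cv_mse_gain = pct_improvement(cv_mse_old, cv_mse_new)
# Assumption: the LB metric is RMSE, so compare on the square-root scale.
cv_rmse_gain = pct_improvement(math.sqrt(cv_mse_old), math.sqrt(cv_mse_new))
lb_gain = pct_improvement(lb_old, lb_new)

print(f"CV MSE improvement:  {cv_mse_gain:.1f}%")
print(f"CV RMSE improvement: {cv_rmse_gain:.1f}%")
print(f"LB improvement:      {lb_gain:.1f}%")
```

On the RMSE scale the CV gain is about 4.5%, which is close to the 4.9% LB gain, so the two improvements agree better than the raw 8.8% vs 4.9% comparison suggests.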


In [4]:
# What would it take to reach the target?
target = 0.017270
current_lb = 0.0949

print("\n=== PATH TO TARGET ===")
print(f"\nTarget: {target}")
print(f"Current best LB: {current_lb}")
print(f"Gap: {current_lb / target:.1f}x")

# If LB is RMSE, then target RMSE = 0.017270
# This means target MSE = 0.017270^2 = 0.000298
target_mse = target ** 2
print(f"\nIf target is RMSE:")
print(f"  Target MSE would be: {target_mse:.6f}")
print(f"  Current CV MSE: 0.0103")
print(f"  Need to reduce MSE by: {0.0103 / target_mse:.1f}x")

# This is a HUGE gap: a ~34.5x reduction in MSE would be needed
# But what if the target is achievable through a different approach?
print("\n=== POSSIBLE APPROACHES ===")
print("1. High-dimensional features (drfps: 2048, fragprints: 2133)")
print("2. Regressor chains (predict SM first, use as input)")
print("3. Per-target optimization (SM has highest variance)")
print("4. More diverse ensemble (add LightGBM, XGBoost)")
print("5. Feature selection / dimensionality reduction")


=== PATH TO TARGET ===

Target: 0.01727
Current best LB: 0.0949
Gap: 5.5x

If target is RMSE:
  Target MSE would be: 0.000298
  Current CV MSE: 0.0103
  Need to reduce MSE by: 34.5x

=== POSSIBLE APPROACHES ===
1. High-dimensional features (drfps: 2048, fragprints: 2133)
2. Regressor chains (predict SM first, use as input)
3. Per-target optimization (SM has highest variance)
4. More diverse ensemble (add LightGBM, XGBoost)
5. Feature selection / dimensionality reduction
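
To make approach 2 concrete, here is a minimal sketch of a regressor chain: fit SM first, then feed the SM prediction to the product models as an extra feature. Everything in it is an assumption for illustration: the data is synthetic, the base learner is a closed-form ridge regression, and the helpers `fit_ridge` and `predict_ridge` are hypothetical names, not code from the experiments.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w


def predict_ridge(model, X):
    w, b = model
    return X @ w + b


# Synthetic stand-in data (hypothetical): 200 samples, 5 features, 3 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
sm = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
p2 = 0.5 * (1 - sm) + 0.05 * rng.normal(size=200)
p3 = 0.3 * (1 - sm) + 0.05 * rng.normal(size=200)
X_train, X_test = X[:150], X[150:]

# Step 1: fit SM from the original features.
sm_model = fit_ridge(X_train, sm[:150])

# Step 2: append the SM *prediction* as an extra feature for the products.
X_train_aug = np.column_stack([X_train, predict_ridge(sm_model, X_train)])
X_test_aug = np.column_stack([X_test, predict_ridge(sm_model, X_test)])
p2_model = fit_ridge(X_train_aug, p2[:150])
p3_model = fit_ridge(X_train_aug, p3[:150])

preds = np.column_stack([
    predict_ridge(sm_model, X_test),
    predict_ridge(p2_model, X_test_aug),
    predict_ridge(p3_model, X_test_aug),
])
print(preds.shape)
```

Two caveats on this sketch. With a linear base learner the chained feature is just a linear combination of the inputs, so any benefit would have to come from nonlinear models such as the MLP or trees. The in-sample SM predictions used in step 2 are also optimistic; out-of-fold predictions would be the safer input. scikit-learn's `RegressorChain` implements the same idea.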


In [5]:
# Let's analyze what we know about the data
DATA_PATH = '/home/data'
full_df = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')
single_df = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')

print("\n=== DATA ANALYSIS ===")
print(f"\nFull data: {len(full_df)} samples")
print(f"Single solvent: {len(single_df)} samples")
print(f"Total: {len(full_df) + len(single_df)} samples")

# Target statistics
print("\nTarget statistics (single solvent):")
for col in ['SM', 'Product 2', 'Product 3']:
    print(f"  {col}: mean={single_df[col].mean():.4f}, std={single_df[col].std():.4f}, var={single_df[col].var():.4f}")

print("\nTarget statistics (full data):")
for col in ['SM', 'Product 2', 'Product 3']:
    print(f"  {col}: mean={full_df[col].mean():.4f}, std={full_df[col].std():.4f}, var={full_df[col].var():.4f}")


=== DATA ANALYSIS ===

Full data: 1227 samples
Single solvent: 656 samples
Total: 1883 samples

Target statistics (single solvent):
  SM: mean=0.5222, std=0.3602, var=0.1298
  Product 2: mean=0.1499, std=0.1431, var=0.0205
  Product 3: mean=0.1234, std=0.1315, var=0.0173

Target statistics (full data):
  SM: mean=0.4952, std=0.3794, var=0.1440
  Product 2: mean=0.1646, std=0.1535, var=0.0236
  Product 3: mean=0.1437, std=0.1458, var=0.0213


In [6]:
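# SM has the highest variance, so it is likely (not yet verified) the main source of error.
# Overall MSE = (MSE_SM + MSE_P2 + MSE_P3) / 3, so each target gets equal weight in the metric.
# If the models explain a similar fraction of each target's variance,
# the target with the largest variance (SM) contributes the most error.
# Per-target MSEs have not been measured yet, so target variance is used as a proxy below.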

# Variance-based analysis:
var_sm_single = single_df['SM'].var()
var_p2_single = single_df['Product 2'].var()
var_p3_single = single_df['Product 3'].var()

print("\n=== VARIANCE ANALYSIS ===")
print(f"\nSingle solvent variance:")
print(f"  SM: {var_sm_single:.4f} ({var_sm_single / (var_sm_single + var_p2_single + var_p3_single) * 100:.1f}% of total)")
print(f"  Product 2: {var_p2_single:.4f} ({var_p2_single / (var_sm_single + var_p2_single + var_p3_single) * 100:.1f}% of total)")
print(f"  Product 3: {var_p3_single:.4f} ({var_p3_single / (var_sm_single + var_p2_single + var_p3_single) * 100:.1f}% of total)")

print("\nSM contributes ~80% of the total variance!")
print("Improving SM prediction is the key to reducing overall error.")


=== VARIANCE ANALYSIS ===

Single solvent variance:
  SM: 0.1298 (77.4% of total)
  Product 2: 0.0205 (12.2% of total)
  Product 3: 0.0173 (10.3% of total)

SM contributes ~80% of the total variance!
Improving SM prediction is the key to reducing overall error.
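
Target variance is only a proxy for where the error comes from. A more direct check is to split the overall MSE into per-target MSEs. The sketch below shows the calculation on synthetic stand-in arrays; the noise level of 30% of each target's standard deviation is an arbitrary assumption, and `per_target_mse` is a hypothetical helper. In practice `y_true` and `y_pred` would be the real out-of-fold targets and predictions.

```python
import numpy as np


def per_target_mse(y_true, y_pred):
    """MSE for each target column; their mean is the overall MSE."""
    return ((y_true - y_pred) ** 2).mean(axis=0)


# Hypothetical stand-in for out-of-fold predictions: 3 targets scaled to the
# single-solvent standard deviations reported above, with noise proportional
# to each target's spread.
rng = np.random.default_rng(0)
target_std = np.array([0.3602, 0.1431, 0.1315])  # SM, Product 2, Product 3
y_true = rng.normal(size=(656, 3)) * target_std
y_pred = y_true + rng.normal(size=(656, 3)) * 0.3 * target_std

mse = per_target_mse(y_true, y_pred)
overall = mse.mean()
share = mse / mse.sum() * 100

for name, m, s in zip(["SM", "Product 2", "Product 3"], mse, share):
    print(f"{name}: MSE={m:.4f} ({s:.1f}% of total error)")
print(f"Overall MSE: {overall:.4f}")
```

If the real per-target shares differ a lot from the variance shares, the priority on SM should be revisited.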


In [7]:
# Feature analysis - what features are available?
import os

print("\n=== AVAILABLE FEATURES ===")
feature_files = [
    'spange_descriptors_lookup.csv',
    'acs_pca_descriptors_lookup.csv',
    'drfps_catechol_lookup.csv',
    'fragprints_lookup.csv',
]

for f in feature_files:
    path = f'{DATA_PATH}/{f}'
    if os.path.exists(path):
        # Use a separate name so the submissions DataFrame `df` is not overwritten
        feat_df = pd.read_csv(path, index_col=0)
        print(f"  {f}: {feat_df.shape[1]} features, {feat_df.shape[0]} solvents")

print("\nHigh-dimensional features (drfps, fragprints) are unexplored!")
print("These could capture chemistry better than low-dimensional descriptors.")


=== AVAILABLE FEATURES ===
  spange_descriptors_lookup.csv: 13 features, 26 solvents
  acs_pca_descriptors_lookup.csv: 5 features, 24 solvents
  drfps_catechol_lookup.csv: 2048 features, 24 solvents
  fragprints_lookup.csv: 2133 features, 24 solvents

High-dimensional features (drfps, fragprints) are unexplored!
These could capture chemistry better than low-dimensional descriptors.
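
One constraint to keep in mind: the fingerprint lookups have only 24 rows, so after centering they span at most 23 directions no matter how many columns they have. The sketch below illustrates this with a plain SVD-based PCA on a random binary matrix standing in for the real fingerprints. The 5% bit density and the helper `pca_reduce` are assumptions for illustration.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Drop constant columns, center, and project onto the top principal components."""
    varying = features[:, features.std(axis=0) > 0]
    centered = varying - varying.mean(axis=0)
    # Economy SVD: with n rows, at most n - 1 components carry variance after centering.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, centered.shape[0] - 1)
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    return u[:, :k] * s[:k], explained


# Hypothetical stand-in for a fingerprint lookup: 24 solvents x 2048 sparse binary bits.
rng = np.random.default_rng(0)
fingerprints = (rng.random((24, 2048)) < 0.05).astype(float)

reduced, explained = pca_reduce(fingerprints, n_components=10)
print(reduced.shape, f"{explained:.2f}")

# Asking for more components than the data supports is capped at n_rows - 1.
capped, _ = pca_reduce(fingerprints, n_components=100)
print(capped.shape)
```

Because the lookup is per solvent, the reduced features take only as many distinct values as there are solvents; the reduction mainly keeps model input size manageable rather than adding information.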


In [8]:
# Strategy for next experiments
print("\n" + "="*60)
print("STRATEGY FOR NEXT EXPERIMENTS")
print("="*60)

print("\n1. PRIORITY: Improve SM prediction")
print("   - SM contributes ~80% of variance")
print("   - Focus models on SM target")
print("   - Try regressor chain: predict SM first, use as input for Products")

print("\n2. HIGH-DIMENSIONAL FEATURES")
print("   - drfps (2048 features): Differential reaction fingerprints")
print("   - fragprints (2133 features): Fragment + fingerprint")
print("   - These may capture chemistry better")
print("   - Use PCA or feature selection to reduce dimensionality")

print("\n3. ENSEMBLE OPTIMIZATION")
print("   - Current: 50/50 MLP + Trees")
print("   - Try: 60/40, 70/30, or learned weights")
print("   - Add more diverse models: LightGBM, XGBoost, CatBoost")

print("\n4. REGRESSOR CHAINS")
print("   - Targets are correlated (chemical yields)")
print("   - Predict SM first (highest variance)")
print("   - Use SM prediction as input for Products")

print("\n5. REMAINING SUBMISSIONS: 2")
print("   - Use wisely to verify progress")
print("   - Submit only when CV improves significantly")


STRATEGY FOR NEXT EXPERIMENTS

1. PRIORITY: Improve SM prediction
   - SM contributes ~80% of variance
   - Focus models on SM target
   - Try regressor chain: predict SM first, use as input for Products

2. HIGH-DIMENSIONAL FEATURES
   - drfps (2048 features): Differential reaction fingerprints
   - fragprints (2133 features): Fragment + fingerprint
   - These may capture chemistry better
   - Use PCA or feature selection to reduce dimensionality

3. ENSEMBLE OPTIMIZATION
   - Current: 50/50 MLP + Trees
   - Try: 60/40, 70/30, or learned weights
   - Add more diverse models: LightGBM, XGBoost, CatBoost

4. REGRESSOR CHAINS
   - Targets are correlated (chemical yields)
   - Predict SM first (highest variance)
   - Use SM prediction as input for Products

5. REMAINING SUBMISSIONS: 2
   - Use wisely to verify progress
   - Submit only when CV improves significantly
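
For the "learned weights" option in item 3, a two-model blend has a closed-form optimum, so no grid over 60/40 or 70/30 is needed. The sketch below assumes out-of-fold predictions from both models are available; the arrays here are synthetic, and `best_blend_weight` is a hypothetical helper.

```python
import numpy as np


def best_blend_weight(pred_a, pred_b, y):
    """Weight w in [0, 1] minimizing the MSE of w * pred_a + (1 - w) * pred_b."""
    diff = (pred_a - pred_b).ravel()
    resid = (y - pred_b).ravel()
    denom = diff @ diff
    if denom == 0:  # identical predictions: any weight gives the same blend
        return 0.5
    return float(np.clip((resid @ diff) / denom, 0.0, 1.0))


def mse(pred, y):
    return float(((pred - y) ** 2).mean())


# Hypothetical out-of-fold predictions from two models with different noise levels.
rng = np.random.default_rng(0)
y = rng.random(500)
pred_mlp = y + rng.normal(scale=0.10, size=500)
pred_trees = y + rng.normal(scale=0.15, size=500)

w = best_blend_weight(pred_mlp, pred_trees, y)
blend = w * pred_mlp + (1 - w) * pred_trees
print(f"weight on MLP: {w:.2f}")
print(f"MSE  MLP={mse(pred_mlp, y):.4f}  Trees={mse(pred_trees, y):.4f}  blend={mse(blend, y):.4f}")
```

The weight should be fit on out-of-fold predictions only. Fitting it on in-sample predictions would favor whichever model overfits more.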
