# Loop 2 LB Feedback Analysis

## Critical Observations:
1. CV MSE is ~0.011 but the LB score is ~0.10 (a 9x gap).
2. The target score is 0.017270, well below the best public LB of 0.098.
3. Both submissions (MLP and Tree) score about the same on the LB (~0.10).

## Hypothesis:
The LB may be scored with a metric other than MSE (for example RMSE or MAE). The cells below test this.

In [1]:
import numpy as np
import pandas as pd

# Load our submission
submission = pd.read_csv('/home/submission/submission.csv')
print(f"Submission shape: {submission.shape}")
print(submission.head())

Submission shape: (1883, 8)
   id  index  task  fold  row  target_1  target_2  target_3
0   0      0     0     0    0  0.003501  0.004214  0.815668
1   1      1     0     0    1  0.005486  0.006379  0.863340
2   2      2     0     0    2  0.027565  0.031438  0.795867
3   3      3     0     0    3  0.061783  0.069505  0.679801
4   4      4     0     0    4  0.079690  0.091531  0.595367


In [2]:
# Load actual data to compare
DATA_PATH = '/home/data'

full_df = pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')
single_df = pd.read_csv(f'{DATA_PATH}/catechol_single_solvent_yields.csv')

print(f"Single solvent: {single_df.shape}")
print(f"Full data: {full_df.shape}")
print(f"Total: {single_df.shape[0] + full_df.shape[0]}")

Single solvent: (656, 13)
Full data: (1227, 19)
Total: 1883


In [3]:
# Check target statistics
TARGET_LABELS = ["Product 2", "Product 3", "SM"]

print("Single solvent targets:")
print(single_df[TARGET_LABELS].describe())

print("\nFull data targets:")
print(full_df[TARGET_LABELS].describe())

Single solvent targets:
        Product 2   Product 3          SM
count  656.000000  656.000000  656.000000
mean     0.149932    0.123380    0.522192
std      0.143136    0.131528    0.360229
min      0.000000    0.000000    0.000000
25%      0.012976    0.009445    0.145001
50%      0.102813    0.078298    0.656558
75%      0.281654    0.193353    0.857019
max      0.463632    0.533768    1.000000

Full data targets:
         Product 2    Product 3           SM
count  1227.000000  1227.000000  1227.000000
mean      0.164626     0.143668     0.495178
std       0.153467     0.145779     0.379425
min       0.000000     0.000000     0.000000
25%       0.012723     0.012260     0.068573
50%       0.117330     0.094413     0.606454
75%       0.308649     0.254630     0.877448
max       0.463632     0.533768     1.083254


In [4]:
# Calculate different metrics on our predictions
# First, pair predictions with actual values by row position
# (this assumes the submission rows are in the same order as the data files)

# For single solvent (task=0)
single_preds = submission[submission['task'] == 0][['target_1', 'target_2', 'target_3']].values
single_actuals = single_df[TARGET_LABELS].values

print(f"Single preds shape: {single_preds.shape}")
print(f"Single actuals shape: {single_actuals.shape}")

Single preds shape: (656, 3)
Single actuals shape: (656, 3)


In [5]:
# Calculate various metrics
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmsle(y_true, y_pred):
    # Root Mean Squared Logarithmic Error
    y_true = np.clip(y_true, 1e-10, None)
    y_pred = np.clip(y_pred, 1e-10, None)
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Calculate for single solvent
print("Single Solvent Metrics:")
print(f"  MSE:  {mse(single_actuals, single_preds):.6f}")
print(f"  RMSE: {rmse(single_actuals, single_preds):.6f}")
print(f"  MAE:  {mae(single_actuals, single_preds):.6f}")
print(f"  RMSLE: {rmsle(single_actuals, single_preds):.6f}")

Single Solvent Metrics:
  MSE:  0.109317
  RMSE: 0.330631
  MAE:  0.234530
  RMSLE: 0.242635


In [6]:
# For full data (task=1)
full_preds = submission[submission['task'] == 1][['target_1', 'target_2', 'target_3']].values
full_actuals = full_df[TARGET_LABELS].values

print(f"Full preds shape: {full_preds.shape}")
print(f"Full actuals shape: {full_actuals.shape}")

print("\nFull Data Metrics:")
print(f"  MSE:  {mse(full_actuals, full_preds):.6f}")
print(f"  RMSE: {rmse(full_actuals, full_preds):.6f}")
print(f"  MAE:  {mae(full_actuals, full_preds):.6f}")
print(f"  RMSLE: {rmsle(full_actuals, full_preds):.6f}")

Full preds shape: (1227, 3)
Full actuals shape: (1227, 3)

Full Data Metrics:
  MSE:  0.010857
  RMSE: 0.104198
  MAE:  0.068499
  RMSLE: 0.081809


In [7]:
# Combined metrics
all_actuals = np.vstack([single_actuals, full_actuals])
all_preds = np.vstack([single_preds, full_preds])

print("\nOverall Metrics:")
print(f"  MSE:  {mse(all_actuals, all_preds):.6f}")
print(f"  RMSE: {rmse(all_actuals, all_preds):.6f}")
print(f"  MAE:  {mae(all_actuals, all_preds):.6f}")
print(f"  RMSLE: {rmsle(all_actuals, all_preds):.6f}")


Overall Metrics:
  MSE:  0.045159
  RMSE: 0.212506
  MAE:  0.126341
  RMSLE: 0.157705
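
The overall MSE is the row-weighted average of the two per-task MSEs, because every row contributes the same number of squared errors (three targets). The sketch below checks this using only numbers copied from the printed outputs; the helper name `pooled_mse` is illustrative and not part of the notebook.

```python
def pooled_mse(mse_values, row_counts):
    """Row-weighted average of per-group MSEs (same number of columns per row assumed)."""
    total = sum(m * n for m, n in zip(mse_values, row_counts))
    return total / sum(row_counts)

# Per-task MSEs and row counts copied from the outputs above
single_mse, single_n = 0.109317, 656
full_mse, full_n = 0.010857, 1227

overall = pooled_mse([single_mse, full_mse], [single_n, full_n])
unweighted = (single_mse + full_mse) / 2  # what an equal-weight-per-task score would give

print(f"Row-weighted MSE:     {overall:.6f}")
print(f"Unweighted task mean: {unweighted:.6f}")
```

The row-weighted value reproduces the overall MSE of 0.045159 up to rounding. If the LB instead weighted the two tasks equally, the same predictions would score noticeably higher, so the weighting scheme is worth confirming before comparing CV to LB.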


In [8]:
# Key insight: If LB is ~0.10 and our MSE is ~0.011
# Then LB might be RMSE (sqrt(0.011) = 0.105) or MAE

print("\nMetric Comparison:")
print(f"Our CV MSE: 0.010986")
print(f"sqrt(CV MSE) = RMSE: {np.sqrt(0.010986):.6f}")
print(f"LB Score: 0.0999")
print(f"\nConclusion: LB metric is likely RMSE, not MSE!")


Metric Comparison:
Our CV MSE: 0.010986
sqrt(CV MSE) = RMSE: 0.104814
LB Score: 0.0999

Conclusion: LB metric is likely RMSE, not MSE!
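
The comparison can be made explicit by expressing the CV score on each candidate scale and measuring its relative distance to the LB score. This is a minimal sketch using the two numbers quoted in the cell above; only MSE and RMSE are compared, because MAE cannot be derived from an MSE value alone.

```python
import math

cv_mse = 0.010986   # CV MSE quoted above
lb_score = 0.0999   # LB score quoted above

# CV score expressed on each candidate scale
candidates = {"MSE": cv_mse, "RMSE": math.sqrt(cv_mse)}

# Relative distance of each candidate from the LB score
rel_diff = {name: abs(value - lb_score) / lb_score for name, value in candidates.items()}
for name, diff in rel_diff.items():
    print(f"{name}: CV={candidates[name]:.6f}, relative difference to LB={diff:.1%}")

closest = min(rel_diff, key=rel_diff.get)
print(f"Closest candidate: {closest}")
```

A close match on the RMSE scale is consistent with the hypothesis but does not prove it: the full-data MAE printed earlier (0.068) is also the same order of magnitude as the LB, and the LB may be computed on different rows than the CV folds.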


In [9]:
# If LB is RMSE, then target of 0.017270 means:
# Target RMSE = 0.017270
# Target MSE = 0.017270^2 = 0.000298

print("\nTarget Analysis:")
print(f"Target score: 0.017270")
print(f"If RMSE: Target MSE = {0.017270**2:.6f}")
print(f"If MSE: Target RMSE = {np.sqrt(0.017270):.6f}")
print(f"\nOur current RMSE: {np.sqrt(0.010986):.6f}")
print(f"Gap to target (if RMSE): {np.sqrt(0.010986) - 0.017270:.6f}")


Target Analysis:
Target score: 0.017270
If RMSE: Target MSE = 0.000298
If MSE: Target RMSE = 0.131415

Our current RMSE: 0.104814
Gap to target (if RMSE): 0.087544
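
The absolute gap hides how different the two readings of the target are. The sketch below, using only the CV MSE and target values quoted above, expresses the gap as a ratio under each interpretation.

```python
import math

cv_mse = 0.010986
target = 0.017270

# Scenario 1: the target is an RMSE
cv_rmse = math.sqrt(cv_mse)
rmse_factor = cv_rmse / target        # how many times lower the RMSE must get
mse_factor = cv_mse / target ** 2     # same requirement on the MSE scale (rmse_factor squared)

# Scenario 2: the target is an MSE
mse_ratio = cv_mse / target           # below 1 means the CV MSE is already under the target

print(f"If target is RMSE: RMSE must drop {rmse_factor:.2f}x, MSE must drop {mse_factor:.1f}x")
print(f"If target is MSE:  CV MSE / target = {mse_ratio:.3f}")
```

Under the RMSE reading the required improvement is roughly sixfold in RMSE, which is the square of that in MSE. Under the MSE reading the CV MSE would already sit below the target, which conflicts with the LB score of ~0.10 and is one more reason to doubt that reading.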


In [10]:
# Wait - the target 0.017270 is LOWER than our LB of 0.10
# If LB is RMSE, then a target RMSE of 0.017 means an MSE of 0.0003
# That's ~37x lower than our current MSE of 0.011!

# Let's check if the target might be MSE after all
print("Scenario Analysis:")
print("\nIf LB metric is MSE:")
print(f"  Our LB: 0.0999 (MSE)")
print(f"  Target: 0.017270 (MSE)")
print(f"  Gap: {0.0999 - 0.017270:.6f}")
print(f"  Improvement needed: {(0.0999 - 0.017270) / 0.0999 * 100:.1f}%")

print("\nIf LB metric is RMSE:")
print(f"  Our LB: 0.0999 (RMSE)")
print(f"  Target: 0.017270 (RMSE)")
print(f"  Gap: {0.0999 - 0.017270:.6f}")
print(f"  Improvement needed: {(0.0999 - 0.017270) / 0.0999 * 100:.1f}%")

Scenario Analysis:

If LB metric is MSE:
  Our LB: 0.0999 (MSE)
  Target: 0.017270 (MSE)
  Gap: 0.082630
  Improvement needed: 82.7%

If LB metric is RMSE:
  Our LB: 0.0999 (RMSE)
  Target: 0.017270 (RMSE)
  Gap: 0.082630
  Improvement needed: 82.7%


In [11]:
# The target of 0.017270 is 5-6x better than best public LB of 0.098
# This is a HUGE gap - suggests either:
# 1. The target is achievable through fundamentally different approaches
# 2. The target represents a different metric/evaluation
# 3. There's something special about the winning approach

# Let's analyze per-target errors to see where we're losing
print("Per-Target Analysis (Single Solvent):")
for i, target in enumerate(['Product 2', 'Product 3', 'SM']):
    target_mse = mse(single_actuals[:, i], single_preds[:, i])
    target_mae = mae(single_actuals[:, i], single_preds[:, i])
    print(f"  {target}: MSE={target_mse:.6f}, MAE={target_mae:.6f}, RMSE={np.sqrt(target_mse):.6f}")

print("\nPer-Target Analysis (Full Data):")
for i, target in enumerate(['Product 2', 'Product 3', 'SM']):
    target_mse = mse(full_actuals[:, i], full_preds[:, i])
    target_mae = mae(full_actuals[:, i], full_preds[:, i])
    print(f"  {target}: MSE={target_mse:.6f}, MAE={target_mae:.6f}, RMSE={np.sqrt(target_mse):.6f}")

Per-Target Analysis (Single Solvent):
  Product 2: MSE=0.037323, MAE=0.154467, RMSE=0.193192
  Product 3: MSE=0.028752, MAE=0.133125, RMSE=0.169564
  SM: MSE=0.261876, MAE=0.415997, RMSE=0.511738

Per-Target Analysis (Full Data):
  Product 2: MSE=0.009361, MAE=0.061082, RMSE=0.096750
  Product 3: MSE=0.012838, MAE=0.072987, RMSE=0.113305
  SM: MSE=0.010373, MAE=0.071427, RMSE=0.101847


In [12]:
# Check if there are outliers in predictions
print("Prediction Statistics:")
print("\nSingle Solvent Predictions:")
print(pd.DataFrame(single_preds, columns=['Product 2', 'Product 3', 'SM']).describe())

print("\nFull Data Predictions:")
print(pd.DataFrame(full_preds, columns=['Product 2', 'Product 3', 'SM']).describe())

Prediction Statistics:

Single Solvent Predictions:
        Product 2   Product 3          SM
count  656.000000  656.000000  656.000000
mean     0.136576    0.122314    0.542240
std      0.117626    0.102686    0.339582
min      0.000033    0.000543    0.000000
25%      0.035657    0.036353    0.194687
50%      0.096073    0.092997    0.678806
75%      0.226385    0.191992    0.852793
max      0.433083    0.390699    1.000000

Full Data Predictions:
         Product 2    Product 3           SM
count  1227.000000  1227.000000  1227.000000
mean      0.166421     0.153238     0.488704
std       0.136136     0.125956     0.366357
min       0.000000     0.000000     0.000000
25%       0.027428     0.029514     0.109758
50%       0.143730     0.129578     0.589105
75%       0.301773     0.257217     0.879787
max       0.440235     0.413043     1.000000


In [None]:
# Check for large errors
errors = np.abs(all_actuals - all_preds)
print("Error Distribution:")
print(f"Max error: {errors.max():.4f}")
print(f"Mean error: {errors.mean():.4f}")
print(f"Median error: {np.median(errors):.4f}")
print(f"95th percentile: {np.percentile(errors, 95):.4f}")
print(f"99th percentile: {np.percentile(errors, 99):.4f}")

# Count large errors
print(f"\nErrors > 0.3: {(errors > 0.3).sum()} ({(errors > 0.3).mean()*100:.1f}%)")
print(f"Errors > 0.5: {(errors > 0.5).sum()} ({(errors > 0.5).mean()*100:.1f}%)")

## Key Findings

1. **CV-LB Gap Explanation**: Our CV MSE of 0.011 corresponds to an RMSE of ~0.105, which is close to the LB score of ~0.10. This suggests the LB metric is RMSE rather than MSE.

2. **Target Analysis**: The target of 0.017270 is about 5.7x lower than the best public LB of 0.098, which is a very large gap.

3. **Possible Explanations**:
   - The target might be achievable through approaches not yet explored publicly
   - The target might represent a different evaluation (e.g., on a subset of data)
   - There might be domain-specific techniques that dramatically improve predictions

4. **Next Steps**:
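   - Compare CV and LB on the RMSE scale; minimizing MSE also minimizes RMSE on the same data, so the training objective need not change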
   - Try fundamentally different approaches (Gaussian Processes, Deep Kernel Learning)
   - Investigate if certain solvents/conditions are much harder to predict
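
The last next step, checking whether some solvents or conditions are much harder to predict, could be approached with a per-group RMSE table. This is a minimal sketch on toy data: the column names `solvent`, `y_true`, and `y_pred` are hypothetical placeholders, not columns confirmed to exist in the competition files.

```python
import numpy as np
import pandas as pd

def per_group_rmse(df, group_col, true_cols, pred_cols):
    """RMSE per group, pooled over the target columns, worst group first."""
    sq_err = (df[true_cols].to_numpy() - df[pred_cols].to_numpy()) ** 2
    # Mean squared error per row, then averaged within each group
    row_mse = pd.Series(sq_err.mean(axis=1), index=df.index)
    group_mse = row_mse.groupby(df[group_col]).mean()
    return np.sqrt(group_mse).sort_values(ascending=False)

# Toy data only; real column names would need to be substituted
toy = pd.DataFrame({
    "solvent": ["A", "A", "B", "B"],
    "y_true": [0.0, 0.0, 0.0, 0.0],
    "y_pred": [0.1, 0.1, 0.3, 0.3],
})

result = per_group_rmse(toy, "solvent", ["y_true"], ["y_pred"])
print(result)
```

Sorting worst-first makes it easy to see whether a handful of groups dominate the error, in which case targeted features or models for those groups may matter more than a global change.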

In [13]:
# CRITICAL: The submission contains out-of-fold CV predictions, not predictions from one model fit on all rows
# Each fold's predictions are for that fold's held-out test set
# So we need to match predictions to actual values fold by fold

# Let's check the fold structure
print("Submission fold structure:")
print(submission.groupby(['task', 'fold']).size())

Submission fold structure:
task  fold
0     0        37
      1        37
      2        58
      3        59
      4        22
      5        18
      6        34
      7        41
      8        20
      9        22
      10       18
      11       18
      12       42
      13       18
      14       17
      15       22
      16        5
      17       16
      18       36
      19       18
      20       21
      21       22
      22       37
      23       18
1     0       122
      1       124
      2       104
      3       125
      4       125
      5       124
      6       125
      7       110
      8       127
      9        36
      10       34
      11       36
      12       35
dtype: int64
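
One way to recover the fold-to-row mapping is to look for a column in the data whose group sizes, in order of first appearance, reproduce the fold sizes above. The sketch below shows the idea on toy data. The column names and fold sizes are hypothetical, and the grouping actually used by the CV split is not shown in this notebook.

```python
import pandas as pd

def matches_fold_sizes(data, group_col, fold_sizes):
    """True if grouping `data` by `group_col` (order of first appearance)
    yields exactly the given sequence of fold sizes."""
    sizes = data.groupby(group_col, sort=False).size().tolist()
    return sizes == list(fold_sizes)

# Toy stand-ins for the real data frame and the submission's fold sizes
toy_data = pd.DataFrame({
    "solvent": ["A", "A", "A", "B", "B", "C"],
    "batch": [1, 1, 2, 2, 3, 3],
})
toy_fold_sizes = [3, 2, 1]

candidates = [
    col for col in toy_data.columns
    if matches_fold_sizes(toy_data, col, toy_fold_sizes)
]
print(f"Columns consistent with the fold sizes: {candidates}")
```

A size match is only a necessary condition. Once a candidate column is found, the per-fold metrics should be recomputed with rows joined by that grouping and compared against the reported CV score before trusting the alignment.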
