# Loop 11 Strategic Analysis

## Goal: Decide whether to submit exp_011 and plan next experiments

### Key Questions:
1. Is exp_011's CV improvement statistically significant?
2. What does the CV-LB relationship predict for exp_011?
3. Should we submit or continue experimenting?

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

# Submission history
submissions = [
    {'exp': 'exp_000', 'cv': 0.80674, 'lb': 0.79705},
    {'exp': 'exp_003', 'cv': 0.81951, 'lb': 0.80453},
    {'exp': 'exp_004', 'cv': 0.81928, 'lb': 0.80406},
    {'exp': 'exp_006', 'cv': 0.81709, 'lb': 0.80102},
]

df = pd.DataFrame(submissions)
df['gap'] = df['cv'] - df['lb']
df['gap_pct'] = df['gap'] / df['cv'] * 100
print("Submission History:")
print(df.to_string(index=False))
print(f"\nMean CV-LB gap: {df['gap'].mean():.5f} ({df['gap_pct'].mean():.2f}%)")

Submission History:
    exp      cv      lb     gap  gap_pct
exp_000 0.80674 0.79705 0.00969 1.201130
exp_003 0.81951 0.80453 0.01498 1.827922
exp_004 0.81928 0.80406 0.01522 1.857729
exp_006 0.81709 0.80102 0.01607 1.966736

Mean CV-LB gap: 0.01399 (1.71%)


In [2]:
# Linear regression to predict LB from CV
from sklearn.linear_model import LinearRegression

X = df['cv'].values.reshape(-1, 1)
y = df['lb'].values

model = LinearRegression()
model.fit(X, y)

print(f"CV-LB Model: LB = {model.coef_[0]:.4f} * CV + {model.intercept_:.4f}")
print(f"R-squared = {model.score(X, y):.4f}")

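# Predict LB for exp_011
# Note: the fit uses only 4 submissions and exp_011's CV lies above every CV
# in the fit, so this is a rough extrapolation
exp_011_cv = 0.82032
best_lb = df['lb'].max()  # best leaderboard score so far (exp_003)
predicted_lb = model.predict([[exp_011_cv]])[0]
print(f"\nexp_011 CV: {exp_011_cv:.5f}")
print(f"Predicted LB: {predicted_lb:.5f}")
print(f"Best LB so far: {best_lb:.5f}")
print(f"Predicted improvement: {predicted_lb - best_lb:.5f}")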

CV-LB Model: LB = 0.5472 * CV + 0.3553
R-squared = 0.9199

exp_011 CV: 0.82032
Predicted LB: 0.80422
Best LB so far: 0.80453
Predicted improvement: -0.00031
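
The point prediction above comes from a line fitted to only four submissions. A minimal sketch of a 95% prediction interval for the predicted LB follows, assuming the usual ordinary-least-squares conditions (independent, roughly normal residuals) hold for these four points. `scipy.stats.linregress` is swapped in for the `LinearRegression` model purely for convenience, and names such as `se_pred` are illustrative.

```python
import numpy as np
from scipy import stats

# CV and LB scores from the submission history
cv = np.array([0.80674, 0.81951, 0.81928, 0.81709])
lb = np.array([0.79705, 0.80453, 0.80406, 0.80102])

fit = stats.linregress(cv, lb)
n = len(cv)

# Residual standard error with n - 2 degrees of freedom
resid = lb - (fit.intercept + fit.slope * cv)
s = np.sqrt(np.sum(resid**2) / (n - 2))

# Standard error of a new prediction at exp_011's CV
x_new = 0.82032
pred = fit.intercept + fit.slope * x_new
sxx = np.sum((cv - cv.mean()) ** 2)
se_pred = s * np.sqrt(1 + 1 / n + (x_new - cv.mean()) ** 2 / sxx)

# Two-sided 95% interval from the t distribution
t_crit = stats.t.ppf(0.975, df=n - 2)
lower, upper = pred - t_crit * se_pred, pred + t_crit * se_pred

print(f"Predicted LB: {pred:.5f}")
print(f"95% prediction interval: [{lower:.5f}, {upper:.5f}]")
```

If the interval comfortably contains the best LB of 0.80453, the predicted shortfall of 0.00031 is within the uncertainty of the fit and should not be read as evidence against submitting.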


In [3]:
# What CV is needed to beat best LB (0.80453)?
target_lb = 0.80453
required_cv = (target_lb - model.intercept_) / model.coef_[0]
print(f"To beat LB {target_lb:.5f}, need CV > {required_cv:.5f}")
print(f"exp_011 CV: {exp_011_cv:.5f}")
print(f"Margin: {exp_011_cv - required_cv:.5f}")

# What about top LB (0.8066)?
top_lb = 0.8066
required_cv_top = (top_lb - model.intercept_) / model.coef_[0]
print(f"\nTo reach top LB {top_lb:.4f}, need CV > {required_cv_top:.5f}")
print(f"Gap from exp_011 CV: {required_cv_top - exp_011_cv:.5f}")

To beat LB 0.80453, need CV > 0.82089
exp_011 CV: 0.82032
Margin: -0.00057

To reach top LB 0.8066, need CV > 0.82467
Gap from exp_011 CV: 0.00435


In [4]:
# Analyze fold variance from exp_011
fold_scores = [0.80460, 0.84253, 0.83333, 0.79862, 0.82969, 0.82048, 0.83774, 0.81013, 0.81243, 0.81358]

print("exp_011 Fold Scores:")
print(f"Mean: {np.mean(fold_scores):.5f}")
# np.std defaults to ddof=0 (population std); the sample std (ddof=1) is slightly larger
print(f"Std: {np.std(fold_scores):.5f}")
print(f"Min: {min(fold_scores):.5f}")
print(f"Max: {max(fold_scores):.5f}")
print(f"Range: {max(fold_scores) - min(fold_scores):.5f}")

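# Compare with exp_003 (5-fold)
# Caveat: a 10-fold validation fold is half the size of a 5-fold one, so some
# increase in per-fold std is expected from the smaller folds alone
exp_003_std = 0.00685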
print(f"\nexp_003 std (5-fold): {exp_003_std:.5f}")
print(f"exp_011 std (10-fold): {np.std(fold_scores):.5f}")
print(f"Ratio: {np.std(fold_scores) / exp_003_std:.2f}x higher")

exp_011 Fold Scores:
Mean: 0.82031
Std: 0.01408
Min: 0.79862
Max: 0.84253
Range: 0.04391

exp_003 std (5-fold): 0.00685
exp_011 std (10-fold): 0.01408
Ratio: 2.06x higher
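
Part of this ratio is expected from fold size alone. If fold accuracy is modeled as a binomial proportion, its standard deviation scales as 1/sqrt(n_val), and a 10-fold validation fold holds half as many rows as a 5-fold one. A rough sketch under that assumption follows; it ignores model variance between folds, and the helper name is illustrative.

```python
import math


def expected_std_ratio(k_new: int, k_old: int) -> float:
    """Ratio of per-fold accuracy std expected from validation-fold size alone.

    Assumes binomial sampling noise: std is proportional to 1/sqrt(n_val),
    and n_val = N / k, so the ratio reduces to sqrt(k_new / k_old).
    """
    return math.sqrt(k_new / k_old)


expected = expected_std_ratio(10, 5)
observed = 0.01408 / 0.00685  # exp_011 std / exp_003 std, from the output above

print(f"Expected ratio from fold size alone: {expected:.2f}x")
print(f"Observed ratio: {observed:.2f}x")
print(f"Factor not explained by fold size: {observed / expected:.2f}x")
```

Since sqrt(2) is about 1.41, a sizeable share of the 2.06x ratio would appear even with no change in model stability. Std estimates from 5 and 10 folds are themselves noisy, so the remaining factor is suggestive rather than conclusive.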


In [5]:
# Is the CV improvement statistically significant?
# exp_003: 0.81951 +/- 0.00685 (5-fold)
# exp_011: 0.82032 +/- 0.01408 (10-fold)

exp_003_cv = 0.81951
exp_003_std = 0.00685
exp_003_n = 5

exp_011_cv = 0.82032
exp_011_std = 0.01408
exp_011_n = 10

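# Welch-style standard error of the difference between the two mean CV scores.
# This treats fold scores as independent samples, which is only approximately
# true because CV folds share training data.
se_diff = np.sqrt((exp_003_std**2 / exp_003_n) + (exp_011_std**2 / exp_011_n))
diff = exp_011_cv - exp_003_cv
t_stat = diff / se_diff

print(f"CV improvement: {diff:.5f}")
print(f"Standard error of difference: {se_diff:.5f}")
print(f"t-statistic: {t_stat:.3f}")
print("\nInterpretation:")
# Rule-of-thumb cutoffs on |t|, not a formal test at a fixed significance level
if abs(t_stat) < 1.0: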
    print("NOT statistically significant (t < 1.0)")
elif abs(t_stat) < 2.0:
    print("Marginally significant (1.0 < t < 2.0)")
else:
    print("Statistically significant (t > 2.0)")

CV improvement: 0.00081
Standard error of difference: 0.00540
t-statistic: 0.150

Interpretation:
NOT statistically significant (t < 1.0)
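
The same comparison can be run through `scipy.stats.ttest_ind_from_stats` with `equal_var=False` (Welch's t-test), which also reports a p-value. This is a hedged sketch: it assumes both reported stds were computed with `ddof=0` (true for exp_011, an assumption for exp_003) and rescales them to sample stds, which the function expects. Fold scores from cross-validation are not truly independent, so the p-value is indicative only.

```python
import math
from scipy import stats


def to_sample_std(pop_std: float, n: int) -> float:
    """Convert a ddof=0 std to the ddof=1 (sample) std."""
    return pop_std * math.sqrt(n / (n - 1))


# Summary statistics taken from the cells above
exp_003 = {"mean": 0.81951, "std": 0.00685, "n": 5}
exp_011 = {"mean": 0.82032, "std": 0.01408, "n": 10}

result = stats.ttest_ind_from_stats(
    mean1=exp_011["mean"],
    std1=to_sample_std(exp_011["std"], exp_011["n"]),
    nobs1=exp_011["n"],
    mean2=exp_003["mean"],
    std2=to_sample_std(exp_003["std"], exp_003["n"]),
    nobs2=exp_003["n"],
    equal_var=False,  # Welch's test: do not assume equal variances
)
t_welch, p_value = result.statistic, result.pvalue

print(f"Welch t-statistic: {t_welch:.3f}")
print(f"Two-sided p-value: {p_value:.3f}")
```

The t-statistic differs slightly from the hand-computed one because of the `ddof` rescaling; the conclusion that the CV difference is well inside the noise should be unchanged.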


In [6]:
# Decision analysis
print("=" * 60)
print("DECISION ANALYSIS")
print("=" * 60)

print("\n1. CV IMPROVEMENT:")
print(f"   exp_011 CV: {exp_011_cv:.5f} (BEST EVER)")
print(f"   exp_003 CV: {exp_003_cv:.5f}")
print(f"   Improvement: {diff:+.5f} ({diff/exp_003_cv*100:.2f}%)")

print("\n2. PREDICTED LB:")
print(f"   Predicted: {predicted_lb:.5f}")
print(f"   Best LB: 0.80453")
print(f"   Expected change: {predicted_lb - 0.80453:+.5f}")

print("\n3. STATISTICAL SIGNIFICANCE:")
print(f"   t-statistic: {t_stat:.3f}")
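# Use the same three-way thresholds as the significance cell above
verdict = ('NOT significant' if abs(t_stat) < 1.0
           else 'Marginally significant' if abs(t_stat) < 2.0
           else 'Significant')
print(f"   Verdict: {verdict}")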

print("\n4. SUBMISSIONS REMAINING: 6")

print("\n5. RECOMMENDATION:")
if predicted_lb > 0.80453:
    print("   SUBMIT - Predicted LB improvement")
else:
    print("   SUBMIT ANYWAY - Need LB feedback to calibrate regularization effect")
    print("   Regularization improved CV (contrary to overfitting hypothesis)")
    print("   Worth testing if it helps generalization")

DECISION ANALYSIS

1. CV IMPROVEMENT:
   exp_011 CV: 0.82032 (BEST EVER)
   exp_003 CV: 0.81951
   Improvement: +0.00081 (0.10%)

2. PREDICTED LB:
   Predicted: 0.80422
   Best LB: 0.80453
   Expected change: -0.00031

3. STATISTICAL SIGNIFICANCE:
   t-statistic: 0.150
   Verdict: NOT significant

4. SUBMISSIONS REMAINING: 6

5. RECOMMENDATION:
   SUBMIT ANYWAY - Need LB feedback to calibrate regularization effect
   Regularization improved CV (contrary to overfitting hypothesis)
   Worth testing if it helps generalization


In [7]:
# What else can we try to improve?
print("=" * 60)
print("NEXT EXPERIMENT OPTIONS")
print("=" * 60)

print("\n1. GROUPKFOLD (Evaluator's recommendation)")
print("   - Respect group structure in data")
print("   - May reduce fold variance")
print("   - 77.3% are solo travelers - may not help much")

print("\n2. KNN IMPUTATION")
print("   - Top solutions use KNN imputation")
print("   - Our current imputation is simple (mode/median)")
print("   - Different data preprocessing approach")

print("\n3. ENSEMBLE exp_003 + exp_011")
print("   - Combine less regularized (exp_003) + more regularized (exp_011)")
print("   - Different regularization levels capture different patterns")
print("   - Could provide diversity")

print("\n4. NAME-BASED FEATURES")
print("   - Extract surname from Name feature")
print("   - Families traveling together may have correlated outcomes")
print("   - We haven't exploited Name column at all")

print("\n5. PSEUDO-LABELING")
print("   - Use confident test predictions to augment training")
print("   - Could help with distribution shift")

print("\nPRIORITY ORDER:")
print("1. GroupKFold - address high fold variance")
print("2. KNN imputation - different data preprocessing")
print("3. Ensemble exp_003 + exp_011 - combine different regularization levels")

NEXT EXPERIMENT OPTIONS

1. GROUPKFOLD (Evaluator's recommendation)
   - Respect group structure in data
   - May reduce fold variance
   - 77.3% are solo travelers - may not help much

2. KNN IMPUTATION
   - Top solutions use KNN imputation
   - Our current imputation is simple (mode/median)
   - Different data preprocessing approach

3. ENSEMBLE exp_003 + exp_011
   - Combine less regularized (exp_003) + more regularized (exp_011)
   - Different regularization levels capture different patterns
   - Could provide diversity

4. NAME-BASED FEATURES
   - Extract surname from Name feature
   - Families traveling together may have correlated outcomes
   - We haven't exploited Name column at all

5. PSEUDO-LABELING
   - Use confident test predictions to augment training
   - Could help with distribution shift

PRIORITY ORDER:
1. GroupKFold - address high fold variance
2. KNN imputation - different data preprocessing
3. Ensemble exp_003 + exp_011 - combine different regularization levels
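
For the top-priority option, a minimal sketch of how `GroupKFold` keeps every group inside a single fold. The data is synthetic and the `groups` array is a stand-in for whatever travel-group identifier the dataset provides; neither comes from the experiments above.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n_rows = 200
X = rng.normal(size=(n_rows, 3))
y = rng.integers(0, 2, size=n_rows)
groups = rng.integers(0, 60, size=n_rows)  # hypothetical group id per row

gkf = GroupKFold(n_splits=5)
shared_groups = 0
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=groups)):
    # Groups that appear on both sides of the split would indicate leakage
    overlap = set(groups[train_idx]) & set(groups[val_idx])
    shared_groups += len(overlap)
    print(f"fold {fold}: {len(val_idx)} validation rows, "
          f"{len(overlap)} groups shared with train")

print(f"Total shared groups across folds: {shared_groups}")
```

Rows whose group contains only themselves are split much as in ordinary KFold, which is consistent with the note that a high share of solo travelers may limit the benefit. `GroupKFold` does not stratify by target; scikit-learn's `StratifiedGroupKFold` is an option if per-fold class balance matters.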
