# Loop 1 LB Feedback Analysis

## Submission Results
- **CV Score**: 0.8316
- **LB Score**: 0.7584
- **Gap**: +0.0732 (CV is 7.3 percentage points higher than LB)

This is a significant gap that needs investigation. Possible causes:
1. Overfitting to training data
2. Distribution shift between train and test
3. Data leakage in CV (imputation on combined data)
4. Model complexity too high
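
Cause 3 can be tested by re-running CV with the imputation fitted inside each fold. The block below is a minimal sketch of that setup, not the pipeline used in the experiment: the data is synthetic (a hypothetical stand-in for the Titanic features), and the median imputer and logistic regression are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the real train.csv).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Age": rng.normal(30, 14, n),
    "Fare": rng.lognormal(3, 1, n),
    "Pclass": rng.integers(1, 4, n),
})
X.loc[rng.random(n) < 0.2, "Age"] = np.nan  # roughly 20% missing ages
y = (rng.random(n) < 0.4).astype(int)

# The imputer and scaler sit inside the pipeline, so cross_val_score
# refits them on the training split of every fold (no leakage into validation).
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
leak_free_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Leak-free CV accuracy: {leak_free_scores.mean():.4f} +/- {leak_free_scores.std():.4f}")
```

If the real features are run through a pipeline like this and the CV score drops noticeably from 0.8316, preprocessing leakage is a likely contributor to the gap.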

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

# Load data
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
print(f"\nTrain survival rate: {train['Survived'].mean():.3f}")

In [None]:
# Check submission distribution vs training distribution
submission = pd.read_csv('/home/code/submission_candidates/candidate_000.csv')
print(f"Submission survival rate: {submission['Survived'].mean():.3f}")
print(f"Train survival rate: {train['Survived'].mean():.3f}")
print(f"\nDifference: {submission['Survived'].mean() - train['Survived'].mean():.3f}")

In [None]:
# Analyze feature distributions between train and test
def compare_distributions(train, test, col):
    if col in train.columns and col in test.columns:
        train_vals = train[col].dropna()
        test_vals = test[col].dropna()
        print(f"\n{col}:")
        print(f"  Train mean: {train_vals.mean():.3f}, std: {train_vals.std():.3f}")
        print(f"  Test mean: {test_vals.mean():.3f}, std: {test_vals.std():.3f}")
        print(f"  Difference: {abs(train_vals.mean() - test_vals.mean()):.3f}")

for col in ['Age', 'Fare', 'SibSp', 'Parch', 'Pclass']:
    compare_distributions(train, test, col)

In [None]:
# Check categorical distributions
print("Sex distribution:")
print(f"  Train: {train['Sex'].value_counts(normalize=True).to_dict()}")
print(f"  Test: {test['Sex'].value_counts(normalize=True).to_dict()}")

print("\nEmbarked distribution:")
print(f"  Train: {train['Embarked'].value_counts(normalize=True).to_dict()}")
print(f"  Test: {test['Embarked'].value_counts(normalize=True).to_dict()}")

print("\nPclass distribution:")
print(f"  Train: {train['Pclass'].value_counts(normalize=True).to_dict()}")
print(f"  Test: {test['Pclass'].value_counts(normalize=True).to_dict()}")

In [None]:
# Check missing value patterns
print("Missing values comparison:")
for col in ['Age', 'Cabin']:
    print(f"\n{col} missing:")
    for name, df in [('Train', train), ('Test', test)]:
        n_missing = df[col].isna().sum()
        print(f"  {name}: {n_missing}/{len(df)} ({n_missing / len(df) * 100:.1f}%)")

In [None]:
# Title extraction and comparison
train['Title'] = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test['Title'] = test['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

print("Title distribution:")
print("\nTrain:")
print(train['Title'].value_counts(normalize=True).head(10))
print("\nTest:")
print(test['Title'].value_counts(normalize=True).head(10))

In [None]:
# Analyze the gender-based baseline
# Simple rule: Female = 1, Male = 0
gender_baseline = (test['Sex'] == 'female').astype(int)
print(f"Gender-only baseline prediction survival rate: {gender_baseline.mean():.3f}")
print("This is the simplest baseline; it typically achieves ~76-77% LB")

# Our submission
print(f"\nOur submission survival rate: {submission['Survived'].mean():.3f}")
print(f"Difference from gender baseline: {submission['Survived'].mean() - gender_baseline.mean():.3f}")

In [None]:
# Analyze fold variance from the experiment
fold_scores = [0.8659, 0.8427, 0.7921, 0.8258, 0.8315]
lb_score = 0.7584  # LB score from the submission results above
print(f"Fold scores: {fold_scores}")
print(f"Mean: {np.mean(fold_scores):.4f}")
print(f"Std: {np.std(fold_scores):.4f}")  # population std (ddof=0)
print(f"Min: {min(fold_scores):.4f}")
print(f"Max: {max(fold_scores):.4f}")
print(f"Range: {max(fold_scores) - min(fold_scores):.4f}")

print(f"\nThe fold-to-fold spread (std {np.std(fold_scores):.4f}) suggests the model is sensitive to data splits.")
print(f"The LB score of {lb_score:.4f} is below even the worst fold ({min(fold_scores):.4f}).")
print("This suggests potential overfitting or distribution shift.")

## Analysis Summary

### Key Findings:
1. **CV-LB gap of 7.3 percentage points** is significant and concerning
2. The LB score (75.84%) is below even the worst CV fold (79.21%)
3. This suggests one or more of:
   - Overfitting to training data patterns
   - Distribution shift between train and test
   - Data leakage in preprocessing (imputation on combined data)
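
One way to probe the distribution-shift hypothesis beyond comparing means is adversarial validation: train a classifier to tell train rows from test rows and look at its AUC. This technique is not part of the original analysis; the sketch below is a hedged illustration on synthetic data, and `adversarial_auc`, `make_frame`, and the demo frames are hypothetical names. An AUC near 0.5 means the two sets are hard to tell apart, while a clearly higher value points to shift.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def adversarial_auc(train_df, test_df, features, seed=0):
    """Cross-validated AUC of a classifier separating train rows (0) from test rows (1)."""
    combined = pd.concat([train_df[features], test_df[features]], ignore_index=True)
    combined = combined.fillna(combined.median())  # crude fill, used only for this diagnostic
    is_test = np.concatenate([
        np.zeros(len(train_df), dtype=int),
        np.ones(len(test_df), dtype=int),
    ])
    clf = RandomForestClassifier(n_estimators=200, min_samples_leaf=5, random_state=seed)
    return cross_val_score(clf, combined, is_test, cv=5, scoring="roc_auc").mean()


# Synthetic demo (assumption: made-up data, not the competition files).
rng = np.random.default_rng(0)


def make_frame(n, age_mean):
    return pd.DataFrame({
        "Age": rng.normal(age_mean, 14, n),
        "Fare": rng.lognormal(3, 1, n),
    })


train_demo = make_frame(600, 30)
auc_same = adversarial_auc(train_demo, make_frame(300, 30), ["Age", "Fare"])
auc_shifted = adversarial_auc(train_demo, make_frame(300, 58), ["Age", "Fare"])
print(f"AUC, same distribution: {auc_same:.3f}")
print(f"AUC, shifted Age:       {auc_shifted:.3f}")
```

On the real data the call would look like `adversarial_auc(train, test, ['Age', 'Fare', 'SibSp', 'Parch', 'Pclass'])`, using the numeric columns already compared above.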

### Recommendations:
1. **Simplify the model** - reduce complexity to prevent overfitting
2. **Use simpler imputation** - avoid combining train+test for imputation; compute fill values from the training data only
3. **Try ensemble methods** - voting classifiers are often more robust than any single model
4. **Consider simpler features** - reduce feature count
5. **Regularization** - increase regularization in XGBoost
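
Recommendations 1, 3, and 5 can be combined in a soft-voting ensemble of deliberately constrained models. The following is a rough sketch under stated assumptions: the data is synthetic, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, and all hyperparameter values are illustrative starting points rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic features).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        # Smaller C means stronger L2 regularization.
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(C=0.5, max_iter=1000))),
        # Shallow trees and a minimum leaf size limit model complexity.
        ("rf", RandomForestClassifier(n_estimators=200, max_depth=4,
                                      min_samples_leaf=5, random_state=0)),
        # Stand-in for XGBoost: shallow trees, low learning rate, row subsampling.
        ("gb", GradientBoostingClassifier(n_estimators=100, max_depth=2,
                                          learning_rate=0.05, subsample=0.8,
                                          random_state=0)),
    ],
    voting="soft",  # average predicted probabilities across the three models
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
ensemble_scores = cross_val_score(ensemble, X, y, cv=cv, scoring="accuracy")
print(f"Ensemble CV accuracy: {ensemble_scores.mean():.4f} +/- {ensemble_scores.std():.4f}")
```

Soft voting requires every member to expose `predict_proba`, which all three models here do.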

### Target Recalibration:
- The target of 1.0 (100%) is **impossible**
- State-of-the-art is 81-85% LB
- Current 75.84% is below the gender-only baseline (~76-77%)
- **Immediate goal: Beat 77% LB (gender baseline)**
- **Stretch goal: Reach 80%+ LB**
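
To put the immediate goal in concrete terms, the LB scores can be converted into counts of correctly predicted passengers. This is a back-of-the-envelope sketch: `N_TEST = 418` is an assumption about the number of scored test rows (check it against the printed test shape, and note that the LB may score only a subset), and the helper names are hypothetical.

```python
import math

N_TEST = 418  # assumption: number of rows the LB is scored on


def correct_predictions(accuracy, n_rows):
    """Approximate count of correct predictions behind a reported accuracy."""
    return round(accuracy * n_rows)


def accuracy_std_error(accuracy, n_rows):
    """Binomial standard error of an accuracy measured on n_rows samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_rows)


current_correct = correct_predictions(0.7584, N_TEST)
needed_correct = math.ceil(0.77 * N_TEST)  # smallest count that reaches 77%
lb_std_error = accuracy_std_error(0.7584, N_TEST)

print(f"Correct now: {current_correct}/{N_TEST}")
print(f"Needed for 77%: {needed_correct}/{N_TEST} (+{needed_correct - current_correct})")
print(f"LB standard error: {lb_std_error:.4f}")
print(f"CV-LB gap in standard errors: {0.0732 / lb_std_error:.1f}")
```

Under that assumption the immediate goal amounts to only a handful of additional correct predictions, while the CV-LB gap is several standard errors wide, so it is unlikely to be sampling noise alone.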