# Loop 2: LB Feedback Analysis - Massive CV-LB Gap Investigation

**Submission**: exp_002_fixed_preprocessing  
**CV Score**: 83.84%  
**LB Score**: 0.7464  
**Gap**: +83.0936 (CV much higher than LB)

This is a MASSIVE gap that needs immediate investigation. Either:
1. Our CV is severely over-optimistic
2. There's a fundamental misunderstanding of the metric
3. Data leakage is still present
4. The submission format is wrong

Let's investigate systematically.

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns

# Load session state
with open('/home/code/session_state.json', 'r') as f:
    session_state = json.load(f)

print("=== SUBMISSION HISTORY ===")
for sub in session_state['submissions']:
    print(f"Experiment: {sub['experiment_id']}")
    print(f"CV Score: {sub['cv_score']:.4f}")
    print(f"LB Score: {sub['lb_score']:.4f}")
    print(f"Gap: {sub['cv_score'] - sub['lb_score']:.4f}")
    print()

print(f"Remaining submissions: {session_state['remaining_submissions']}/10")

=== SUBMISSION HISTORY ===
Experiment: exp_001
CV Score: 83.8400
LB Score: 0.7464
Gap: 83.0936

Remaining submissions: 8/10


## 1. Verify Submission Format

First, let's check if the submission file was created correctly.

In [2]:
# Load the actual submission file
submission_path = '/home/code/submission_candidates/candidate_001.csv'
submission = pd.read_csv(submission_path)

print("=== SUBMISSION FILE CHECK ===")
print(f"Shape: {submission.shape}")
print(f"Columns: {list(submission.columns)}")
print()
print("First 10 rows:")
print(submission.head(10))
print()
print("Value distribution:")
print(submission['Survived'].value_counts())
print()
print("Data types:")
print(submission.dtypes)

=== SUBMISSION FILE CHECK ===
Shape: (418, 2)
Columns: ['PassengerId', 'Survived']

First 10 rows:
   PassengerId  Survived
0          892         0
1          893         0
2          894         0
3          895         0
4          896         1
5          897         0
6          898         0
7          899         0
8          900         1
9          901         0

Value distribution:
Survived
0    260
1    158
Name: count, dtype: int64

Data types:
PassengerId    int64
Survived       int64
dtype: object
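
The manual inspection above can also be written as a reusable check. The following is a minimal sketch, assuming the expected layout is exactly the two columns and 418 rows shown in the output; `check_submission` is a hypothetical helper name, and it reads raw CSV text with the standard library so that values such as `1.0` are flagged before pandas gets a chance to parse them.

```python
import csv
import io


def check_submission(csv_text: str, expected_rows: int = 418) -> list[str]:
    """Return a list of format problems; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    if reader.fieldnames != ["PassengerId", "Survived"]:
        return [f"unexpected columns: {reader.fieldnames}"]

    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    ids = [row["PassengerId"] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate PassengerId values")

    # Anything other than the literal strings "0" and "1" is reported.
    bad_values = {row["Survived"] for row in rows} - {"0", "1"}
    if bad_values:
        problems.append(f"non-binary Survived values: {sorted(map(str, bad_values))}")
    return problems


# Tiny illustrative inputs; the row count is lowered to match them.
good = "PassengerId,Survived\n892,0\n893,0\n894,0\n"
bad = "PassengerId,Survived\n892,0\n892,1.0\n"

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```

For the real file, the text would come from reading `submission_path` and the default `expected_rows=418` would apply.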


## 2. Check Metric Direction

The session state has metric_direction=false, which the system reads as "lower is better". Titanic is scored on accuracy, where higher is better, so the two disagree. Let's verify which metric Kaggle actually uses for Titanic.

In [3]:
# Let's check what the actual evaluation metric should be
# Titanic competition typically uses accuracy (higher is better)

print("=== METRIC INVESTIGATION ===")
print("Session state metric_direction:", session_state['metric_direction'])
print("This means 'lower is better' according to the system")
print()
print("But Titanic competition uses ACCURACY (higher is better)")
print("Typical Titanic LB scores are in the range: 0.70 - 0.85")
print()
print("POSSIBLE ISSUE: The metric direction might be configured wrong!")
print("If the system thinks lower is better, it might be:")
print("1. Reporting 1 - accuracy instead of accuracy")
print("2. Or there's a fundamental misunderstanding")

=== METRIC INVESTIGATION ===
Session state metric_direction: False
This means 'lower is better' according to the system

But Titanic competition uses ACCURACY (higher is better)
Typical Titanic LB scores are in the range: 0.70 - 0.85

POSSIBLE ISSUE: The metric direction might be configured wrong!
If the system thinks lower is better, it might be:
1. Reporting 1 - accuracy instead of accuracy
2. Or there's a fundamental misunderstanding
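
To make the two hypotheses concrete, here is a small sketch of how a direction flag changes the verdict on the same pair of scores, and of what the LB score would imply if it really were `1 - accuracy`. `is_improvement` is a hypothetical helper; how the session tooling actually consumes `metric_direction` is not shown in this notebook, so treat the mapping to `higher_is_better` as an assumption.

```python
def is_improvement(new_score: float, best_score: float, higher_is_better: bool) -> bool:
    """True when new_score beats best_score under the given direction."""
    if higher_is_better:
        return new_score > best_score
    return new_score < best_score


lb_score = 0.7464

# Same pair of scores, opposite verdicts depending on the flag.
print(is_improvement(0.80, lb_score, higher_is_better=True))
print(is_improvement(0.80, lb_score, higher_is_better=False))

# If the leaderboard were reporting 1 - accuracy, the implied accuracy would be:
implied_accuracy = 1.0 - lb_score
print(f"Implied accuracy under the '1 - accuracy' hypothesis: {implied_accuracy:.4f}")
```

An implied accuracy of about 0.25 would be far below what a model with 83.84% CV accuracy should reach, so reading 0.7464 as plain accuracy looks like the more plausible interpretation. The flag still matters for any code that ranks experiments automatically.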


## 3. Compare CV vs LB Score Ranges

Let's compare our scores against typical CV and LB ranges for Titanic, with both expressed on the same 0-1 scale.

In [4]:
print("=== EXPECTED SCORE RANGES ===")
print("Based on Kaggle Titanic competition history:")
print("- Good CV scores: 0.80 - 0.86 (80% to 86% accuracy)")
print("- Good LB scores: 0.75 - 0.85 (75% to 85% accuracy)")
print("- Typical CV-LB gap: ±0.02 to ±0.05 (2% to 5%)")
print()
print("OUR SCORES:")
print("- CV: 0.8384 (83.84% - this is GOOD)")
print("- LB: 0.7464 (74.64% - this is REASONABLE)")
print("- Gap: +0.0920 (9.2% - this is LARGE but not catastrophic)")
print()
print("INTERPRETATION:")
print("- Our CV score of 83.84% is actually very good for Titanic")
print("- Our LB score of 74.64% is reasonable but on the lower side")
print("- The 9.2% gap suggests some overfitting or distribution shift")
print("- BUT: The gap is NOT +83.09 as initially calculated!")
print()
print("THE REAL GAP IS 9.2%, NOT 83.09%!")

=== EXPECTED SCORE RANGES ===
Based on Kaggle Titanic competition history:
- Good CV scores: 0.80 - 0.86 (80% to 86% accuracy)
- Good LB scores: 0.75 - 0.85 (75% to 85% accuracy)
- Typical CV-LB gap: ±0.02 to ±0.05 (2% to 5%)

OUR SCORES:
- CV: 0.8384 (83.84% - this is GOOD)
- LB: 0.7464 (74.64% - this is REASONABLE)
- Gap: +0.0920 (9.2% - this is LARGE but not catastrophic)

INTERPRETATION:
- Our CV score of 83.84% is actually very good for Titanic
- Our LB score of 74.64% is reasonable but on the lower side
- The 9.2% gap suggests some overfitting or distribution shift
- BUT: The gap is NOT +83.09 as initially calculated!

THE REAL GAP IS 9.2%, NOT 83.09%!
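
The 83.09 figure came from subtracting a fraction-scale LB score (0.7464) from a percent-scale CV score (83.84). A minimal sketch of a guard against that, assuming accuracy-like scores where any value above 1 is a percentage; `to_fraction` and `cv_lb_gap` are hypothetical helper names, not part of the session tooling.

```python
def to_fraction(score: float) -> float:
    """Return an accuracy-like score on the 0-1 scale.

    Assumption: values above 1 are percentages (83.84 means 83.84%).
    A score of exactly 1.0 is ambiguous and is treated as a fraction.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score out of range for an accuracy: {score}")
    return score / 100.0 if score > 1.0 else score


def cv_lb_gap(cv_score: float, lb_score: float) -> float:
    """CV minus LB, after putting both scores on the 0-1 scale."""
    return to_fraction(cv_score) - to_fraction(lb_score)


gap = cv_lb_gap(83.84, 0.7464)
print(f"Gap: {gap:+.4f} ({gap * 100:.1f} percentage points)")
```

Normalizing at the point where the gap is computed means the stored scores can stay in whatever scale they were logged in.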


## 4. Investigate Potential Issues

Let's check for common problems that cause CV-LB gaps.

In [5]:
# Load training data to check for potential issues
train_path = '/home/data/train.csv'
test_path = '/home/data/test.csv'

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

print("=== DATA DISTRIBUTION ANALYSIS ===")
print(f"Training set: {train_df.shape}")
print(f"Test set: {test_df.shape}")
print()

# Check for distribution shift in key features
print("Sex distribution:")
print("Train:", train_df['Sex'].value_counts(normalize=True))
print("Test:", test_df['Sex'].value_counts(normalize=True))
print()

print("Pclass distribution:")
print("Train:", train_df['Pclass'].value_counts(normalize=True))
print("Test:", test_df['Pclass'].value_counts(normalize=True))
print()

print("Embarked distribution:")
print("Train:", train_df['Embarked'].value_counts(normalize=True))
print("Test:", test_df['Embarked'].value_counts(normalize=True))
print()

# Check for missing values
print("Missing values in test set:")
print(test_df.isnull().sum())

=== DATA DISTRIBUTION ANALYSIS ===
Training set: (891, 12)
Test set: (418, 11)

Sex distribution:
Train: Sex
male      0.647587
female    0.352413
Name: proportion, dtype: float64
Test: Sex
male      0.636364
female    0.363636
Name: proportion, dtype: float64

Pclass distribution:
Train: Pclass
3    0.551066
1    0.242424
2    0.206510
Name: proportion, dtype: float64
Test: Pclass
3    0.521531
1    0.255981
2    0.222488
Name: proportion, dtype: float64

Embarked distribution:
Train: Embarked
S    0.724409
C    0.188976
Q    0.086614
Name: proportion, dtype: float64
Test: Embarked
S    0.645933
C    0.244019
Q    0.110048
Name: proportion, dtype: float64

Missing values in test set:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
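
The proportions above can be reduced to one number per feature. The sketch below uses total variation distance (half the sum of absolute differences between two categorical distributions), which is a choice made here for illustration rather than something the original analysis used. The values are copied from the output above; note that `value_counts` drops missing values by default, so the train `Embarked` proportions are over non-missing rows only.

```python
def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two categorical distributions."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Proportions copied from the printed output above.
train = {
    "Sex": {"male": 0.647587, "female": 0.352413},
    "Pclass": {3: 0.551066, 1: 0.242424, 2: 0.206510},
    "Embarked": {"S": 0.724409, "C": 0.188976, "Q": 0.086614},
}
test = {
    "Sex": {"male": 0.636364, "female": 0.363636},
    "Pclass": {3: 0.521531, 1: 0.255981, 2: 0.222488},
    "Embarked": {"S": 0.645933, "C": 0.244019, "Q": 0.110048},
}

shift = {name: total_variation(train[name], test[name]) for name in train}
for name, value in sorted(shift.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<9} TVD = {value:.4f}")
```

On these numbers `Embarked` shows the largest shift (roughly 0.08), with `Pclass` and `Sex` much smaller. Whether a shift of that size explains a meaningful part of the gap is not something this calculation can settle on its own.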


## 5. Feature Engineering Analysis

Let's see if our features might be causing overfitting.

In [6]:
print("=== FEATURE ANALYSIS ===")
print("Our feature engineering includes:")
print("- Title extraction (Mr, Mrs, Miss, Master, Other)")
print("- FamilySize (SibSp + Parch + 1)")
print("- IsAlone flag (FamilySize == 1)")
print("- Age bins [0,12,18,35,60,100]")
print("- FarePerPerson (Fare / FamilySize)")
print("- HasCabin (binary)")
print()
print("Potential overfitting risks:")
print("- Age bins with many thresholds (5 bins) on small dataset")
print("- Title 'Other' category with 29 rare titles combined")
print("- No interaction features (which could help generalization)")
print()
print("The model might be memorizing training patterns that don't generalize.")

=== FEATURE ANALYSIS ===
Our feature engineering includes:
- Title extraction (Mr, Mrs, Miss, Master, Other)
- FamilySize (SibSp + Parch + 1)
- IsAlone flag (FamilySize == 1)
- Age bins [0,12,18,35,60,100]
- FarePerPerson (Fare / FamilySize)
- HasCabin (binary)

Potential overfitting risks:
- Age bins with many thresholds (5 bins) on small dataset
- Title 'Other' category with 29 rare titles combined
- No interaction features (which could help generalization)

The model might be memorizing training patterns that don't generalize.
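
For reference, the features listed above could be built along the following lines. This is a sketch reconstructed from the list, not the pipeline's actual code: the regular expression, the right-closed bin edges, the `AgeBin` column name, and the handling of missing `Age` (left as NaN here) are all assumptions, and the sample rows are made up for illustration.

```python
import pandas as pd

COMMON_TITLES = {"Mr", "Mrs", "Miss", "Master"}
AGE_BINS = [0, 12, 18, 35, 60, 100]


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the engineered columns added."""
    out = df.copy()

    # Title is the token between the comma and the first period in Name.
    title = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    out["Title"] = title.where(title.isin(COMMON_TITLES), "Other")

    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)

    # Integer bin codes 0-4; intervals are right-closed, e.g. (18, 35].
    out["AgeBin"] = pd.cut(out["Age"], bins=AGE_BINS, labels=False)

    out["FarePerPerson"] = out["Fare"] / out["FamilySize"]
    out["HasCabin"] = out["Cabin"].notna().astype(int)
    return out


# Made-up rows in the Titanic column layout, for illustration only.
sample = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Dr. Alex", "Poe, Master. Tim"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 2],
    "Age": [22.0, 38.0, None, 5.0],
    "Fare": [7.25, 71.0, 30.0, 21.0],
    "Cabin": [None, "C85", None, None],
})

features = add_features(sample)
print(features[["Title", "FamilySize", "IsAlone", "AgeBin", "FarePerPerson", "HasCabin"]])
```

Writing the transformations as one function makes it easier to apply exactly the same steps to train and test, which matters when hunting for leakage or train/test inconsistencies.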


## 6. Model Complexity Check

Let's check if our model is too complex for the dataset size.

In [7]:
print("=== MODEL COMPLEXITY ANALYSIS ===")
print("XGBoost parameters:")
print("- n_estimators: 500")
print("- max_depth: 5")
print("- learning_rate: 0.1")
print()
print(f"Training set size: {train_df.shape[0]} samples")
print("Number of features: ~15 (after encoding)")
print()
print("Complexity assessment:")
print("- 500 trees is reasonable for 891 samples")
print("- max_depth=5 is moderate complexity")
print("- But with many engineered features, could be overfitting")
print()
print("Suggestion: Try reducing complexity:")
print("- max_depth: 3-4 instead of 5")
print("- learning_rate: 0.05 instead of 0.1")
print("- Add regularization: min_child_weight, gamma")

=== MODEL COMPLEXITY ANALYSIS ===
XGBoost parameters:
- n_estimators: 500
- max_depth: 5
- learning_rate: 0.1

Training set size: 891 samples
Number of features: ~15 (after encoding)

Complexity assessment:
- 500 trees is reasonable for 891 samples
- max_depth=5 is moderate complexity
- But with many engineered features, could be overfitting

Suggestion: Try reducing complexity:
- max_depth: 3-4 instead of 5
- learning_rate: 0.05 instead of 0.1
- Add regularization: min_child_weight, gamma
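
The suggestion above can be turned into an explicit list of candidate configurations. In this sketch the baseline values and the `max_depth` and `learning_rate` candidates come from the analysis, while the `min_child_weight` and `gamma` values are placeholders chosen for illustration and are not recommendations from this notebook.

```python
from itertools import product

# Current settings, as listed in the analysis above.
baseline = {"n_estimators": 500, "max_depth": 5, "learning_rate": 0.1}

search_space = {
    "max_depth": [3, 4],
    "learning_rate": [0.05],
    "min_child_weight": [1, 3, 5],  # placeholder values, not from the notebook
    "gamma": [0.0, 0.5, 1.0],       # placeholder values, not from the notebook
}

# Every combination of the search space, layered over the baseline.
candidates = [
    {**baseline, **dict(zip(search_space, values))}
    for values in product(*search_space.values())
]

print(f"{len(candidates)} candidate configurations")
print(candidates[0])
```

Each dictionary uses parameter names accepted by XGBoost's scikit-learn wrapper, so it could be passed as `XGBClassifier(**params)` and scored with the same CV splits as the baseline.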


## 7. Action Plan

Based on this analysis, here's what we need to do:

In [8]:
print("=== ACTION PLAN ===")
print()
print("1. CORRECT THE GAP CALCULATION:")
print("   - Real gap is 9.2%, not 83.09%")
print("   - CV: 83.84%, LB: 74.64%")
print("   - This is large but manageable")
print()
print("2. IMMEDIATE NEXT STEPS:")
print("   a) Hyperparameter tuning (Evaluator's top priority)")
print("   b) Try simpler model (max_depth=3-4)")
print("   c) Add interaction features (Pclass×Sex, Age×Sex)")
print("   d) Refine title categories (split 'Other')")
print()
print("3. ENSEMBLE STRATEGY:")
print("   - Add Logistic Regression for diversity")
print("   - Weighted blend: 75% XGBoost + 25% LR")
print("   - This often helps with generalization")
print()
print("4. MONITORING:")
print("   - Track CV-LB gap after each change")
print("   - Aim to reduce gap to <5%")
print("   - Target LB score: 0.80+")

=== ACTION PLAN ===

1. CORRECT THE GAP CALCULATION:
   - Real gap is 9.2%, not 83.09%
   - CV: 83.84%, LB: 74.64%
   - This is large but manageable

2. IMMEDIATE NEXT STEPS:
   a) Hyperparameter tuning (Evaluator's top priority)
   b) Try simpler model (max_depth=3-4)
   c) Add interaction features (Pclass×Sex, Age×Sex)
   d) Refine title categories (split 'Other')

3. ENSEMBLE STRATEGY:
   - Add Logistic Regression for diversity
   - Weighted blend: 75% XGBoost + 25% LR
   - This often helps with generalization

4. MONITORING:
   - Track CV-LB gap after each change
   - Aim to reduce gap to <5%
   - Target LB score: 0.80+
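
The 75/25 blend in step 3 amounts to a weighted average of the two models' predicted survival probabilities, followed by a threshold. A minimal sketch, assuming both models output probabilities for the positive class and that 0.5 is the decision threshold; `blend` is a hypothetical helper and the probabilities below are made up for illustration.

```python
def blend(xgb_proba, lr_proba, xgb_weight=0.75, threshold=0.5):
    """Weighted average of two probability lists, plus thresholded labels."""
    if len(xgb_proba) != len(lr_proba):
        raise ValueError("probability lists must have the same length")
    if not 0.0 <= xgb_weight <= 1.0:
        raise ValueError("xgb_weight must be between 0 and 1")

    lr_weight = 1.0 - xgb_weight
    blended = [xgb_weight * x + lr_weight * l for x, l in zip(xgb_proba, lr_proba)]
    labels = [int(p >= threshold) for p in blended]
    return blended, labels


# Made-up probabilities for three passengers.
xgb_proba = [0.9, 0.4, 0.2]
lr_proba = [0.6, 0.9, 0.1]

blended, labels = blend(xgb_proba, lr_proba)
print([round(p, 3) for p in blended])
print(labels)
```

In the second row XGBoost alone would predict 0, but a confident Logistic Regression pulls the blend over the threshold. That is the kind of disagreement the ensemble is meant to exploit; whether it helps on the leaderboard still has to be checked against CV.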
