# Loop 2 Strategic Analysis

**Goal:** Compare exp_000 with exp_001, quantify the CV-LB gap, and decide what to do next

**Current Status:**
- exp_000: CV 0.8316, LB 0.7584 (gap +0.0732)
- exp_001: CV 0.8372, LB unknown (not submitted)
- Target: 1.0 (unrealistic; the best legitimate LB scores are about 0.80-0.82)

In [1]:
import pandas as pd
import numpy as np

# Load data to analyze
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

print(f'Train shape: {train.shape}')
print(f'Test shape: {test.shape}')
print('\nTarget distribution:')
print(train['Survived'].value_counts(normalize=True))

Train shape: (891, 12)
Test shape: (418, 11)

Target distribution:
Survived
0    0.616162
1    0.383838
Name: proportion, dtype: float64


In [2]:
# Compare predictions from exp_000 and exp_001

candidate_000 = pd.read_csv('/home/code/submission_candidates/candidate_000.csv')
candidate_001 = pd.read_csv('/home/code/submission_candidates/candidate_001.csv')

print('Prediction comparison:')
print('\nexp_000 (XGBoost baseline):')
print(candidate_000['Survived'].value_counts())
print('\nexp_001 (Voting ensemble):')
print(candidate_001['Survived'].value_counts())

# How many predictions differ?
diff = (candidate_000['Survived'] != candidate_001['Survived']).sum()
print(f'\nPredictions that differ: {diff} ({diff/len(candidate_000)*100:.1f}%)')

Prediction comparison:

exp_000 (XGBoost baseline):
Survived
0    267
1    151
Name: count, dtype: int64

exp_001 (Voting ensemble):
Survived
0    255
1    163
Name: count, dtype: int64

Predictions that differ: 32 (7.7%)


In [3]:
# Analyze which passengers have different predictions
merged = candidate_000.merge(candidate_001, on='PassengerId', suffixes=('_exp000', '_exp001'))
merged['diff'] = merged['Survived_exp000'] != merged['Survived_exp001']

# Merge with test data to see characteristics
test_with_preds = test.merge(merged, on='PassengerId')
diff_passengers = test_with_preds[test_with_preds['diff']]

print(f'Passengers with different predictions: {len(diff_passengers)}')
print('\nCharacteristics of differing predictions:')
print(diff_passengers[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']].describe())

Passengers with different predictions: 32

Characteristics of differing predictions:
          Pclass        Age      SibSp      Parch        Fare
count  32.000000  27.000000  32.000000  32.000000   32.000000
mean    2.375000  28.814815   0.312500   0.250000   30.585159
std     0.941858   8.837364   0.470929   0.567962   43.248920
min     1.000000   9.000000   0.000000   0.000000    6.950000
25%     1.000000  22.500000   0.000000   0.000000    8.050000
50%     3.000000  30.000000   0.000000   0.000000   14.477100
75%     3.000000  35.500000   1.000000   0.000000   27.046875
max     3.000000  45.000000   1.000000   2.000000  211.500000


In [4]:
# Analyze prediction changes by key features
print('Prediction changes by Sex:')
for sex in ['male', 'female']:
    subset = diff_passengers[diff_passengers['Sex'] == sex]
    if len(subset) > 0:
        changed_to_1 = (subset['Survived_exp001'] == 1).sum()
        changed_to_0 = (subset['Survived_exp001'] == 0).sum()
        print(f'  {sex}: {len(subset)} changed ({changed_to_1} to survived, {changed_to_0} to died)')

print('\nPrediction changes by Pclass:')
for pclass in [1, 2, 3]:
    subset = diff_passengers[diff_passengers['Pclass'] == pclass]
    if len(subset) > 0:
        changed_to_1 = (subset['Survived_exp001'] == 1).sum()
        changed_to_0 = (subset['Survived_exp001'] == 0).sum()
        print(f'  Pclass {pclass}: {len(subset)} changed ({changed_to_1} to survived, {changed_to_0} to died)')

Prediction changes by Sex:
  male: 11 changed (1 to survived, 10 to died)
  female: 21 changed (21 to survived, 0 to died)

Prediction changes by Pclass:
  Pclass 1: 10 changed (0 to survived, 10 to died)
  Pclass 3: 22 changed (22 to survived, 0 to died)
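
The per-group loops above can also be written as a single cross-tabulation. The following is a minimal sketch, not the notebook's own code: `change_table` is a hypothetical helper, and the `toy` frame holds made-up rows that only mimic the column names `Sex`, `Survived_exp000`, and `Survived_exp001`.

```python
import pandas as pd


def change_table(df, group_col, old_col, new_col):
    """Cross-tabulate the direction of changed predictions by group.

    Hypothetical helper: keeps only rows where the two prediction columns
    disagree, then counts them per group and per direction of change.
    """
    changed = df[df[old_col] != df[new_col]]
    # The new prediction tells us the direction, because the old one differs.
    direction = changed[new_col].map({1: 'to survived', 0: 'to died'})
    return pd.crosstab(changed[group_col], direction.rename('direction'))


# Made-up rows for illustration only (not taken from the Titanic data).
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female'],
    'Survived_exp000': [1, 0, 0, 0, 1],
    'Survived_exp001': [0, 1, 1, 0, 1],
})

table = change_table(toy, 'Sex', 'Survived_exp000', 'Survived_exp001')
print(table)
```

On the real data, calling `change_table(test_with_preds, 'Pclass', 'Survived_exp000', 'Survived_exp001')` would be expected to reproduce the per-class counts printed above in one step.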


In [5]:
# CV-LB gap analysis
print('='*60)
print('CV-LB GAP ANALYSIS')
print('='*60)

print('\nexp_000 (XGBoost baseline):')
print('  CV: 0.8316')
print('  LB: 0.7584')
print('  Gap: +0.0732 (CV overestimates LB by 7.3 percentage points)')

print('\nexp_001 (Voting ensemble):')
print('  CV: 0.8372 (+0.0056 vs exp_000)')
print('  LB: UNKNOWN (not submitted)')
print('  Expected LB (if same gap): ~0.764')

print('\nKey question: Did the ensemble reduce the CV-LB gap?')
print('  - Lower variance (0.0239 vs 0.0324) suggests better generalization')
print('  - Simpler hyperparameters should reduce overfitting')
print('  - BUT we need LB feedback to confirm')

============================================================
CV-LB GAP ANALYSIS
============================================================

exp_000 (XGBoost baseline):
  CV: 0.8316
  LB: 0.7584
  Gap: +0.0732 (CV overestimates LB by 7.3 percentage points)

exp_001 (Voting ensemble):
  CV: 0.8372 (+0.0056 vs exp_000)
  LB: UNKNOWN (not submitted)
  Expected LB (if same gap): ~0.764

Key question: Did the ensemble reduce the CV-LB gap?
  - Lower variance (0.0239 vs 0.0324) suggests better generalization
  - Simpler hyperparameters should reduce overfitting
  - BUT we need LB feedback to confirm
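
Because the two candidates differ on only 32 of 418 test rows, the LB score of exp_001 is already constrained before it is submitted. The arithmetic below is a sketch under one loud assumption: that the LB score is plain accuracy over all 418 test rows (consistent with 0.7584 ≈ 317/418, but not confirmed anywhere in this notebook). The helper name `lb_after_flips` is illustrative.

```python
# ASSUMPTION: LB = accuracy over all 418 test rows (not confirmed by the notebook).
N_TEST = 418                     # rows in test.csv
N_DIFF = 32                      # predictions that differ between exp_000 and exp_001
CV_000, LB_000 = 0.8316, 0.7584
CV_001 = 0.8372

# Number of test rows exp_000 gets right under the assumption above.
correct_000 = round(LB_000 * N_TEST)


def lb_after_flips(n_fixed):
    """LB accuracy of exp_001 if n_fixed of the N_DIFF flipped rows became correct.

    Labels are binary, so every flipped row either fixes an exp_000 error
    or breaks a prediction that exp_000 had right.
    """
    n_broken = N_DIFF - n_fixed
    return (correct_000 + n_fixed - n_broken) / N_TEST


worst, best = lb_after_flips(0), lb_after_flips(N_DIFF)

# Naive estimate: exp_001 keeps exactly the same CV-LB gap as exp_000.
same_gap = CV_001 - (CV_000 - LB_000)

# Smallest number of helpful flips that lifts exp_001 above exp_000.
baseline = correct_000 / N_TEST
break_even = next(k for k in range(N_DIFF + 1) if lb_after_flips(k) > baseline)

print(f'exp_000 correct rows: {correct_000}/{N_TEST}')
print(f'exp_001 LB range:     {worst:.4f} to {best:.4f}')
print(f'Same-gap estimate:    {same_gap:.4f}')
print(f'Flips needed to beat exp_000: {break_even} of {N_DIFF}')
```

Under this assumption, exp_001 beats exp_000 only if at least 17 of the 32 changed predictions are right, and its LB score can move by at most 32/418 ≈ 0.077 in either direction.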


In [6]:
# Strategic Decision Analysis
print('='*60)
print('STRATEGIC DECISION')
print('='*60)

print('\nOption 1: Submit exp_001 NOW')
print('  Pros:')
print('    - Get LB feedback to validate CV-LB gap')
print('    - Understand if ensemble approach is working')
print('    - 6 submissions remaining - can afford this')
print('  Cons:')
print('    - Uses 1 submission')

print('\nOption 2: Implement stacking first, then submit')
print('  Pros:')
print('    - Stacking achieved 0.808 LB in reference kernels')
print('    - Could get better score before submitting')
print('  Cons:')
print('    - Flying blind without LB feedback')
print('    - If stacking also has large gap, we wasted effort')

print('\n>>> RECOMMENDATION: Submit exp_001 first <<<')
print('Reason: We need to validate if our approach is reducing the gap')
print('before investing in more complex techniques.')

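============================================================
STRATEGIC DECISION
============================================================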

Option 1: Submit exp_001 NOW
  Pros:
    - Get LB feedback to validate CV-LB gap
    - Understand if ensemble approach is working
    - 6 submissions remaining - can afford this
  Cons:
    - Uses 1 submission

Option 2: Implement stacking first, then submit
  Pros:
    - Stacking achieved 0.808 LB in reference kernels
    - Could get better score before submitting
  Cons:
    - Flying blind without LB feedback
    - If stacking also has large gap, we wasted effort

>>> RECOMMENDATION: Submit exp_001 first <<<
Reason: We need to validate if our approach is reducing the gap
before investing in more complex techniques.


## Key Findings

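1. **The CV-LB gap is the main issue** - exp_000's CV (0.8316) exceeds its LB score (0.7584) by 7.3 percentage points, which suggests significant overfitting
2. **The voting ensemble shows promise** - it has lower CV variance (0.0239 vs 0.0324) and simpler models, and it changes only 32 of 418 test predictions: 22 Pclass 3 passengers (21 of them female) move to survived and 10 Pclass 1 males move to died
3. **LB feedback is needed** - without a submission we cannot tell whether exp_001 narrows the gap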

## Next Steps

1. **Submit exp_001** to get LB feedback
2. **If LB improves**: Implement stacking with diverse base models
3. **If LB does not improve**: Investigate train/test distribution shift and try simpler models
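
Step 2 mentions stacking with diverse base models. A minimal sketch using scikit-learn's `StackingClassifier` follows. The synthetic data from `make_classification`, the choice of base models, and every hyperparameter are assumptions for illustration; none of them is the configuration used in exp_000, exp_001, or the reference kernels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# ASSUMPTION: synthetic stand-in for the engineered Titanic feature matrix.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Deliberately different model families, kept shallow and simple.
base_models = [
    ('rf', RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)),
    ('knn', KNeighborsClassifier(n_neighbors=7)),
    ('lr', LogisticRegression(max_iter=1000)),
]

# The meta-learner is trained on out-of-fold base predictions (inner cv=5).
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Outer CV scores the whole stack, so the meta-learner never sees its own test fold.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(stack, X, y, cv=outer_cv, scoring='accuracy')

print(f'CV accuracy: {scores.mean():.4f} (std {scores.std():.4f})')
```

Reporting the fold-to-fold spread alongside the mean makes it possible to compare a stacked model with exp_000 and exp_001 on both figures before spending a submission.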