# Loop 5 Analysis: Stacking vs Threshold-Tuned Ensemble

## Goal: Understand how the stacking predictions differ from the previous best submission, then decide on a submission strategy

### Key Questions:
1. How many predictions differ between stacking (exp_004) and threshold-tuned (exp_003)?
2. Which passengers receive different predictions, and what do they have in common?
3. Should we submit stacking at threshold 0.615 or try threshold 0.5 (higher CV)?
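
Question 3 comes down to how a probability cutoff turns the stacker's scores into 0/1 labels. The following is a minimal sketch only: the probabilities are made up (the real exp_004 probabilities are not shown in this notebook), `apply_threshold` is a hypothetical helper, and the use of `>=` rather than `>` at the cutoff is an assumption.

```python
import numpy as np

def apply_threshold(proba, threshold):
    """Label a passenger as a survivor when the probability reaches the threshold."""
    # Assumption: ties at the cutoff count as survivors (>=).
    return (np.asarray(proba) >= threshold).astype(int)

# Illustrative probabilities only; not taken from exp_004.
proba = np.array([0.10, 0.52, 0.58, 0.62, 0.90])

labels_050 = apply_threshold(proba, 0.5)
labels_0615 = apply_threshold(proba, 0.615)

# Raising the threshold can only turn predicted survivors into non-survivors.
print(labels_050.sum(), labels_0615.sum())
```

Because a higher cutoff can only remove predicted survivors, the 0.615 submission should never contain more survivors than the 0.5 one built from the same probabilities.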

In [1]:
import pandas as pd
import numpy as np

# Load submissions
stacking = pd.read_csv('/home/submission/submission.csv')  # exp_004 stacking
thresh_tuned = pd.read_csv('/home/code/submission_candidates/candidate_003.csv')  # exp_003 threshold-tuned
simple_rf = pd.read_csv('/home/code/submission_candidates/candidate_001.csv')  # exp_001 simple RF

print("Submission shapes:")
print(f"  Stacking: {stacking.shape}")
print(f"  Threshold-Tuned: {thresh_tuned.shape}")
print(f"  Simple RF: {simple_rf.shape}")

print("\nSurvival rates:")
print(f"  Stacking: {stacking['Survived'].mean():.3f} ({stacking['Survived'].sum()} survivors)")
print(f"  Threshold-Tuned: {thresh_tuned['Survived'].mean():.3f} ({thresh_tuned['Survived'].sum()} survivors)")
print(f"  Simple RF: {simple_rf['Survived'].mean():.3f} ({simple_rf['Survived'].sum()} survivors)")

Submission shapes:
  Stacking: (418, 2)
  Threshold-Tuned: (418, 2)
  Simple RF: (418, 2)

Survival rates:
  Stacking: 0.313 (131 survivors)
  Threshold-Tuned: 0.311 (130 survivors)
  Simple RF: 0.313 (131 survivors)


In [2]:
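# Compare stacking vs threshold-tuned.
# validate='one_to_one' raises if either file contains duplicate PassengerIds,
# which would otherwise silently inflate the row count of the inner merge.
comparison = stacking.merge(thresh_tuned, on='PassengerId', suffixes=('_stack', '_thresh'), validate='one_to_one')
differences = comparison[comparison['Survived_stack'] != comparison['Survived_thresh']]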

print(f"\nPredictions that differ (Stacking vs Threshold-Tuned): {len(differences)}")
print(f"Agreement rate: {(len(comparison) - len(differences)) / len(comparison) * 100:.1f}%")

if not differences.empty:
    print("\nDifferences breakdown:")
    stack_1_thresh_0 = ((differences['Survived_stack'] == 1) & (differences['Survived_thresh'] == 0)).sum()
    stack_0_thresh_1 = ((differences['Survived_stack'] == 0) & (differences['Survived_thresh'] == 1)).sum()
    print(f"  Stacking=1, Threshold=0: {stack_1_thresh_0}")
    print(f"  Stacking=0, Threshold=1: {stack_0_thresh_1}")
    print(f"\nDiffering PassengerIds: {differences['PassengerId'].tolist()}")


Predictions that differ (Stacking vs Threshold-Tuned): 15
Agreement rate: 96.4%

Differences breakdown:
  Stacking=1, Threshold=0: 8
  Stacking=0, Threshold=1: 7

Differing PassengerIds: [896, 910, 913, 924, 961, 1045, 1084, 1089, 1092, 1098, 1117, 1175, 1197, 1199, 1284]
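
The same comparison is repeated later against the simple RF, so it could be wrapped in a small function. This is a sketch under stated assumptions rather than the notebook's own code: `compare_submissions` and its result keys are hypothetical names, and the two frames below are toy data, not the real submissions.

```python
import pandas as pd

def compare_submissions(a, b, key="PassengerId", target="Survived"):
    """Return disagreement statistics between two submission DataFrames."""
    # one_to_one validation fails fast on duplicate keys in either frame.
    merged = a.merge(b, on=key, suffixes=("_a", "_b"), validate="one_to_one")
    col_a, col_b = f"{target}_a", f"{target}_b"
    differs = merged[col_a] != merged[col_b]
    return {
        "n_rows": len(merged),
        "n_differ": int(differs.sum()),
        "a1_b0": int(((merged[col_a] == 1) & (merged[col_b] == 0)).sum()),
        "a0_b1": int(((merged[col_a] == 0) & (merged[col_b] == 1)).sum()),
        "agreement": float((~differs).mean()),
        "ids": merged.loc[differs, key].tolist(),
    }

# Toy data for illustration only.
a = pd.DataFrame({"PassengerId": [1, 2, 3, 4], "Survived": [1, 0, 1, 0]})
b = pd.DataFrame({"PassengerId": [1, 2, 3, 4], "Survived": [1, 1, 0, 0]})
result = compare_submissions(a, b)
print(result)
```

A useful sanity check on any such comparison is that `a1_b0 - a0_b1` equals the difference in survivor counts between the two files; here 8 - 7 = 1 matches 131 vs 130 survivors.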


In [3]:
# Load test data to analyze differing passengers
test = pd.read_csv('/home/data/test.csv')

# Merge with differences
if len(differences) > 0:
    diff_passengers = test[test['PassengerId'].isin(differences['PassengerId'])]
    diff_passengers = diff_passengers.merge(differences[['PassengerId', 'Survived_stack', 'Survived_thresh']], on='PassengerId')
    
    print("\nDiffering passengers characteristics:")
    print(f"\nSex distribution:")
    print(diff_passengers['Sex'].value_counts())
    print(f"\nPclass distribution:")
    print(diff_passengers['Pclass'].value_counts())
    print(f"\nMean Age: {diff_passengers['Age'].mean():.1f}")
    print(f"\nDetailed view:")
    print(diff_passengers[['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived_stack', 'Survived_thresh']].to_string())


Differing passengers characteristics:

Sex distribution:
Sex
female    11
male       4
Name: count, dtype: int64

Pclass distribution:
Pclass
3    13
1     2
Name: count, dtype: int64

Mean Age: 26.0

Detailed view:
    PassengerId  Pclass     Sex    Age  SibSp  Parch      Fare  Survived_stack  Survived_thresh
0           896       3  female  22.00      1      1   12.2875               1                0
1           910       3  female  27.00      1      0    7.9250               1                0
2           913       3    male   9.00      0      1    3.1708               0                1
3           924       3  female  33.00      1      2   20.5750               1                0
4           961       1  female  60.00      1      4  263.0000               0                1
5          1045       3  female  36.00      0      2   12.1833               1                0
6          1084       3    male  11.50      1      1   14.5000               1                0
7          1089

In [4]:
# Compare stacking vs simple RF (our second-best LB)
comparison_rf = stacking.merge(simple_rf, on='PassengerId', suffixes=('_stack', '_rf'))
diff_rf = comparison_rf[comparison_rf['Survived_stack'] != comparison_rf['Survived_rf']]

print(f"\nPredictions that differ (Stacking vs Simple RF): {len(diff_rf)}")
print(f"Agreement rate: {(len(comparison_rf) - len(diff_rf)) / len(comparison_rf) * 100:.1f}%")

if len(diff_rf) > 0:
    print(f"\nDifferences breakdown:")
    stack_1_rf_0 = ((diff_rf['Survived_stack'] == 1) & (diff_rf['Survived_rf'] == 0)).sum()
    stack_0_rf_1 = ((diff_rf['Survived_stack'] == 0) & (diff_rf['Survived_rf'] == 1)).sum()
    print(f"  Stacking=1, RF=0: {stack_1_rf_0}")
    print(f"  Stacking=0, RF=1: {stack_0_rf_1}")


Predictions that differ (Stacking vs Simple RF): 24
Agreement rate: 94.3%

Differences breakdown:
  Stacking=1, RF=0: 12
  Stacking=0, RF=1: 12
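
The two pairwise comparisons can be combined to find rows where stacking stands alone, meaning the threshold-tuned ensemble and the simple RF agree with each other but stacking does not. The sketch below uses toy predictions; the real three-way overlap is not computed in this notebook, and the column names `stack`, `thresh`, and `rf` are illustrative.

```python
import pandas as pd

# Toy predictions for five passengers; illustrative only.
preds = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "stack":  [1, 0, 1, 0, 1],
    "thresh": [0, 0, 1, 1, 1],
    "rf":     [0, 1, 1, 1, 1],
})

# Stacking stands alone when the other two agree and stacking differs from them.
others_agree = preds["thresh"] == preds["rf"]
stack_alone = others_agree & (preds["stack"] != preds["thresh"])

# Count of distinct labels per row; 1 means all three models agree.
n_unique = preds[["stack", "thresh", "rf"]].nunique(axis=1)

print(preds.loc[stack_alone, "PassengerId"].tolist())
print(int((n_unique == 1).sum()))
```

Rows flagged by `stack_alone` are the ones where a leaderboard change after submitting stacking cannot be attributed to either earlier model.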


In [5]:
# Analyze the differing passengers between stacking and RF
if len(diff_rf) > 0:
    diff_rf_passengers = test[test['PassengerId'].isin(diff_rf['PassengerId'])]
    diff_rf_passengers = diff_rf_passengers.merge(diff_rf[['PassengerId', 'Survived_stack', 'Survived_rf']], on='PassengerId')
    
    print("\nDiffering passengers (Stacking vs RF):")
    print(f"\nSex distribution:")
    print(diff_rf_passengers['Sex'].value_counts())
    print(f"\nPclass distribution:")
    print(diff_rf_passengers['Pclass'].value_counts())
    
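    # Group by prediction direction. Every row here is a disagreement, so
    # Survived_stack == 1 implies Survived_rf == 0, and Survived_rf == 1
    # implies Survived_stack == 0.
    stack_1_rf_0_passengers = diff_rf_passengers[diff_rf_passengers['Survived_stack'] == 1]
    stack_0_rf_1_passengers = diff_rf_passengers[diff_rf_passengers['Survived_rf'] == 1]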
    
    print(f"\nStacking=1, RF=0 ({len(stack_1_rf_0_passengers)} passengers):")
    print(f"  Sex: {stack_1_rf_0_passengers['Sex'].value_counts().to_dict()}")
    print(f"  Pclass: {stack_1_rf_0_passengers['Pclass'].value_counts().to_dict()}")
    
    print(f"\nStacking=0, RF=1 ({len(stack_0_rf_1_passengers)} passengers):")
    print(f"  Sex: {stack_0_rf_1_passengers['Sex'].value_counts().to_dict()}")
    print(f"  Pclass: {stack_0_rf_1_passengers['Pclass'].value_counts().to_dict()}")


Differing passengers (Stacking vs RF):

Sex distribution:
Sex
female    15
male       9
Name: count, dtype: int64

Pclass distribution:
Pclass
3    20
1     4
Name: count, dtype: int64

Stacking=1, RF=0 (12 passengers):
  Sex: {'female': 6, 'male': 6}
  Pclass: {3: 12}

Stacking=0, RF=1 (12 passengers):
  Sex: {'female': 9, 'male': 3}
  Pclass: {3: 8, 1: 4}


In [6]:
# Summary: What should we submit?
print("="*70)
print("SUBMISSION DECISION ANALYSIS")
print("="*70)

print("\nSubmission History:")
print("  exp_000 (XGBoost): CV 0.8316, LB 0.7584, Gap +7.3%")
print("  exp_001 (Simple RF): CV 0.8238, LB 0.7775, Gap +4.6%")
print("  exp_003 (Threshold-Tuned): CV 0.8373, LB 0.7847, Gap +5.3%")

print("\nCurrent candidates:")
print("  exp_004 (Stacking, threshold 0.615): CV 0.8373, 131 survivors (31.3%)")
print("  exp_004 (Stacking, threshold 0.5): CV 0.8496, 133 survivors (31.8%)")

print("\nKey observations:")
print(f"  1. Stacking at threshold 0.615 has SAME CV as Threshold-Tuned (0.8373)")
print(f"  2. Stacking at threshold 0.5 has HIGHER CV (0.8496) but slightly more survivors")
print(f"  3. Stacking differs from Threshold-Tuned on {len(differences)} predictions")
print(f"  4. Stacking differs from Simple RF on {len(diff_rf)} predictions")

print("\nRecommendation:")
print("  Submit stacking at threshold 0.615 first (matches proven survival rate)")
print("  If no improvement, try stacking at threshold 0.5 (higher CV)")
print("  5 submissions remaining - can afford to test both")

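======================================================================
SUBMISSION DECISION ANALYSIS
======================================================================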

Submission History:
  exp_000 (XGBoost): CV 0.8316, LB 0.7584, Gap +7.3%
  exp_001 (Simple RF): CV 0.8238, LB 0.7775, Gap +4.6%
  exp_003 (Threshold-Tuned): CV 0.8373, LB 0.7847, Gap +5.3%

Current candidates:
  exp_004 (Stacking, threshold 0.615): CV 0.8373, 131 survivors (31.3%)
  exp_004 (Stacking, threshold 0.5): CV 0.8496, 133 survivors (31.8%)

Key observations:
  1. Stacking at threshold 0.615 has SAME CV as Threshold-Tuned (0.8373)
  2. Stacking at threshold 0.5 has HIGHER CV (0.8496) but slightly more survivors
  3. Stacking differs from Threshold-Tuned on 15 predictions
  4. Stacking differs from Simple RF on 24 predictions

Recommendation:
  Submit stacking at threshold 0.615 first (matches proven survival rate)
  If no improvement, try stacking at threshold 0.5 (higher CV)
  5 submissions remaining - can afford to test both
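
The gap figures and the stakes of the 15 differing predictions can be reproduced with a few lines of arithmetic. This is a rough sketch: the scores are copied from the submission history above, and the bound assumes the leaderboard reports plain accuracy over all 418 rows, which is an assumption not confirmed in this notebook.

```python
# (CV, LB) scores copied from the submission history above.
history = {
    "exp_000 (XGBoost)": (0.8316, 0.7584),
    "exp_001 (Simple RF)": (0.8238, 0.7775),
    "exp_003 (Threshold-Tuned)": (0.8373, 0.7847),
}

# CV-to-LB gap in percentage points.
gaps = {name: round((cv - lb) * 100, 1) for name, (cv, lb) in history.items()}

n_test = 418    # rows in each submission
n_differ = 15   # stacking vs threshold-tuned disagreements

# Assumption: the leaderboard scores accuracy over all n_test rows.
per_row = 100 / n_test            # percentage points per row
max_swing = n_differ * per_row    # upper bound if every differing row favoured one model

print(gaps)
print(round(per_row, 2), round(max_swing, 2))
```

Under that assumption, each row is worth about 0.24 percentage points, so stacking can move the LB score by at most about 3.6 points relative to exp_003, in either direction.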
