# Loop 6 Analysis: Feature Engineering Results

## Goal: Analyze exp_005 (Feature Engineering) and decide on next steps

### Key Questions:
1. How does exp_005 compare to previous best (exp_003)?
2. What predictions changed and why?
3. Should we submit exp_005?

In [1]:
import pandas as pd
import numpy as np

# Load all candidate submissions
candidates = {}
for i in range(6):
    name = f'candidate_{i:03d}'
    try:
        df = pd.read_csv(f'/home/code/submission_candidates/{name}.csv')
    except FileNotFoundError:
        # Skip experiments that never produced a submission file
        continue
    candidates[f'exp_{i:03d}'] = df
    print(f"Loaded {name}: {df['Survived'].sum()} survivors ({df['Survived'].mean()*100:.1f}%)")

Loaded candidate_000: 157 survivors (37.6%)
Loaded candidate_001: 131 survivors (31.3%)
Loaded candidate_002: 157 survivors (37.6%)
Loaded candidate_003: 130 survivors (31.1%)
Loaded candidate_004: 131 survivors (31.3%)
Loaded candidate_005: 131 survivors (31.3%)


In [2]:
# Compare exp_005 (Feature Engineering) vs exp_003 (Best LB)
exp_003 = candidates['exp_003']  # Best LB: 0.7847
exp_005 = candidates['exp_005']  # New: Feature Engineering

# Merge to compare
comparison = exp_003.merge(exp_005, on='PassengerId', suffixes=('_003', '_005'))
comparison['diff'] = comparison['Survived_003'] != comparison['Survived_005']

print("\nComparison: exp_003 (Best LB 0.7847) vs exp_005 (Feature Engineering)")
print("="*60)
print(f"Total passengers: {len(comparison)}")
print(f"Same predictions: {(~comparison['diff']).sum()} ({(~comparison['diff']).mean()*100:.1f}%)")
print(f"Different predictions: {comparison['diff'].sum()} ({comparison['diff'].mean()*100:.1f}%)")

print(f"\nexp_003 survivors: {comparison['Survived_003'].sum()} ({comparison['Survived_003'].mean()*100:.1f}%)")
print(f"exp_005 survivors: {comparison['Survived_005'].sum()} ({comparison['Survived_005'].mean()*100:.1f}%)")


Comparison: exp_003 (Best LB 0.7847) vs exp_005 (Feature Engineering)
Total passengers: 418
Same predictions: 409 (97.8%)
Different predictions: 9 (2.2%)

exp_003 survivors: 130 (31.1%)
exp_005 survivors: 131 (31.3%)
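
The merge in the cell above assumes both files list the same `PassengerId` values exactly once. Below is a minimal sketch of a stricter comparison. It uses the `validate` and `indicator` options of pandas `merge` plus a cross-tabulation of the two prediction columns. The function name `compare_predictions` and the toy frames are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd

def compare_predictions(a, b, key="PassengerId", target="Survived"):
    """Return (transition table, number of unmatched ids) for two submissions."""
    merged = a.merge(
        b, on=key, how="outer", suffixes=("_a", "_b"),
        validate="one_to_one",  # raises MergeError if an id is duplicated
        indicator=True,         # adds a _merge column: both / left_only / right_only
    )
    both = merged[merged["_merge"] == "both"]
    unmatched = len(merged) - len(both)
    # Rows: prediction in a, columns: prediction in b
    table = pd.crosstab(
        both[f"{target}_a"].astype(int),
        both[f"{target}_b"].astype(int),
    )
    return table, unmatched

# Toy data for illustration only (not the real submissions)
toy_a = pd.DataFrame({"PassengerId": [1, 2, 3, 4, 5], "Survived": [0, 1, 0, 1, 0]})
toy_b = pd.DataFrame({"PassengerId": [1, 2, 3, 4, 5], "Survived": [1, 1, 0, 0, 0]})
table, unmatched = compare_predictions(toy_a, toy_b)
print(table)
print(f"Unmatched ids: {unmatched}")
```

Applied to `exp_003` and `exp_005`, the off-diagonal cells of the table should sum to the 9 differing predictions reported above, and a non-zero unmatched count would flag a malformed submission file.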


In [3]:
# Load test data to analyze differing predictions
test = pd.read_csv('/home/data/test.csv')

# Get passengers where predictions differ
diff_ids = comparison[comparison['diff']]['PassengerId'].values
diff_passengers = test[test['PassengerId'].isin(diff_ids)].copy()

# Add predictions
diff_passengers = diff_passengers.merge(comparison[['PassengerId', 'Survived_003', 'Survived_005']], on='PassengerId')

print(f"\nPassengers with different predictions ({len(diff_passengers)}):")
print("="*60)
print(diff_passengers[['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived_003', 'Survived_005']].to_string())


Passengers with different predictions (9):
   PassengerId  Pclass     Sex   Age  SibSp  Parch      Fare  Survived_003  Survived_005
0          911       3  female  45.0      0      0    7.2250             0             1
1          913       3    male   9.0      0      1    3.1708             1             0
2          982       3  female  22.0      1      0   13.9000             0             1
3         1017       3  female  17.0      0      1   16.1000             1             0
4         1051       3  female  26.0      0      2   13.7750             1             0
5         1091       3  female   NaN      0      0    8.1125             0             1
6         1094       1    male  47.0      1      0  227.5250             0             1
7         1237       3  female  16.0      0      0    7.6500             1             0
8         1274       3  female   NaN      0      0   14.5000             0             1


In [4]:
# Analyze the differing predictions
print("\nAnalysis of differing predictions:")
print("="*60)

# exp_003=0, exp_005=1 (Feature Engineering predicts survival, Best LB predicts death)
fe_survival = diff_passengers[(diff_passengers['Survived_003'] == 0) & (diff_passengers['Survived_005'] == 1)]
print(f"\nexp_005 predicts SURVIVAL where exp_003 predicts DEATH ({len(fe_survival)} passengers):")
if len(fe_survival) > 0:
    print(f"  Sex: {fe_survival['Sex'].value_counts().to_dict()}")
    print(f"  Pclass: {fe_survival['Pclass'].value_counts().to_dict()}")
    print(f"  Mean Age: {fe_survival['Age'].mean():.1f}")
    print(f"  Mean Fare: {fe_survival['Fare'].mean():.1f}")
    print(f"  FamilySize: {(fe_survival['SibSp'] + fe_survival['Parch'] + 1).value_counts().to_dict()}")

# exp_003=1, exp_005=0 (Feature Engineering predicts death, Best LB predicts survival)
fe_death = diff_passengers[(diff_passengers['Survived_003'] == 1) & (diff_passengers['Survived_005'] == 0)]
print(f"\nexp_005 predicts DEATH where exp_003 predicts SURVIVAL ({len(fe_death)} passengers):")
if len(fe_death) > 0:
    print(f"  Sex: {fe_death['Sex'].value_counts().to_dict()}")
    print(f"  Pclass: {fe_death['Pclass'].value_counts().to_dict()}")
    print(f"  Mean Age: {fe_death['Age'].mean():.1f}")
    print(f"  Mean Fare: {fe_death['Fare'].mean():.1f}")
    print(f"  FamilySize: {(fe_death['SibSp'] + fe_death['Parch'] + 1).value_counts().to_dict()}")


Analysis of differing predictions:

exp_005 predicts SURVIVAL where exp_003 predicts DEATH (5 passengers):
  Sex: {'female': 4, 'male': 1}
  Pclass: {3: 4, 1: 1}
  Mean Age: 38.0
  Mean Fare: 54.3
  FamilySize: {1: 3, 2: 2}

exp_005 predicts DEATH where exp_003 predicts SURVIVAL (4 passengers):
  Sex: {'female': 3, 'male': 1}
  Pclass: {3: 4}
  Mean Age: 17.0
  Mean Fare: 10.2
  FamilySize: {2: 2, 3: 1, 1: 1}
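
The 9 differing predictions bound how far the leaderboard score can move relative to exp_003. The following is a rough worked calculation, assuming the leaderboard reports plain accuracy over all 418 test rows. That is an assumption, because the notebook does not state how the LB is scored. The helper name `lb_range` is hypothetical.

```python
def lb_range(base_lb, n_test, n_diff):
    """Possible accuracies of a new submission that differs from a baseline
    in n_diff binary predictions, assuming accuracy over n_test rows."""
    outcomes = []
    for k in range(n_diff + 1):   # k = differing rows the new model gets right
        net = k - (n_diff - k)    # the baseline was right on the other n_diff - k rows
        outcomes.append(round(base_lb + net / n_test, 4))
    return outcomes

scores = lb_range(base_lb=0.7847, n_test=418, n_diff=9)
print(f"worst case: {scores[0]}, best case: {scores[-1]}")
print(f"smallest possible change: {1 / 418:.4f}")
```

Under this assumption, each differing row is right for exactly one of the two models, so the net change is an odd number of rows. The score therefore cannot stay exactly equal to 0.7847, and the smallest possible shift is 1/418, roughly 0.0024 in either direction.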


In [5]:
# Historical pattern analysis
print("\n" + "="*60)
print("HISTORICAL PATTERN ANALYSIS")
print("="*60)

print("\nSubmission History:")
print(f"  exp_000: CV 0.8316, LB 0.7584 (157 survivors, 37.6%)")
print(f"  exp_001: CV 0.8238, LB 0.7775 (131 survivors, 31.3%)")
print(f"  exp_003: CV 0.8373, LB 0.7847 (130 survivors, 31.1%) <- BEST LB")
print(f"  exp_004: CV 0.8373, LB 0.7631 (131 survivors, 31.3%) <- Stacking FAILED")
print(f"  exp_005: CV 0.8395, LB ??? (131 survivors, 31.3%) <- Feature Engineering")

print("\nKey Patterns:")
print("  1. ~31% survival rate (130-131 survivors) has scored best on LB so far")
print("  2. CV alone is NOT predictive (same CV gave LB 0.7847 vs 0.7631)")
print("  3. Simpler models tend to have smaller CV-LB gap")
print("  4. Stacking (exp_004) appears to overfit on this small dataset")

print("\nexp_005 Assessment:")
print(f"  - CV improved: 0.8373 -> 0.8395 (+0.22%)")
print(f"  - Survival rate: 31.3% (matches optimal pattern)")
print(f"  - Approach: Feature engineering (simpler than stacking)")
print(f"  - Only 9 predictions differ from best LB model")


HISTORICAL PATTERN ANALYSIS

Submission History:
  exp_000: CV 0.8316, LB 0.7584 (157 survivors, 37.6%)
  exp_001: CV 0.8238, LB 0.7775 (131 survivors, 31.3%)
  exp_003: CV 0.8373, LB 0.7847 (130 survivors, 31.1%) <- BEST LB
  exp_004: CV 0.8373, LB 0.7631 (131 survivors, 31.3%) <- Stacking FAILED
  exp_005: CV 0.8395, LB ??? (131 survivors, 31.3%) <- Feature Engineering

Key Patterns:
  1. ~31% survival rate (130-131 survivors) has scored best on LB so far
  2. CV alone is NOT predictive (same CV gave LB 0.7847 vs 0.7631)
  3. Simpler models tend to have smaller CV-LB gap
  4. Stacking (exp_004) appears to overfit on this small dataset

exp_005 Assessment:
  - CV improved: 0.8373 -> 0.8395 (+0.22%)
  - Survival rate: 31.3% (matches optimal pattern)
  - Approach: Feature engineering (simpler than stacking)
  - Only 9 predictions differ from best LB model
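
The CV-LB gap mentioned in the key patterns can be made explicit with a short calculation. This is a minimal sketch using only the CV and LB scores listed in the submission history above. The dictionary name `history` is an illustrative choice.

```python
# CV and LB scores copied from the submission history above
history = {
    "exp_000": (0.8316, 0.7584),
    "exp_001": (0.8238, 0.7775),
    "exp_003": (0.8373, 0.7847),
    "exp_004": (0.8373, 0.7631),
}

# Gap between cross-validation accuracy and leaderboard accuracy
gaps = {name: round(cv - lb, 4) for name, (cv, lb) in history.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name}: CV-LB gap = {gap:.4f}")
```

The gaps range from 0.0463 (exp_001) to 0.0742 (exp_004). exp_003 and exp_004 share a CV score yet sit 0.0216 apart on the LB, which is about ten times the 0.0022 CV gain of exp_005. That is why the CV improvement on its own is weak evidence.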


In [6]:
# Decision analysis
print("\n" + "="*60)
print("DECISION ANALYSIS: Should we submit exp_005?")
print("="*60)

print("\nPROS:")
print("  + CV improved from 0.8373 to 0.8395 (+0.22%)")
print("  + Survival rate matches optimal pattern (31.3%)")
print("  + Feature engineering is simpler than stacking (which failed)")
print("  + Only 9 predictions differ - low risk")
print("  + New features (FamilySize, IsAlone, Has_Cabin) are proven predictors")

print("\nCONS:")
print("  - CV improvement is small (+0.22%)")
print("  - CV has been unreliable (same CV gave different LB scores)")
print("  - TicketFreq has minor leakage (computed from combined data)")
print("  - Only 4 submissions remaining")

print("\nEVALUATOR RECOMMENDATION:")
print("  Submit exp_005 to validate whether CV improvement translates to LB")
print("  Expected outcomes:")
print("    - If LB > 0.7847: Feature engineering is the right direction")
print("    - If LB ≈ 0.7847: Changes roughly cancel out")
print("    - If LB < 0.7847: New features may be adding noise")

print("\nMY ASSESSMENT:")
print("  SUBMIT exp_005")
print("  Rationale:")
print("    1. CV improved (even if small)")
print("    2. Survival rate matches optimal pattern")
print("    3. Approach is simpler than stacking (which failed)")
print("    4. Need LB feedback to guide next steps")
print("    5. 4 submissions remaining is enough for iteration")


DECISION ANALYSIS: Should we submit exp_005?

PROS:
  + CV improved from 0.8373 to 0.8395 (+0.22%)
  + Survival rate matches optimal pattern (31.3%)
  + Feature engineering is simpler than stacking (which failed)
  + Only 9 predictions differ - low risk
  + New features (FamilySize, IsAlone, Has_Cabin) are proven predictors

CONS:
  - CV improvement is small (+0.22%)
  - CV has been unreliable (same CV gave different LB scores)
  - TicketFreq has minor leakage (computed from combined data)
  - Only 4 submissions remaining

EVALUATOR RECOMMENDATION:
  Submit exp_005 to validate whether CV improvement translates to LB
  Expected outcomes:
    - If LB > 0.7847: Feature engineering is the right direction
    - If LB ≈ 0.7847: Changes roughly cancel out
    - If LB < 0.7847: New features may be adding noise

MY ASSESSMENT:
  SUBMIT exp_005
  Rationale:
    1. CV improved (even if small)
    2. Survival rate matches optimal pattern
    3. Approach is simpler than stacking (which failed)
    4. Need LB feedback to guide next steps
    5. 4 submissions remaining is enough for iteration