# Loop 3 Analysis: What's Working and What to Try Next

**Goal:** Understand why stacking underperformed and identify high-impact improvements

**Key findings from new kernels:**
1. Top 4% kernel: Outlier removal + GridSearchCV tuning + soft voting
2. Advanced FE kernel (0.837 LB): Family/Ticket survival rate encoding

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import warnings
warnings.filterwarnings('ignore')

train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

print(f'Train: {train.shape}, Test: {test.shape}')

## 1. Analyze Family/Ticket Survival Rates

The advanced FE kernel achieved 0.837 LB by encoding survival rates based on:
- Family surname (passengers sharing a surname are assumed to have traveled together, although unrelated passengers can share a common surname)
- Ticket number (passengers on the same ticket traveled together)

In [None]:
# Extract surname from Name
train['Surname'] = train['Name'].apply(lambda x: x.split(',')[0])
test['Surname'] = test['Name'].apply(lambda x: x.split(',')[0])

# Family survival rate (from training data only)
family_survival = train.groupby('Surname')['Survived'].agg(['mean', 'count']).reset_index()
family_survival.columns = ['Surname', 'Family_Survival_Rate', 'Family_Count']

print('Family groups with multiple members:')
print(family_survival[family_survival['Family_Count'] > 1].sort_values('Family_Count', ascending=False).head(20))

In [None]:
# Ticket survival rate (from training data only)
ticket_survival = train.groupby('Ticket')['Survived'].agg(['mean', 'count']).reset_index()
ticket_survival.columns = ['Ticket', 'Ticket_Survival_Rate', 'Ticket_Count']

print('\nTicket groups with multiple passengers:')
print(ticket_survival[ticket_survival['Ticket_Count'] > 1].sort_values('Ticket_Count', ascending=False).head(20))

In [None]:
# Check how many test passengers have family/ticket matches in train
train_surnames = set(train['Surname'].unique())
test_surnames = set(test['Surname'].unique())
shared_surnames = train_surnames.intersection(test_surnames)

train_tickets = set(train['Ticket'].unique())
test_tickets = set(test['Ticket'].unique())
shared_tickets = train_tickets.intersection(test_tickets)

print(f'\nSurname overlap:')
print(f'  Train unique surnames: {len(train_surnames)}')
print(f'  Test unique surnames: {len(test_surnames)}')
print(f'  Shared surnames: {len(shared_surnames)}')
print(f'  Test passengers with shared surname: {test["Surname"].isin(shared_surnames).sum()} / {len(test)}')

print(f'\nTicket overlap:')
print(f'  Train unique tickets: {len(train_tickets)}')
print(f'  Test unique tickets: {len(test_tickets)}')
print(f'  Shared tickets: {len(shared_tickets)}')
print(f'  Test passengers with shared ticket: {test["Ticket"].isin(shared_tickets).sum()} / {len(test)}')
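
Once the overlap is known, the train-derived rates can be mapped onto the test set. The sketch below uses a toy frame rather than the Titanic files. The `encode_group_rate` helper and the fallback to the global train survival rate for unseen groups are assumptions made for illustration, not necessarily what the advanced FE kernel does.

```python
import pandas as pd

def encode_group_rate(train_df, test_df, key, target="Survived"):
    """Map per-group target means from train onto test (hypothetical helper).

    Groups that never appear in train fall back to the global train mean.
    """
    group_rate = train_df.groupby(key)[target].mean()
    global_rate = train_df[target].mean()
    return test_df[key].map(group_rate).fillna(global_rate)

# Toy data standing in for train/test
toy_train = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "C3"],
    "Survived": [1, 0, 1, 0],
})
toy_test = pd.DataFrame({"Ticket": ["A1", "B2", "Z9"]})

toy_test["Ticket_Survival_Rate"] = encode_group_rate(toy_train, toy_test, "Ticket")
print(toy_test)
```

The same call with `key="Surname"` would give the family version. The fewer test passengers share a group with train, the more of this column collapses to the fallback value.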

## 2. Outlier Detection (from Top 4% kernel)

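The Top 4% kernel removes outliers using Tukey's IQR method: a value is an outlier if it lies below Q1 - 1.5×IQR or above Q3 + 1.5×IQR, and a row is dropped only if it is an outlier in more than `n` of the checked features.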

In [None]:
from collections import Counter

def detect_outliers(df, n, features):
    """Detect outliers using Tukey IQR method"""
    outlier_indices = []
    
    for col in features:
        Q1 = np.percentile(df[col].dropna(), 25)
        Q3 = np.percentile(df[col].dropna(), 75)
        IQR = Q3 - Q1
        outlier_step = 1.5 * IQR
        
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
        outlier_indices.extend(outlier_list_col)
    
    # Keep only rows flagged as outliers in more than n features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v > n)
    
    return multiple_outliers

# Flag rows that are outliers in more than 2 of Age, SibSp, Parch, Fare
outliers = detect_outliers(train, 2, ['Age', 'SibSp', 'Parch', 'Fare'])
print(f'Outliers detected: {len(outliers)}')
print(f'\nOutlier rows:')
print(train.loc[outliers][['Name', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']])
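
The detection step only returns index labels, so the removal itself is a separate step. The sketch below shows the fence calculation and the drop on a toy frame. It is a simplified, single-feature version (the column values are made up), whereas `detect_outliers` above requires a row to be flagged in more than `n` features.

```python
import numpy as np
import pandas as pd

# Toy data: one extreme fare among ordinary ones
toy = pd.DataFrame({"Fare": [7.25, 8.05, 13.0, 26.0, 512.33],
                    "Survived": [0, 1, 0, 1, 1]})

# Tukey fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = np.percentile(toy["Fare"], [25, 75])
step = 1.5 * (q3 - q1)
flagged = toy.index[(toy["Fare"] < q1 - step) | (toy["Fare"] > q3 + step)].tolist()

# Drop by index label, then renumber so later positional code stays valid
toy_clean = toy.drop(index=flagged).reset_index(drop=True)
print(f"Flagged index labels: {flagged}; rows kept: {len(toy_clean)}")
```

Rows should be dropped from the training frame only; test rows still need a prediction each.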

## 3. Compare Experiment Results

Compare CV and LB scores across experiments to see where stacking fell short.

In [None]:
# Summarize scores from previous experiments (values entered by hand)

print('Experiment comparison:')
print('='*60)
print(f'exp_000 (XGBoost Baseline):     CV=0.8316, LB=0.7584')
print(f'exp_001 (Voting Ensemble):      CV=0.8372, LB=0.7727')
print(f'exp_002 (Stacking):             CV=0.8293, LB=N/A (not submitted)')
print('='*60)
print(f'\nCV-LB gap analysis:')
print(f'  exp_000: gap = 0.8316 - 0.7584 = 0.0732')
print(f'  exp_001: gap = 0.8372 - 0.7727 = 0.0645')
print(f'  Two-point line (indicative only): LB ≈ 2.5536*CV - 1.3652')
print(f'\nTo achieve LB 0.80, need CV ≈ {(0.80 + 1.3652) / 2.5536:.4f}')
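
The linear relation above is simply the line through the two submitted (CV, LB) points, so it can be recomputed rather than typed in by hand. Because the slope is steep, rounding the coefficients shifts the implied CV target in the third decimal place. A minimal sketch using the two score pairs printed above:

```python
import numpy as np

# (CV, LB) pairs for exp_000 and exp_001, taken from the comparison above
cv_scores = np.array([0.8316, 0.8372])
lb_scores = np.array([0.7584, 0.7727])

# A degree-1 fit through two points is exact, so it carries no error estimate
slope, intercept = np.polyfit(cv_scores, lb_scores, deg=1)

target_lb = 0.80
cv_needed = (target_lb - intercept) / slope
print(f"LB ≈ {slope:.4f}*CV {intercept:+.4f}")
print(f"CV needed for LB {target_lb:.2f}: {cv_needed:.4f}")
```

With only two submissions this is an extrapolation, not a calibration. A third submitted experiment would show whether the relation is even roughly linear.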

## 4. Key Insights for Next Experiment

### What's NOT working:
1. **Stacking** - CV 0.8293, below both the voting ensemble (0.8372) and the baseline (0.8316); the OOF predictions are likely too correlated for the meta-learner to add value
2. **More models** - Adding more similar models doesn't help
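
The correlation explanation for stacking can be checked directly by correlating the out-of-fold (OOF) probabilities of the base models. The sketch below runs on synthetic data from `make_classification`, and the two base models are placeholders, not the actual exp_002 configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the engineered Titanic features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Placeholder base models (assumption: not the real stacking line-up)
base_models = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "gb": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# OOF positive-class probabilities, one vector per base model
oof = {
    name: cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    for name, model in base_models.items()
}

# Pairwise Pearson correlation between the OOF vectors
oof_corr = np.corrcoef(np.vstack(list(oof.values())))
print(oof_corr.round(3))
```

Off-diagonal values close to 1 would support the explanation; noticeably lower values would point to a different cause, such as an overfit meta-learner.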

### What COULD work:
1. **Family/Ticket survival rate encoding** - Top kernel technique
2. **Outlier removal** - Remove 10 extreme outliers
3. **Better hyperparameter tuning** - GridSearchCV on best models
4. **Interaction features** - Sex × Pclass, etc.
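
As an illustration of the interaction idea, a Sex × Pclass feature can be built by concatenating the two columns and one-hot encoding the result. This is a sketch on a toy frame; the `Sex_Pclass` column name and the `SexPclass` prefix are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
})

# Combine the two categoricals into one label per (Sex, Pclass) pair
toy["Sex_Pclass"] = toy["Sex"] + "_" + toy["Pclass"].astype(str)

# One indicator column per observed combination
interaction = pd.get_dummies(toy["Sex_Pclass"], prefix="SexPclass")
print(interaction.columns.tolist())
```

Tree-based models can learn this interaction on their own, so the explicit feature is more likely to help linear members of an ensemble.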

### Priority:
1. Add Family_Survival_Rate and Ticket_Survival_Rate features
2. Try outlier removal
3. Tune hyperparameters with GridSearchCV
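
The tuning and soft-voting steps could be combined roughly as follows. This is a sketch on synthetic data; the parameter grid and the choice of a random forest plus logistic regression are placeholders, not the Top 4% kernel's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the engineered Titanic features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Small illustrative grid (assumption: a real search would be wider)
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},
    scoring="accuracy",
    cv=cv,
)
rf_search.fit(X, y)

# Soft voting averages predicted probabilities across the members
voter = VotingClassifier(
    estimators=[
        ("rf", rf_search.best_estimator_),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
voter.fit(X, y)
print(rf_search.best_params_, voter.predict_proba(X[:3]).shape)
```

Note that `best_score_` from the search is optimistically biased once it has been used to pick parameters, which is one possible contributor to a CV-LB gap.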

In [None]:
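# Quick test: does family survival rate have predictive power?
# Caveat: each passenger's own label is included in their group's mean, so the
# correlations below are optimistic (target leakage).
# Merge family survival rate back to train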
train_with_family = train.merge(family_survival, on='Surname', how='left')

# For passengers with family count > 1, check correlation
multi_family = train_with_family[train_with_family['Family_Count'] > 1]
print(f'Passengers in multi-member families: {len(multi_family)}')
print(f'Correlation between Family_Survival_Rate and Survived: {multi_family["Family_Survival_Rate"].corr(multi_family["Survived"]):.4f}')

# Same for ticket
train_with_ticket = train.merge(ticket_survival, on='Ticket', how='left')
multi_ticket = train_with_ticket[train_with_ticket['Ticket_Count'] > 1]
print(f'\nPassengers with shared tickets: {len(multi_ticket)}')
print(f'Correlation between Ticket_Survival_Rate and Survived: {multi_ticket["Ticket_Survival_Rate"].corr(multi_ticket["Survived"]):.4f}')
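
Because each group mean above includes the passenger's own `Survived` value, a leave-one-out (LOO) rate gives a fairer picture of predictive power. The sketch below uses a toy frame; `loo_group_rate` is a hypothetical helper, and leaving singleton groups as NaN is one possible choice among several.

```python
import pandas as pd

def loo_group_rate(df, key, target="Survived"):
    """Leave-one-out target mean per group (hypothetical helper).

    Each row's rate excludes its own label; singleton groups yield NaN.
    """
    grp = df.groupby(key)[target]
    group_sum = grp.transform("sum")
    group_count = grp.transform("count")
    # Mask singleton groups so they become NaN instead of dividing by zero
    denom = (group_count - 1).where(group_count > 1)
    return (group_sum - df[target]) / denom

toy = pd.DataFrame({
    "Surname": ["Smith", "Smith", "Smith", "Jones"],
    "Survived": [1, 1, 0, 1],
})
toy["Family_Survival_LOO"] = loo_group_rate(toy, "Surname")
print(toy)
```

Correlating this LOO column with `Survived` on the multi-member groups is the leakage-free counterpart of the quick test above. The NaN rows need a fill value (for example the global survival rate) before modeling.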