# Loop 1 Analysis: Understanding the Ceiling and Next Steps

## Current Status
- Best CV: 0.8316 (exp_000 - XGBoost baseline)
- No LB submissions yet
- Target: 1.0 (100% accuracy)

## Key Questions
1. What is the realistic ceiling for this competition?
2. What features haven't been explored yet?
3. What ensemble strategies should we prioritize?
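
Before tackling question 1, it helps to know how much of the current CV score could be noise. The sketch below computes a binomial standard error for the 0.8316 score. It assumes the metric is plain accuracy and that the training set has 891 rows (the usual Titanic `train.csv` size, not stated above), and it treats out-of-fold predictions as independent, which is only approximately true.

```python
import math

# Assumptions (not stated in the notebook): the CV score is plain accuracy
# and the training set has 891 rows.
cv_accuracy = 0.8316
n_train = 891

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n)
std_err = math.sqrt(cv_accuracy * (1 - cv_accuracy) / n_train)

# Rough 95% interval using the normal approximation
low = cv_accuracy - 1.96 * std_err
high = cv_accuracy + 1.96 * std_err

print(f"Standard error: {std_err:.4f}")
print(f"Approx. 95% interval: [{low:.4f}, {high:.4f}]")
```

Under these assumptions, CV differences much smaller than one standard error are hard to distinguish from noise, which is worth keeping in mind when comparing feature sets.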

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
import warnings
warnings.filterwarnings('ignore')

# Load data
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
print(f"\nTarget distribution:")
print(train['Survived'].value_counts())

## Feature Engineering Opportunities Not Yet Explored

1. **Deck extraction from Cabin** - Only Has_Cabin has been used so far; the deck letter (first character of Cabin) may carry additional signal
2. **Ticket features** - Ticket prefix and shared-ticket frequency
3. **Age binning** - Age has only been used as a continuous value; binned Age often helps
4. **Family survival patterns** - Passengers sharing a surname or ticket may have correlated survival
5. **Name length** - Longer names may indicate higher social status

In [None]:
# Analyze Cabin/Deck distribution
print("Cabin analysis:")
print(f"Missing: {train['Cabin'].isna().sum()} ({train['Cabin'].isna().mean()*100:.1f}%)")

# Extract deck from cabin: the first character of Cabin is the deck letter;
# 'U' marks passengers with no recorded cabin
train['Deck'] = train['Cabin'].str[0].fillna('U')
print(f"\nDeck distribution:")
print(train['Deck'].value_counts())

# Survival by deck
print(f"\nSurvival rate by Deck:")
print(train.groupby('Deck')['Survived'].agg(['mean', 'count']).sort_values('mean', ascending=False))

In [None]:
# Ticket analysis
print("Ticket analysis:")
print(f"Unique tickets: {train['Ticket'].nunique()}")

# Ticket frequency (shared tickets)
ticket_counts = train['Ticket'].value_counts()
print(f"\nTicket sharing distribution:")
print(ticket_counts.value_counts().sort_index())

# Add ticket frequency (counts come from train only, so tickets shared
# with test passengers are undercounted here)
train['Ticket_Freq'] = train['Ticket'].map(ticket_counts)
print(f"\nSurvival by ticket frequency:")
print(train.groupby('Ticket_Freq')['Survived'].agg(['mean', 'count']).head(10))
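
Ticket_Freq above is counted on train alone. Since the count uses no labels, one option is to count tickets across train and test together so that groups split between the two sets are sized correctly. A minimal sketch on toy data (the ticket values and the `train_toy`/`test_toy` names are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for train/test; ticket values are illustrative only
train_toy = pd.DataFrame({"Ticket": ["A1", "A1", "B2", "C3"]})
test_toy = pd.DataFrame({"Ticket": ["A1", "C3", "D4"]})

# Count each ticket across both sets (no labels involved, so no target leakage)
all_tickets = pd.concat([train_toy["Ticket"], test_toy["Ticket"]], ignore_index=True)
ticket_counts_all = all_tickets.value_counts()

# Map the combined counts back onto each set
train_toy["Ticket_Freq"] = train_toy["Ticket"].map(ticket_counts_all)
test_toy["Ticket_Freq"] = test_toy["Ticket"].map(ticket_counts_all)

print(train_toy)
print(test_toy)
```

The same pattern should carry over to the real `train` and `test` frames loaded earlier.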

In [None]:
# Extract ticket prefix
import re

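def extract_ticket_prefix(ticket):
    """Return the leading letters of a ticket with '.' and '/' removed.

    Purely numeric tickets return 'NONE'. Digits end the prefix, so
    'A/5 21171' and 'A/4 7534' both map to 'A'.
    """
    match = re.match(r'^([A-Za-z./]+)', ticket)
    if match:
        return match.group(1).replace('.', '').replace('/', '')
    return 'NONE'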

train['Ticket_Prefix'] = train['Ticket'].apply(extract_ticket_prefix)
print("Ticket prefix distribution:")
print(train['Ticket_Prefix'].value_counts().head(15))

print(f"\nSurvival by ticket prefix (top 10):")
prefix_survival = train.groupby('Ticket_Prefix')['Survived'].agg(['mean', 'count'])
print(prefix_survival[prefix_survival['count'] >= 10].sort_values('mean', ascending=False).head(10))
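
Many prefixes are likely to have only a handful of passengers, which makes their survival rates noisy. One common way to handle this is to merge infrequent categories into a single bucket before encoding. The helper below is a hypothetical sketch (`group_rare` and the `'RARE'` label are not part of the notebook); the threshold of 10 mirrors the `count >= 10` filter used above.

```python
import pandas as pd

def group_rare(series, min_count=10, other_label="RARE"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; substitute the label everywhere else
    return series.where(~series.isin(rare), other_label)

# Toy prefixes for illustration only
prefixes = pd.Series(["NONE"] * 12 + ["PC"] * 10 + ["CA"] * 3 + ["SOTONOQ"] * 2)
grouped = group_rare(prefixes, min_count=10)
print(grouped.value_counts())
```

On the real data this would be applied as `group_rare(train['Ticket_Prefix'])`, ideally with the rare set determined on training folds only.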

In [None]:
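# Family surname analysis
# Names look like "Surname, Title. Given names", so the surname is the text before the first comma
train['Surname'] = train['Name'].str.split(',').str[0].str.strip()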
print(f"Unique surnames: {train['Surname'].nunique()}")

# Surname frequency
surname_counts = train['Surname'].value_counts()
print(f"\nSurname frequency distribution:")
print(surname_counts.value_counts().sort_index().head(10))

# Surnames shared by more than one passenger (a proxy for families;
# unrelated passengers can share a surname)
shared_surnames = surname_counts[surname_counts > 1]
print(f"\nSurnames shared by >1 passenger: {len(shared_surnames)}")

# Survival patterns in families
train['Surname_Freq'] = train['Surname'].map(surname_counts)
print(f"\nSurvival by surname frequency:")
print(train.groupby('Surname_Freq')['Survived'].agg(['mean', 'count']).head(10))
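
Surname frequency only measures group size. To test whether survival is actually correlated within a group, one option is a leave-one-out group survival rate: for each passenger, the mean outcome of the other members of the same group. The sketch below is an assumption about how this could be done, not something the notebook has run; `loo_group_survival` and the 0.5 default for passengers with no group members are hypothetical choices.

```python
import pandas as pd

def loo_group_survival(df, group_col, target_col="Survived", default=0.5):
    """Mean target of the *other* rows in each group; default when alone."""
    grp = df.groupby(group_col)[target_col]
    total = grp.transform("sum")
    size = grp.transform("count")
    others = size - 1
    # Exclude the row's own label; groups of one get NaN, then the default
    loo = (total - df[target_col]) / others.where(others > 0)
    return loo.fillna(default)

# Toy data for illustration only
toy = pd.DataFrame({
    "Surname": ["Smith", "Smith", "Smith", "Jones", "Brown", "Brown"],
    "Survived": [1, 1, 0, 1, 0, 0],
})
toy["Family_Survival"] = loo_group_survival(toy, "Surname")
print(toy)
```

Because the feature is built from labels, it should be computed inside each CV fold from the training portion only; otherwise the CV score will be optimistic.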

In [None]:
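# Age binning analysis
# Bins are right-inclusive: (0, 16], (16, 32], ...; rows with missing Age stay NaN
age_bins = [0, 16, 32, 48, 64, 100]
age_labels = ['Child', 'Young', 'Middle', 'Senior', 'Elderly']
train['Age_Bin'] = pd.cut(train['Age'], bins=age_bins, labels=age_labels)
print("Survival by Age bin:")
# observed=False keeps every bin in the output, including empty ones
print(train.groupby('Age_Bin', observed=False)['Survived'].agg(['mean', 'count']))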
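
`pd.cut` leaves missing Age values as NaN, so those passengers drop out of the table above. If Age has gaps, one option is to give them an explicit category so that missingness itself can be used as a signal. A small sketch on toy ages (the `'Missing'` label is an illustrative choice):

```python
import numpy as np
import pandas as pd

# Toy ages for illustration, including one missing value
ages = pd.Series([4, 22, 40, 55, 70, np.nan])

age_bin = pd.cut(ages, bins=[0, 16, 32, 48, 64, 100],
                 labels=['Child', 'Young', 'Middle', 'Senior', 'Elderly'])

# A categorical must know about a label before fillna can use it
age_bin = age_bin.cat.add_categories('Missing').fillna('Missing')
print(age_bin.value_counts())
```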

In [None]:
# Name length analysis
train['Name_Length'] = train['Name'].apply(len)
print(f"Name length stats:")
print(train['Name_Length'].describe())

# Correlation with survival
print(f"\nCorrelation of Name_Length with Survived: {train['Name_Length'].corr(train['Survived']):.4f}")

# Binned name length
train['Name_Length_Bin'] = pd.qcut(train['Name_Length'], q=4, labels=['Short', 'Medium', 'Long', 'VeryLong'])
print(f"\nSurvival by Name Length bin:")
print(train.groupby('Name_Length_Bin', observed=False)['Survived'].agg(['mean', 'count']))

## Summary of Feature Engineering Opportunities

Based on analysis:
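1. **Deck** - Strong signal: Decks B, D, and E show ~74-75% survival versus ~30% for U (unknown). Part of this gap may already be captured by Has_Cabin, so the gain from the deck letter itself needs a CV check
2. **Ticket_Freq** - Moderate signal: survival rate varies with the number of passengers sharing a ticket
3. **Ticket_Prefix** - Some signal, but spread across many categories, several of them sparse
4. **Name_Length** - Weak correlation (~0.11) but worth trying
5. **Age_Bin** - Children (age 16 and under) have a higher survival rate (58%)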

## Next Steps Priority
1. Submit baseline to get LB feedback
2. Add Deck feature (quick win)
3. Build voting ensemble with diverse models
4. Consider stacking approach
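
Steps 3 and 4 can be compared under the same CV scheme. The sketch below uses scikit-learn's `VotingClassifier` and `StackingClassifier` on synthetic data standing in for the engineered feature matrix. The choice of base models and their hyperparameters is an assumption for illustration; the XGBoost baseline could be added as another entry in `base_models`, and is left out here only to keep the sketch dependent on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered Titanic features (illustration only)
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           random_state=42)

# Diverse base models: linear, tree-based, and distance-based
base_models = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))),
]

# Soft voting averages predicted probabilities across the base models
voting = VotingClassifier(estimators=base_models, voting="soft")

# Stacking fits a meta-model on out-of-fold base-model predictions
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = {}
for name, model in [("voting", voting), ("stacking", stacking)]:
    scores[name] = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores[name].mean():.4f} +/- {scores[name].std():.4f}")
```

Reporting the fold-to-fold spread next to the mean makes it easier to judge whether either ensemble is a real improvement over the 0.8316 baseline.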