# Loop 1 Analysis: Understanding the Ceiling and Next Steps

## Current Status
- Best CV: 0.8316 (exp_000 - XGBoost baseline)
- No LB submissions yet
- Target: 1.0 (100% accuracy)

## Key Questions
1. What is the realistic ceiling for this competition?
2. What features haven't been explored yet?
3. What ensemble strategies should we prioritize?

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
import warnings
warnings.filterwarnings('ignore')

# Load data
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
print(f"\nTarget distribution:")
print(train['Survived'].value_counts())

Train shape: (891, 12)
Test shape: (418, 11)

Target distribution:
Survived
0    549
1    342
Name: count, dtype: int64


## Feature Engineering Opportunities Not Yet Explored

1. **Deck extraction from Cabin** - Only Has_Cabin was used, but Deck letter has predictive value
2. **Ticket features** - Ticket prefix and shared ticket frequency
3. **Age binning** - Only continuous Age was used; binned Age often helps
4. **Family survival patterns** - Passengers with same surname/ticket may have correlated survival
5. **Name length** - Longer names may indicate higher social status

In [2]:
# Analyze Cabin/Deck distribution
print("Cabin analysis:")
print(f"Missing: {train['Cabin'].isna().sum()} ({train['Cabin'].isna().mean()*100:.1f}%)")

# Extract deck from cabin
train['Deck'] = train['Cabin'].apply(lambda x: x[0] if pd.notna(x) else 'U')
print(f"\nDeck distribution:")
print(train['Deck'].value_counts())

# Survival by deck
print(f"\nSurvival rate by Deck:")
print(train.groupby('Deck')['Survived'].agg(['mean', 'count']).sort_values('mean', ascending=False))

Cabin analysis:
Missing: 687 (77.1%)

Deck distribution:
Deck
U    687
C     59
B     47
D     33
E     32
A     15
F     13
G      4
T      1
Name: count, dtype: int64

Survival rate by Deck:
          mean  count
Deck                 
D     0.757576     33
E     0.750000     32
B     0.744681     47
F     0.615385     13
C     0.593220     59
G     0.500000      4
A     0.466667     15
U     0.299854    687
T     0.000000      1


In [3]:
# Ticket analysis
print("Ticket analysis:")
print(f"Unique tickets: {train['Ticket'].nunique()}")

# Ticket frequency (shared tickets)
ticket_counts = train['Ticket'].value_counts()
print(f"\nTicket sharing distribution:")
print(ticket_counts.value_counts().sort_index())

# Add ticket frequency (counted within the training set only)
train['Ticket_Freq'] = train['Ticket'].map(ticket_counts)
print(f"\nSurvival by ticket frequency:")
print(train.groupby('Ticket_Freq')['Survived'].agg(['mean', 'count']).head(10))

Ticket analysis:
Unique tickets: 681

Ticket sharing distribution:
count
1    547
2     94
3     21
4     11
5      2
6      3
7      3
Name: count, dtype: int64

Survival by ticket frequency:
                 mean  count
Ticket_Freq                 
1            0.297989    547
2            0.574468    188
3            0.698413     63
4            0.500000     44
5            0.000000     10
6            0.000000     18
7            0.238095     21
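
The frequencies above count only training rows, so a ticket shared between a train passenger and a test passenger still shows up with frequency 1. A minimal sketch of one way to address this, assuming it is acceptable to use test-set ticket strings (never labels) for counting. The helper name `add_ticket_freq` and the toy frames are hypothetical, not part of the experiments so far:

```python
import pandas as pd

def add_ticket_freq(train_df, test_df, col="Ticket"):
    """Return copies of both frames with Ticket_Freq counted over train + test."""
    # Count each ticket across both sets; only the ticket strings are used, no labels.
    combined_counts = pd.concat([train_df[col], test_df[col]]).value_counts()
    train_out = train_df.copy()
    test_out = test_df.copy()
    train_out["Ticket_Freq"] = train_out[col].map(combined_counts)
    test_out["Ticket_Freq"] = test_out[col].map(combined_counts)
    return train_out, test_out

# Toy data, made up for illustration (not from the competition files)
toy_train = pd.DataFrame({"Ticket": ["A1", "A1", "B2"]})
toy_test = pd.DataFrame({"Ticket": ["B2", "C3"]})
toy_train_out, toy_test_out = add_ticket_freq(toy_train, toy_test)
print(toy_train_out["Ticket_Freq"].tolist(), toy_test_out["Ticket_Freq"].tolist())
```

On the real data this would be called as `add_ticket_freq(train, test)`. Whether the combined count beats the train-only count is something to check in CV rather than assume.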


In [4]:
# Extract ticket prefix
import re

def extract_ticket_prefix(ticket):
    """Extract prefix from ticket (letters before numbers)"""
    match = re.match(r'^([A-Za-z./]+)', ticket)
    if match:
        return match.group(1).replace('.', '').replace('/', '')
    return 'NONE'

train['Ticket_Prefix'] = train['Ticket'].apply(extract_ticket_prefix)
print("Ticket prefix distribution:")
print(train['Ticket_Prefix'].value_counts().head(15))

print(f"\nSurvival by ticket prefix (top 10):")
prefix_survival = train.groupby('Ticket_Prefix')['Survived'].agg(['mean', 'count'])
print(prefix_survival[prefix_survival['count'] >= 10].sort_values('mean', ascending=False).head(10))

Ticket prefix distribution:
Ticket_Prefix
NONE       661
PC          60
CA          41
A           28
STONO       18
SOTONOQ     15
WC          10
SCPARIS      7
SOC          6
C            5
FCC          5
LINE         4
SCParis      4
SOPP         3
SCAH         3
Name: count, dtype: int64

Survival by ticket prefix (top 10):
                   mean  count
Ticket_Prefix                 
PC             0.650000     60
STONO          0.444444     18
NONE           0.384266    661
CA             0.341463     41
SOTONOQ        0.133333     15
WC             0.100000     10
A              0.071429     28
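
The prefix distribution above lists `SCPARIS` (7) and `SCParis` (4) as separate categories, which differ only in case, and most prefixes have very few rows. A possible cleanup, sketched under the assumption that case carries no information and that sparse prefixes can be pooled. `normalize_ticket_prefix`, `group_rare`, and the `min_count` threshold are hypothetical choices:

```python
import re
import pandas as pd

def normalize_ticket_prefix(ticket):
    """Letters before the ticket number, upper-cased, with '.' and '/' removed."""
    match = re.match(r"^([A-Za-z./]+)", ticket)
    if match:
        return match.group(1).replace(".", "").replace("/", "").upper()
    return "NONE"

def group_rare(prefixes, min_count=10):
    """Replace prefixes seen fewer than min_count times with 'RARE'."""
    counts = prefixes.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; overwrite the rest with a pooled label.
    return prefixes.where(~prefixes.isin(rare), "RARE")

# Toy tickets in the same formats as the dataset (values chosen for illustration)
toy_tickets = pd.Series(["SC/PARIS 2131", "SC/Paris 2123", "PC 17599", "113803", "349909"])
toy_prefix = toy_tickets.apply(normalize_ticket_prefix)
toy_grouped = group_rare(toy_prefix, min_count=2)
print(toy_grouped.tolist())
```

The pooling threshold trades category count against signal. The value of 10 mirrors the `count >= 10` filter used in the survival table, but it is not tuned.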


In [5]:
# Family surname analysis
train['Surname'] = train['Name'].apply(lambda x: x.split(',')[0])
print(f"Unique surnames: {train['Surname'].nunique()}")

# Surname frequency
surname_counts = train['Surname'].value_counts()
print(f"\nSurname frequency distribution:")
print(surname_counts.value_counts().sort_index().head(10))

# Surnames shared by more than one passenger (a proxy for families;
# unrelated passengers can share a surname)
shared_surnames = surname_counts[surname_counts > 1]
print(f"\nFamilies with >1 member: {len(shared_surnames)}")

# Survival patterns in families
train['Surname_Freq'] = train['Surname'].map(surname_counts)
print(f"\nSurvival by surname frequency:")
print(train.groupby('Surname_Freq')['Survived'].agg(['mean', 'count']).head(10))

Unique surnames: 667

Surname frequency distribution:
count
1    534
2     83
3     28
4     14
5      1
6      5
7      1
9      1
Name: count, dtype: int64

Families with >1 member: 133

Survival by surname frequency:
                  mean  count
Surname_Freq                 
1             0.359551    534
2             0.524096    166
3             0.357143     84
4             0.428571     56
5             0.000000      5
6             0.233333     30
7             0.000000      7
9             0.222222      9
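
Surname frequency only measures group size. The "family survival patterns" idea needs the survival of the *other* group members. A minimal leave-one-out sketch, assuming surname is an adequate family proxy and that 0.5 is a reasonable neutral value for passengers with no known relatives. The function name and the default are hypothetical:

```python
import pandas as pd

def family_survival_loo(df, group_col="Surname", target_col="Survived", default=0.5):
    """Mean survival of the other members of each passenger's group.

    The passenger's own label is excluded so the feature does not leak the target.
    Passengers with no other group member get `default`.
    """
    grouped = df.groupby(group_col)[target_col]
    group_sum = grouped.transform("sum")
    group_size = grouped.transform("count")
    others = group_size - 1
    # Division by NaN (where there are no other members) yields NaN, filled below.
    loo_rate = (group_sum - df[target_col]) / others.where(others > 0)
    return loo_rate.fillna(default)

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Surname": ["Smith", "Smith", "Smith", "Jones"],
    "Survived": [1, 0, 1, 0],
})
toy["Family_Survival"] = family_survival_loo(toy)
print(toy["Family_Survival"].tolist())
```

This only covers rows with known labels. For validation folds and the test set, the rates would have to come from the training portion alone, otherwise the CV score will be optimistic.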


In [6]:
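# Age binning analysis
# Bins are right-closed: (0, 16], (16, 32], (32, 48], (48, 64], (64, 100].
# Rows with missing Age stay NaN and are excluded from the groupby below.
train['Age_Bin'] = pd.cut(
    train['Age'],
    bins=[0, 16, 32, 48, 64, 100],
    labels=['Child', 'Young', 'Middle', 'Senior', 'Elderly'],
)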
print("Survival by Age bin:")
print(train.groupby('Age_Bin')['Survived'].agg(['mean', 'count']))

Survival by Age bin:
             mean  count
Age_Bin                 
Child    0.550000    100
Young    0.369942    346
Middle   0.404255    188
Senior   0.434783     69
Elderly  0.090909     11
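
The bin counts above sum to 714 of 891 rows, because `pd.cut` leaves passengers with a missing Age unbinned. One option, sketched here as an assumption rather than a tested choice, is to give those rows an explicit `Missing` category so the feature can be used without imputation. `bin_age_with_missing` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

AGE_BINS = [0, 16, 32, 48, 64, 100]
AGE_LABELS = ["Child", "Young", "Middle", "Senior", "Elderly"]

def bin_age_with_missing(age):
    """Bin Age with the same edges as above and label NaN ages as 'Missing'."""
    binned = pd.cut(age, bins=AGE_BINS, labels=AGE_LABELS)
    # The new category must be registered before it can be used as a fill value.
    return binned.cat.add_categories("Missing").fillna("Missing")

# Toy ages, made up for illustration
toy_age = pd.Series([4.0, 22.0, np.nan, 70.0])
toy_bins = bin_age_with_missing(toy_age)
print(toy_bins.astype(str).tolist())
```

On the real data this would be `train['Age_Bin'] = bin_age_with_missing(train['Age'])`. A separate `Missing` level also lets the model pick up any signal in the missingness itself.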


In [7]:
# Name length analysis
train['Name_Length'] = train['Name'].apply(len)
print(f"Name length stats:")
print(train['Name_Length'].describe())

# Correlation with survival
print(f"\nCorrelation of Name_Length with Survived: {train['Name_Length'].corr(train['Survived']):.4f}")

# Binned name length
train['Name_Length_Bin'] = pd.qcut(train['Name_Length'], q=4, labels=['Short', 'Medium', 'Long', 'VeryLong'])
print(f"\nSurvival by Name Length bin:")
print(train.groupby('Name_Length_Bin')['Survived'].agg(['mean', 'count']))

Name length stats:
count    891.000000
mean      26.965208
std        9.281607
min       12.000000
25%       20.000000
50%       25.000000
75%       30.000000
max       82.000000
Name: Name_Length, dtype: float64

Correlation of Name_Length with Survived: 0.3323

Survival by Name Length bin:
                     mean  count
Name_Length_Bin                 
Short            0.230453    243
Medium           0.325581    215
Long             0.364929    211
VeryLong         0.626126    222


## Summary of Feature Engineering Opportunities

Based on analysis:
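1. **Deck** - Strong signal: Decks B, D, and E have ~74-76% survival vs ~30% for U (unknown cabin); decks G, A, and T have too few passengers (4, 15, 1) to trust their rates
2. **Ticket_Freq** - Moderate, non-monotonic signal: tickets shared by 2-3 passengers show 57-70% survival vs 30% for solo tickets, while groups of 5-7 drop to 0-24%
3. **Ticket_Prefix** - Some signal (PC at 65%, A at 7%) but many sparse categories
4. **Name_Length** - Moderate correlation (0.33) with survival; the longest quartile survives at 63% vs 23% for the shortest
5. **Age_Bin** - Children (Age <= 16) have the highest survival of any bin (55%)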

## Next Steps Priority
1. Submit baseline to get LB feedback
2. Add Deck feature (quick win)
3. Build voting ensemble with diverse models
4. Consider stacking approach
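
For step 3, a minimal sketch of a soft-voting ensemble evaluated with stratified CV. It assumes scikit-learn only, so `GradientBoostingClassifier` stands in for the XGBoost baseline, and it runs on synthetic data from `make_classification` rather than the engineered Titanic features. The model choices and hyperparameters are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_voting_ensemble(random_state=42):
    """Soft-voting ensemble over a linear model and two tree ensembles."""
    estimators = [
        # Logistic regression benefits from scaled inputs; the trees do not need them.
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=random_state)),
        ("gb", GradientBoostingClassifier(random_state=random_state)),
    ]
    # Soft voting averages predicted probabilities instead of counting class votes.
    return VotingClassifier(estimators=estimators, voting="soft")

# Synthetic stand-in for the feature matrix and Survived labels
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(build_voting_ensemble(), X_toy, y_toy, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Mixing a linear model with tree ensembles is meant to give the diversity the plan calls for. The same `cv` object can be reused for the stacking experiment in step 4 so the scores stay comparable with the 0.8316 baseline.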