Titanic Survival Prediction - Exploratory Data Analysis
========================================================
This notebook performs exploratory data analysis (EDA) on the Titanic training set and saves each figure to ../reports/figures/.

In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [41]:
# Set style for all plots
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

Load the Data

In [42]:
# Load data (handle UTF-16 encoding from sample files)
try:
    train_df = pd.read_csv('../data/raw/train.csv', encoding='utf-16')
except UnicodeError:
    # Fallback for a standard (UTF-8) CSV; a bare except would also hide unrelated errors
    train_df = pd.read_csv('../data/raw/train.csv')

Dataset Overview

In [43]:
print(f"\nDataset Shape: {train_df.shape[0]} rows, {train_df.shape[1]} columns")
print(f"\nColumn Names & Types:")
print(train_df.dtypes)

print(f"\n--- First 5 Rows ---")
print(train_df.head())

print(f"\n--- Statistical Summary (Numerical) ---")
print(train_df.describe())

print(f"\n--- Statistical Summary (Categorical) ---")
print(train_df.describe(include=['object']))


Dataset Shape: 891 rows, 12 columns

Column Names & Types:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

--- First 5 Rows ---
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                         

Missing Value Analysis

In [44]:
missing = train_df.isnull().sum()
missing_pct = (missing / len(train_df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
}).sort_values('Missing Count', ascending=False)

print("\nMissing Values by Column:")
print(missing_df[missing_df['Missing Count'] > 0])

# Visualization: Missing Values
fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#e74c3c' if x > 0 else '#2ecc71' for x in missing.values]
missing.plot(kind='bar', ax=ax, color=colors)
ax.set_title('Missing Values by Feature', fontsize=14, fontweight='bold')
ax.set_xlabel('Feature')
ax.set_ylabel('Count of Missing Values')
ax.axhline(y=0, color='black', linewidth=0.5)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('../reports/figures/missing_by_feature.png')
plt.close()


Missing Values by Column:
          Missing Count  Missing %
Cabin               687      77.10
Age                 177      19.87
Embarked              2       0.22


Target Variable Analysis

In [45]:
survival_counts = train_df['Survived'].value_counts()
survival_pct = train_df['Survived'].value_counts(normalize=True) * 100

print(f"\nSurvival Distribution:")
print(f"  Did Not Survive (0): {survival_counts[0]} ({survival_pct[0]:.1f}%)")
print(f"  Survived (1):        {survival_counts[1]} ({survival_pct[1]:.1f}%)")
print(f"\nOverall Survival Rate: {survival_pct[1]:.1f}%")

# Visualization: Survival Distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Count plot
colors = ['#e74c3c', '#2ecc71']
survival_counts.plot(kind='bar', ax=axes[0], color=colors)
axes[0].set_title('Survival Count', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Survived')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['No (0)', 'Yes (1)'], rotation=0)
for i, v in enumerate(survival_counts.values):
    axes[0].text(i, v + 10, str(v), ha='center', fontweight='bold')

# Pie chart
axes[1].pie(survival_counts, labels=['Did Not Survive', 'Survived'], 
            autopct='%1.1f%%', colors=colors, explode=(0.02, 0.02),
            shadow=True, startangle=90)
axes[1].set_title('Survival Proportion', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.savefig('../reports/figures/survival_distribution.png')
plt.close()


Survival Distribution:
  Did Not Survive (0): 549 (61.6%)
  Survived (1):        342 (38.4%)

Overall Survival Rate: 38.4%
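
Because 61.6% of passengers did not survive, a model that always predicts class 0 would already reach about 61.6% accuracy. The sketch below illustrates that majority-class baseline. It rebuilds the label column from the counts printed above instead of reading the CSV, and the variable names are illustrative rather than part of the notebook.

```python
import pandas as pd

# Rebuild the target column from the counts printed above (549 zeros, 342 ones).
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# A majority-class baseline always predicts the most frequent label.
majority_label = survived.mode()[0]
baseline_accuracy = (survived == majority_label).mean()

print(f"Majority label: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained on this data should be judged against that figure rather than against 50%.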


Survival by Categorical Features

In [46]:
# Survival by sex

sex_survival = train_df.groupby('Sex')['Survived'].agg(['sum','count'])
sex_survival['survival_rate'] = (sex_survival['sum'] / sex_survival['count'] * 100).round(1)
sex_survival.columns = ['Survived', 'Total', 'Survival Rate %']
print(sex_survival)

# Survival by Passenger Class

class_survival = train_df.groupby('Pclass')['Survived'].agg(['sum', 'count'])
class_survival['survival_rate'] = (class_survival['sum'] / class_survival['count'] * 100).round(1)
class_survival.columns = ['Survived', 'Total', 'Survival Rate %']
print(class_survival)

# Survival by Embarkation Port

embarked_survival = train_df.groupby('Embarked')['Survived'].agg(['sum', 'count'])
embarked_survival['survival_rate'] = (embarked_survival['sum'] / embarked_survival['count'] * 100).round(1)
embarked_survival.columns = ['Survived', 'Total', 'Survival Rate %']
print(embarked_survival)
print("(C=Cherbourg, Q=Queenstown, S=Southampton)")

        Survived  Total  Survival Rate %
Sex                                     
female       233    314             74.2
male         109    577             18.9
        Survived  Total  Survival Rate %
Pclass                                  
1            136    216             63.0
2             87    184             47.3
3            119    491             24.2
          Survived  Total  Survival Rate %
Embarked                                  
C               93    168             55.4
Q               30     77             39.0
S              217    644             33.7
(C=Cherbourg, Q=Queenstown, S=Southampton)
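
The three summaries above repeat the same group-aggregate-rename pattern. One possible way to factor it out is a small helper such as the hypothetical `survival_table` below. The toy frame is made up purely to show the call and is not the Titanic data.

```python
import pandas as pd

def survival_table(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Return survivors, group size, and survival rate (%) for each level of `by`."""
    table = df.groupby(by)["Survived"].agg(Survived="sum", Total="count")
    table["Survival Rate %"] = (table["Survived"] / table["Total"] * 100).round(1)
    return table

# Tiny made-up frame, only to demonstrate the helper.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0],
})
result = survival_table(toy, "Sex")
print(result)
```

With the real data, the same helper could be called as `survival_table(train_df, 'Pclass')` or `survival_table(train_df, 'Embarked')`.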


Visualizations of Categorial Features

In [47]:
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Row 1: Count plots
sns.countplot(data=train_df, x='Sex', hue='Survived', ax=axes[0, 0], palette=['#e74c3c', '#2ecc71'])
axes[0, 0].set_title('Survival by Sex (Counts)', fontweight='bold')
axes[0, 0].legend(['No', 'Yes'], title='Survived')

sns.countplot(data=train_df, x='Pclass', hue='Survived', ax=axes[0, 1], palette=['#e74c3c', '#2ecc71'])
axes[0, 1].set_title('Survival by Class (Counts)', fontweight='bold')
axes[0, 1].legend(['No', 'Yes'], title='Survived')

sns.countplot(data=train_df, x='Embarked', hue='Survived', ax=axes[0, 2], palette=['#e74c3c', '#2ecc71'])
axes[0, 2].set_title('Survival by Embarkation (Counts)', fontweight='bold')
axes[0, 2].legend(['No', 'Yes'], title='Survived')

# Row 2: Survival rate plots
sex_rates = train_df.groupby('Sex')['Survived'].mean() * 100
sex_rates.plot(kind='bar', ax=axes[1, 0], color=['#3498db', '#e91e63'])
axes[1, 0].set_title('Survival Rate by Sex', fontweight='bold')
axes[1, 0].set_ylabel('Survival Rate (%)')
axes[1, 0].set_xticklabels(['Female', 'Male'], rotation=0)
axes[1, 0].axhline(y=survival_pct[1], color='red', linestyle='--', label=f'Overall: {survival_pct[1]:.1f}%')
for i, v in enumerate(sex_rates.values):
    axes[1, 0].text(i, v + 2, f'{v:.1f}%', ha='center', fontweight='bold')

class_rates = train_df.groupby('Pclass')['Survived'].mean() * 100
class_rates.plot(kind='bar', ax=axes[1, 1], color=['#f1c40f', '#9b59b6', '#1abc9c'])
axes[1, 1].set_title('Survival Rate by Class', fontweight='bold')
axes[1, 1].set_ylabel('Survival Rate (%)')
axes[1, 1].set_xticklabels(['1st', '2nd', '3rd'], rotation=0)
axes[1, 1].axhline(y=survival_pct[1], color='red', linestyle='--', label=f'Overall: {survival_pct[1]:.1f}%')
for i, v in enumerate(class_rates.values):
    axes[1, 1].text(i, v + 2, f'{v:.1f}%', ha='center', fontweight='bold')

embarked_rates = train_df.groupby('Embarked')['Survived'].mean() * 100
embarked_rates.plot(kind='bar', ax=axes[1, 2], color=['#e67e22', '#27ae60', '#2980b9'])
axes[1, 2].set_title('Survival Rate by Embarkation', fontweight='bold')
axes[1, 2].set_ylabel('Survival Rate (%)')
axes[1, 2].axhline(y=survival_pct[1], color='red', linestyle='--', label=f'Overall: {survival_pct[1]:.1f}%')
for i, v in enumerate(embarked_rates.values):
    axes[1, 2].text(i, v + 2, f'{v:.1f}%', ha='center', fontweight='bold')
    
plt.tight_layout()
plt.savefig('../reports/figures/survival_categorical.png')
plt.close()

Survival by Numerical Features

In [48]:
# Age Analysis

print(f"Mean Age: {train_df['Age'].mean():.1f}")
print(f"Median Age: {train_df['Age'].median():.1f}")
print(f"Age Range: {train_df['Age'].min():.1f} - {train_df['Age'].max():.1f}")
print(f"Missing Ages: {train_df['Age'].isnull().sum()} ({train_df['Age'].isnull().sum()/len(train_df)*100:.1f}%)")

# Create age groups 

train_df['AgeGroup'] = pd.cut(train_df['Age'], 
                              bins=[0, 12, 18, 35, 50, 65, 100],
                              labels=['Child (0-12)', 'Teen (13-18)', 'Young Adult (19-35)', 
                                     'Middle Age (36-50)', 'Senior (51-65)', 'Elderly (65+)'])


age_survival = train_df.groupby('AgeGroup')['Survived'].agg(['sum', 'count'])
age_survival['survival_rate'] = (age_survival['sum'] / age_survival['count'] * 100).round(1)
age_survival.columns = ['Survived', 'Total', 'Survival Rate %']
print(age_survival)

Mean Age: 29.7
Median Age: 28.0
Age Range: 0.4 - 80.0
Missing Ages: 177 (19.9%)
                     Survived  Total  Survival Rate %
AgeGroup                                             
Child (0-12)               40     69             58.0
Teen (13-18)               30     70             42.9
Young Adult (19-35)       137    358             38.3
Middle Age (36-50)         61    153             39.9
Senior (51-65)             21     56             37.5
Elderly (65+)               1      8             12.5
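
`pd.cut` builds right-closed intervals by default, so the bins above are (0, 12], (12, 18], and so on. The sketch below uses a few hand-picked boundary ages (not rows from the dataset) to show where they land. Note that an age of exactly 65 falls in the Senior group even though the last label reads "65+", and an age of exactly 0 would fall outside every bin.

```python
import pandas as pd

bins = [0, 12, 18, 35, 50, 65, 100]
labels = ["Child (0-12)", "Teen (13-18)", "Young Adult (19-35)",
          "Middle Age (36-50)", "Senior (51-65)", "Elderly (65+)"]

# Boundary ages chosen for illustration; they are not rows from the dataset.
ages = pd.Series([0, 0.5, 12, 12.5, 65, 66, 80])
groups = pd.cut(ages, bins=bins, labels=labels)

for age, group in zip(ages, groups):
    print(f"{age:>5} -> {group}")
```

If ages of 0 ever need to be kept, passing `include_lowest=True` to `pd.cut` is one option.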


In [49]:
# Fare Analysis

print(f"Mean Fare: ${train_df['Fare'].mean():.2f}")
print(f"Median Fare: ${train_df['Fare'].median():.2f}")
print(f"Fare Range: ${train_df['Fare'].min():.2f} - ${train_df['Fare'].max():.2f}")


Mean Fare: $32.20
Median Fare: $14.45
Fare Range: $0.00 - $512.33
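
The mean fare (32.20) is more than double the median (14.45), which points to a strongly right-skewed distribution. A log transform is one common way to compress such a tail. The sketch below applies `np.log1p` to a few illustrative values taken from the summary above (they are summary statistics, not individual passenger rows).

```python
import numpy as np

# Illustrative fares spanning the printed range (min 0.00, median 14.45, mean 32.20, max 512.33).
fares = np.array([0.00, 14.45, 32.20, 512.33])

# log1p(x) = log(1 + x), so a fare of 0 maps to 0 instead of -inf.
log_fares = np.log1p(fares)

for fare, log_fare in zip(fares, log_fares):
    print(f"{fare:8.2f} -> {log_fare:.3f}")
```

Whether the transformed fare helps a model is something to check empirically; tree-based models, for example, are insensitive to monotonic transforms.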


Visualizations for Numerical Features

In [50]:
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Age distribution by survival
sns.histplot(data=train_df, x='Age', hue='Survived', kde=True, ax=axes[0, 0], 
             palette=['#e74c3c', '#2ecc71'], alpha=0.6)
axes[0, 0].set_title('Age Distribution by Survival', fontweight='bold')
axes[0, 0].legend(['No', 'Yes'], title='Survived')

# Age boxplot by survival
sns.boxplot(data=train_df, x='Survived', y='Age', ax=axes[0, 1], palette=['#e74c3c', '#2ecc71'])
axes[0, 1].set_title('Age by Survival Status', fontweight='bold')
axes[0, 1].set_xticklabels(['No', 'Yes'])

# Survival rate by age group
age_rates = train_df.groupby('AgeGroup')['Survived'].mean() * 100
age_rates.plot(kind='bar', ax=axes[0, 2], color='#3498db')
axes[0, 2].set_title('Survival Rate by Age Group', fontweight='bold')
axes[0, 2].set_ylabel('Survival Rate (%)')
axes[0, 2].tick_params(axis='x', rotation=45)
axes[0, 2].axhline(y=survival_pct[1], color='red', linestyle='--')
for i, v in enumerate(age_rates.values):
    if not np.isnan(v):
        axes[0, 2].text(i, v + 2, f'{v:.0f}%', ha='center', fontsize=9)

# Fare distribution by survival (x-axis limited to 0-150 to focus on the main distribution)
sns.histplot(data=train_df, x='Fare', hue='Survived', kde=True, ax=axes[1, 0], 
             palette=['#e74c3c', '#2ecc71'], alpha=0.6)
axes[1, 0].set_title('Fare Distribution by Survival', fontweight='bold')
axes[1, 0].set_xlim(0, 150)  # Focus on main distribution
axes[1, 0].legend(['No', 'Yes'], title='Survived')

# Fare boxplot by survival
sns.boxplot(data=train_df, x='Survived', y='Fare', ax=axes[1, 1], palette=['#e74c3c', '#2ecc71'])
axes[1, 1].set_title('Fare by Survival Status', fontweight='bold')
axes[1, 1].set_xticklabels(['No', 'Yes'])
axes[1, 1].set_ylim(0, 150)

# Fare boxplot by class
sns.boxplot(data=train_df, x='Pclass', y='Fare', ax=axes[1, 2], palette='viridis')
axes[1, 2].set_title('Fare Distribution by Class', fontweight='bold')
axes[1, 2].set_xticklabels(['1st', '2nd', '3rd'])
axes[1, 2].set_ylim(0, 200)

plt.tight_layout()
plt.savefig('../reports/figures/survival_numerical.png')
plt.close()

Family Size Analysis

In [51]:
# Create family size feature
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
train_df['IsAlone'] = (train_df['FamilySize'] == 1).astype(int)

print("\n--- SibSp (Siblings/Spouses) Distribution ---")
print(train_df['SibSp'].value_counts().sort_index())

print("\n--- Parch (Parents/Children) Distribution ---")
print(train_df['Parch'].value_counts().sort_index())

print("\n--- Family Size Distribution ---")
print(train_df['FamilySize'].value_counts().sort_index())

print("\n--- Survival by Family Size ---")
family_survival = train_df.groupby('FamilySize')['Survived'].agg(['sum', 'count'])
family_survival['survival_rate'] = (family_survival['sum'] / family_survival['count'] * 100).round(1)
family_survival.columns = ['Survived', 'Total', 'Survival Rate %']
print(family_survival)

print("\n--- Alone vs With Family ---")
alone_survival = train_df.groupby('IsAlone')['Survived'].agg(['sum', 'count'])
alone_survival['survival_rate'] = (alone_survival['sum'] / alone_survival['count'] * 100).round(1)
alone_survival.index = ['With Family', 'Alone']
alone_survival.columns = ['Survived', 'Total', 'Survival Rate %']
print(alone_survival)



--- SibSp (Siblings/Spouses) Distribution ---
SibSp
0    608
1    209
2     28
3     16
4     18
5      5
8      7
Name: count, dtype: int64

--- Parch (Parents/Children) Distribution ---
Parch
0    678
1    118
2     80
3      5
4      4
5      5
6      1
Name: count, dtype: int64

--- Family Size Distribution ---
FamilySize
1     537
2     161
3     102
4      29
5      15
6      22
7      12
8       6
11      7
Name: count, dtype: int64

--- Survival by Family Size ---
            Survived  Total  Survival Rate %
FamilySize                                  
1                163    537             30.4
2                 89    161             55.3
3                 59    102             57.8
4                 21     29             72.4
5                  3     15             20.0
6                  3     22             13.6
7                  4     12             33.3
8                  0      6              0.0
11                 0      7              0.0

--- Alone vs With Family -
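
The per-size table above suggests three broad groups. As a worked check, the sketch below re-aggregates the printed counts into hypothetical Alone / Medium / Large categories (the cut points 1, 4, and 11 are an assumption made here, not something the notebook defines).

```python
import pandas as pd

# Survivors and totals per family size, copied from the table printed above.
family = pd.DataFrame({
    "FamilySize": [1, 2, 3, 4, 5, 6, 7, 8, 11],
    "Survived": [163, 89, 59, 21, 3, 3, 4, 0, 0],
    "Total": [537, 161, 102, 29, 15, 22, 12, 6, 7],
})

# Hypothetical grouping: 1 = Alone, 2-4 = Medium, 5 or more = Large.
family["Category"] = pd.cut(family["FamilySize"], bins=[0, 1, 4, 11],
                            labels=["Alone", "Medium", "Large"])

grouped = family.groupby("Category", observed=True)[["Survived", "Total"]].sum()
grouped["Survival Rate %"] = (grouped["Survived"] / grouped["Total"] * 100).round(1)
print(grouped)
```

The Large group contains only 62 passengers, so its rate is less stable than the other two.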

Visualizations for Family Size

In [52]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Family size distribution with survival
sns.countplot(data=train_df, x='FamilySize', hue='Survived', ax=axes[0], palette=['#e74c3c', '#2ecc71'])
axes[0].set_title('Survival by Family Size', fontweight='bold')
axes[0].legend(['No', 'Yes'], title='Survived')

# Survival rate by family size
family_rates = train_df.groupby('FamilySize')['Survived'].mean() * 100
family_rates.plot(kind='bar', ax=axes[1], color='#9b59b6')
axes[1].set_title('Survival Rate by Family Size', fontweight='bold')
axes[1].set_ylabel('Survival Rate (%)')
axes[1].set_xlabel('Family Size')
axes[1].axhline(y=survival_pct[1], color='red', linestyle='--')
for i, v in enumerate(family_rates.values):
    axes[1].text(i, v + 2, f'{v:.0f}%', ha='center', fontsize=9)

# Alone vs With Family
alone_rates = train_df.groupby('IsAlone')['Survived'].mean() * 100
alone_rates.index = ['With Family', 'Alone']
alone_rates.plot(kind='bar', ax=axes[2], color=['#2ecc71', '#e74c3c'])
axes[2].set_title('Survival Rate: Alone vs With Family', fontweight='bold')
axes[2].set_ylabel('Survival Rate (%)')
axes[2].set_xticklabels(['With Family', 'Alone'], rotation=0)
for i, v in enumerate(alone_rates.values):
    axes[2].text(i, v + 2, f'{v:.1f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.savefig('../reports/figures/family_size.png')
plt.close()

Combined Features Analysis

In [53]:
# Sex and Class combined
print("\n--- Survival by Sex AND Class ---")
sex_class = train_df.groupby(['Sex', 'Pclass'])['Survived'].agg(['sum', 'count'])
sex_class['survival_rate'] = (sex_class['sum'] / sex_class['count'] * 100).round(1)
sex_class.columns = ['Survived', 'Total', 'Survival Rate %']
print(sex_class)


--- Survival by Sex AND Class ---
               Survived  Total  Survival Rate %
Sex    Pclass                                  
female 1             91     94             96.8
       2             70     76             92.1
       3             72    144             50.0
male   1             45    122             36.9
       2             17    108             15.7
       3             47    347             13.5


Combined Feature Visualizations

In [54]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Sex and Class heatmap
pivot_table = train_df.pivot_table(values='Survived', index='Sex', columns='Pclass', aggfunc='mean') * 100
sns.heatmap(pivot_table, annot=True, fmt='.1f', cmap='RdYlGn', ax=axes[0], 
            cbar_kws={'label': 'Survival Rate (%)'})
axes[0].set_title('Survival Rate: Sex × Class', fontweight='bold')
axes[0].set_xticklabels(['1st', '2nd', '3rd'])

# Age and Class
sns.boxplot(data=train_df, x='Pclass', y='Age', hue='Survived', ax=axes[1], palette=['#e74c3c', '#2ecc71'])
axes[1].set_title('Age Distribution by Class & Survival', fontweight='bold')
axes[1].set_xticklabels(['1st', '2nd', '3rd'])
axes[1].legend(['No', 'Yes'], title='Survived')

# Embarkation and Class
pivot_embarked = train_df.pivot_table(values='Survived', index='Embarked', columns='Pclass', aggfunc='mean') * 100
sns.heatmap(pivot_embarked, annot=True, fmt='.1f', cmap='RdYlGn', ax=axes[2],
            cbar_kws={'label': 'Survival Rate (%)'})
axes[2].set_title('Survival Rate: Embarkation × Class', fontweight='bold')
axes[2].set_xticklabels(['1st', '2nd', '3rd'])

plt.tight_layout()
plt.savefig('../reports/figures/combined_features.png')
plt.close()

Correlation Analysis

In [55]:
# Create numeric version for correlation
train_numeric = train_df.copy()
train_numeric['Sex_numeric'] = (train_numeric['Sex'] == 'male').astype(int)

# Select columns for correlation
corr_cols = ['Survived', 'Pclass', 'Sex_numeric', 'Age', 'SibSp', 'Parch', 'Fare', 'FamilySize', 'IsAlone']
correlation_matrix = train_numeric[corr_cols].corr()

print("\nCorrelation with Survival:")
survival_corr = correlation_matrix['Survived'].drop('Survived').sort_values(ascending=False)
for feat, corr in survival_corr.items():
    # The '+' format flag prints an explicit sign for positive values
    print(f"  {feat:15s}: {corr:+.3f}")


Correlation with Survival:
  Fare           : +0.257
  Parch          : +0.082
  FamilySize     : +0.017
  SibSp          : -0.035
  Age            : -0.077
  IsAlone        : -0.203
  Pclass         : -0.338
  Sex_numeric    : -0.543
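
For two binary variables, the Pearson correlation reduces to the phi coefficient of their 2x2 contingency table. As a sanity check on the Sex_numeric value above, the sketch below recomputes it from the counts in the "Survival by Sex" table printed earlier, using only the standard library.

```python
import math

# 2x2 counts taken from the "Survival by Sex" table printed earlier.
male_survived, male_total = 109, 577
female_survived, female_total = 233, 314

male_died = male_total - male_survived        # 468
female_died = female_total - female_survived  # 81

# Phi coefficient with male coded as 1 and survived coded as 1.
numerator = male_survived * female_died - male_died * female_survived
denominator = math.sqrt(
    male_total * female_total
    * (male_survived + female_survived)   # all survivors: 342
    * (male_died + female_died)           # all deaths: 549
)
phi = numerator / denominator
print(f"phi = {phi:.3f}")
```

The result agrees with the -0.543 reported by `DataFrame.corr()`. The negative sign only reflects the coding choice (male = 1).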


Correlation Visualization

In [56]:
fig, ax = plt.subplots(figsize=(10, 8))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, mask=mask, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, ax=ax, square=True, linewidths=0.5)
ax.set_title('Feature Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('../reports/figures/correlation_figure.png')
plt.close()

Key Findings from Exploratory Data Analysis

1. SURVIVAL RATE
   - Overall survival rate: ~38.4% (minority class)
   - Dataset is moderately imbalanced (61.6% vs 38.4%) - accuracy alone can mislead, so also check precision, recall, or F1 during model evaluation

2. SEX (Strongest predictor)
   - Female survival: ~74%
   - Male survival: ~19%
   - Consistent with a "women and children first" evacuation policy

3. PASSENGER CLASS (Strong predictor)
   - 1st Class: ~63% survival
   - 2nd Class: ~47% survival  
   - 3rd Class: ~24% survival
   - Class, a proxy for socioeconomic status, correlates with survival (Pclass correlation: -0.338)

4. AGE
   - Children (0-12) had higher survival rates (~58%)
   - Elderly passengers (over 65) had the lowest survival rate (~12.5%), but the group has only 8 passengers
   - 177 missing values (20%) - needs imputation strategy

5. FAMILY SIZE
   - Medium families (2-4 members) had the best survival rates (~55-72%)
   - Solo travelers (~30%) and families of 5 or more (0-33%) had lower survival
   - Being alone: ~30% survival vs With family: ~50%

6. FARE
   - Higher fares correlated with survival (correlation: +0.257)
   - Likely confounded with passenger class

7. EMBARKATION
   - Cherbourg (C): ~55% survival (likely reflecting a larger share of 1st class passengers)
   - Queenstown (Q): ~39% survival
   - Southampton (S): ~34% survival

8. MISSING DATA PRIORITIES
   - Age: 177 missing (20%) - imputation needed
   - Cabin: 687 missing (77%) - consider dropping or engineering
   - Embarked: 2 missing - simple imputation (mode)

9. FEATURE ENGINEERING IDEAS
   - Extract title from Name (Mr, Mrs, Miss, Master, etc.)
   - Create FamilySize = SibSp + Parch + 1
   - Create IsAlone flag
   - Bin Age into groups
   - Extract Deck from Cabin (first letter)
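
The imputation and feature engineering ideas in items 8 and 9 could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the three names come from the rows shown earlier, but the Age, Cabin, and Embarked values are placeholders invented for the demo, and the regular expression and the "Unknown" sentinel are choices made here, not part of the notebook.

```python
import pandas as pd

# Three names from the rows shown earlier; the Age, Cabin, and Embarked
# values are placeholders for illustration, not the real records.
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Heikkinen, Miss. Laina",
             "Futrelle, Mrs. Jacques Heath (Lily May Peel)"],
    "Age": [22.0, None, 35.0],
    "Cabin": [None, None, "C123"],
    "Embarked": ["S", None, "S"],
})

# Title: the text between the comma and the first period in Name.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Deck: first character of Cabin, with a sentinel where Cabin is missing.
df["Deck"] = df["Cabin"].str[0].fillna("Unknown")

# Embarked: fill the few gaps with the most frequent port.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Age: median imputation is one simple option among several.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df[["Title", "Deck", "Embarked", "Age"]])
```

With 20% of ages missing, a grouped median (for example by Title or Pclass) may preserve more signal than a single global median, which is worth comparing during modeling.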