# Titanic Dataset Exploratory Data Analysis

This notebook contains exploratory data analysis (EDA) of the Titanic dataset. We'll explore the features, their distributions, correlations, and insights to better understand the data before modeling.

## 1. Setup and Data Loading

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys

# Add the project root to path so we can import modules
sys.path.append(os.path.abspath('../'))

# Import project modules
from src.data_processing.data_loader import DataLoader

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

# Load data
data_loader = DataLoader()
train_data = data_loader.load_train_data()
test_data = data_loader.load_test_data()

# Display basic information
print(f"Training data shape: {train_data.shape}")
print(f"Testing data shape: {test_data.shape}")
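
The `DataLoader` class lives in the project's `src.data_processing` package and is not shown in this notebook. As a rough sketch of what such a loader might look like, assuming it simply wraps `pd.read_csv` over a directory holding `train.csv` and `test.csv` (the class name `CsvDataLoader`, the `data_dir` parameter and the file names are all assumptions, not the project's actual implementation):

```python
import tempfile
from pathlib import Path

import pandas as pd


class CsvDataLoader:
    """Hypothetical stand-in for the project's DataLoader."""

    def __init__(self, data_dir):
        self.data_dir = Path(data_dir)

    def _load(self, filename):
        # Fail early with a clear message if the file is missing
        path = self.data_dir / filename
        if not path.exists():
            raise FileNotFoundError(f"Expected data file not found: {path}")
        return pd.read_csv(path)

    def load_train_data(self):
        return self._load("train.csv")

    def load_test_data(self):
        return self._load("test.csv")


# Demonstrate with tiny made-up files written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]}).to_csv(
        Path(tmp) / "train.csv", index=False
    )
    pd.DataFrame({"PassengerId": [3]}).to_csv(Path(tmp) / "test.csv", index=False)
    loader = CsvDataLoader(tmp)
    demo_train = loader.load_train_data()
    demo_test = loader.load_test_data()

print(demo_train.shape, demo_test.shape)
```

Keeping file paths inside one class means the notebook cells never hard-code a location, which is presumably why the project routes loading through `DataLoader`.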

## 2. Data Overview

In [None]:
# Display first few rows of the training data
train_data.head()

In [None]:
# Get summary statistics
train_data.describe(include='all')

In [None]:
# Check data types
train_data.info()

## 3. Missing Value Analysis

In [None]:
# Check missing values in training data
train_missing = train_data.isnull().sum().sort_values(ascending=False)
train_missing_percent = (train_missing / len(train_data) * 100).round(2)
train_missing_df = pd.DataFrame({'Missing Count': train_missing, 'Missing Percent': train_missing_percent})
print("Missing values in training data:")
train_missing_df[train_missing_df['Missing Count'] > 0]

In [None]:
# Check missing values in test data
test_missing = test_data.isnull().sum().sort_values(ascending=False)
test_missing_percent = (test_missing / len(test_data) * 100).round(2)
test_missing_df = pd.DataFrame({'Missing Count': test_missing, 'Missing Percent': test_missing_percent})
print("Missing values in testing data:")
test_missing_df[test_missing_df['Missing Count'] > 0]

In [None]:
# Visualize missing values
plt.figure(figsize=(12, 6))
sns.heatmap(train_data.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Missing Values in Training Data')
plt.tight_layout()
plt.show()
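
The training and test cells above repeat the same three steps (count, percentage, filter). One way to avoid the duplication is a small helper; the name `missing_summary` and the toy frame below are illustrative assumptions, not part of the project:

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'Missing Count': counts,
        'Missing Percent': (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have missing values
    summary = summary[summary['Missing Count'] > 0]
    return summary.sort_values('Missing Count', ascending=False)


# Toy frame with made-up values, used only to show the output shape
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})
toy_missing = missing_summary(toy)
print(toy_missing)
```

With the real data this would be called as `missing_summary(train_data)` and `missing_summary(test_data)`.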

## 4. Survival Rate Analysis

In [None]:
# Overall survival rate
survival_rate = train_data['Survived'].mean() * 100
print(f"Overall survival rate: {survival_rate:.2f}%")

# Visualize survival distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='Survived', data=train_data, palette='Set2')
plt.title('Survival Distribution')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Count')

# Add count and percentage labels
# sort_index() keeps the counts in 0/1 order so each label sits on the matching bar
for i, count in enumerate(train_data['Survived'].value_counts().sort_index()):
    percentage = 100 * count / len(train_data)
    plt.text(i, count + 10, f"{count} ({percentage:.1f}%)", ha='center')

plt.show()

## 5. Categorical Feature Analysis

In [None]:
# Function to plot survival rate by categorical feature
def plot_survival_by_category(feature, title=None, figsize=(12, 6)):
    plt.figure(figsize=figsize)
    
    # Plot count
    plt.subplot(1, 2, 1)
    sns.countplot(x=feature, hue='Survived', data=train_data, palette='Set2')
    plt.title(f'Count by {feature}')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.legend(title='Survived', loc='upper right')
    
    # Plot survival rate
    plt.subplot(1, 2, 2)
    survival_rate = train_data.groupby(feature)['Survived'].mean().sort_values(ascending=False) * 100
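    # Pass order explicitly so the bars follow the sorted series and match the labels below
    sns.barplot(x=survival_rate.index, y=survival_rate.values, order=list(survival_rate.index), palette='Set2')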
    plt.title(f'Survival Rate by {feature}')
    plt.xlabel(feature)
    plt.ylabel('Survival Rate (%)')
    
    # Add percentage labels
    for i, rate in enumerate(survival_rate):
        plt.text(i, rate + 1, f"{rate:.1f}%", ha='center')
    
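    # Use the title argument, if given, as the overall figure title
    if title:
        plt.suptitle(title, y=1.02)
    plt.tight_layout()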
    plt.show()
    
    # Return the survival rate data for reference
    return survival_rate

In [None]:
# Analyze survival by Sex
sex_survival = plot_survival_by_category('Sex', 'Survival Rate by Sex')
print(f"Survival rates by Sex:\n{sex_survival}")

In [None]:
# Analyze survival by Pclass
pclass_survival = plot_survival_by_category('Pclass', 'Survival Rate by Passenger Class')
print(f"Survival rates by Passenger Class:\n{pclass_survival}")

In [None]:
# Analyze survival by Embarked
embarked_survival = plot_survival_by_category('Embarked', 'Survival Rate by Port of Embarkation')
print(f"Survival rates by Port of Embarkation:\n{embarked_survival}")

In [None]:
# Combined analysis of Sex and Pclass
# catplot is figure-level and creates its own figure, so no plt.figure() call is needed
g = sns.catplot(x='Pclass', hue='Survived', col='Sex', kind='count', data=train_data, palette='Set2', height=6, aspect=0.8)
g.fig.suptitle('Survival by Passenger Class and Sex', y=1.05, fontsize=16)
plt.tight_layout()
plt.show()
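
The count plot above shows how many passengers fall into each Sex and Pclass combination, but not the survival rate within each. A pivot table is one way to read those rates off directly. The rows below are made up for illustration; with the real data, `train_data` would take the place of `toy`:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'male', 'male', 'male', 'male', 'female'],
    'Pclass':   [1, 1, 3, 1, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 1, 1],
})

# Mean of a 0/1 column is the survival rate; multiply by 100 for percent
rate_table = toy.pivot_table(
    index='Sex', columns='Pclass', values='Survived', aggfunc='mean'
) * 100
print(rate_table.round(1))
```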

## 6. Numeric Feature Analysis

In [None]:
# Age distribution
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.histplot(train_data['Age'].dropna(), kde=True, bins=30)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
sns.boxplot(y='Age', x='Survived', data=train_data, palette='Set2')
plt.title('Age by Survival Status')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Age')

plt.tight_layout()
plt.show()

In [None]:
# Create age groups for analysis
train_data['AgeGroup'] = pd.cut(train_data['Age'], bins=[0, 12, 18, 35, 60, 100], labels=['Child', 'Teen', 'Young Adult', 'Adult', 'Senior'])

# Analyze survival by age group
age_group_survival = plot_survival_by_category('AgeGroup', 'Survival Rate by Age Group')
print(f"Survival rates by Age Group:\n{age_group_survival}")
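
By default `pd.cut` uses right-inclusive bins, so the edges above mean (0, 12], (12, 18], and so on: a 12-year-old is a 'Child' and an 18-year-old is a 'Teen'. Missing ages stay missing. A short check on hand-picked ages (chosen to sit on the bin edges, not taken from the dataset):

```python
import pandas as pd

ages = pd.Series([0.42, 12, 12.5, 18, 35, 60, 80, None])
groups = pd.cut(
    ages,
    bins=[0, 12, 18, 35, 60, 100],
    labels=['Child', 'Teen', 'Young Adult', 'Adult', 'Senior'],
)
# Values exactly on an edge fall into the lower bin; NaN stays NaN
print(pd.DataFrame({'Age': ages, 'AgeGroup': groups}))
```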

In [None]:
# Fare distribution
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.histplot(train_data['Fare'], kde=True, bins=30)
plt.title('Fare Distribution')
plt.xlabel('Fare')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
sns.boxplot(y='Fare', x='Survived', data=train_data, palette='Set2')
plt.title('Fare by Survival Status')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Fare')

plt.tight_layout()
plt.show()

In [None]:
# Create fare groups for analysis
train_data['FareGroup'] = pd.qcut(train_data['Fare'], 5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

# Analyze survival by fare group
fare_group_survival = plot_survival_by_category('FareGroup', 'Survival Rate by Fare Group')
print(f"Survival rates by Fare Group:\n{fare_group_survival}")
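
Unlike `pd.cut`, `pd.qcut` picks the bin edges from the data so that each group holds roughly the same number of rows. Passing `retbins=True` returns those edges, which helps when interpreting labels such as 'Low' or 'High'. The fares below are made-up values for illustration:

```python
import pandas as pd

fares = pd.Series([5, 7, 8, 10, 13, 15, 26, 30, 50, 100], dtype=float)
fare_groups, edges = pd.qcut(
    fares, 5,
    labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'],
    retbins=True,
)
print("Bin edges:", edges)
# sort=False keeps the groups in label order rather than by frequency
print(fare_groups.value_counts(sort=False))
```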

## 7. Family Feature Analysis

In [None]:
# Create family size feature
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1

# Analyze survival by family size
plt.figure(figsize=(12, 6))
family_survival = train_data.groupby('FamilySize')['Survived'].mean() * 100
family_counts = train_data['FamilySize'].value_counts()

plt.subplot(1, 2, 1)
sns.countplot(x='FamilySize', data=train_data, palette='viridis')
plt.title('Family Size Distribution')
plt.xlabel('Family Size')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
sns.barplot(x=family_survival.index, y=family_survival.values, palette='viridis')
plt.title('Survival Rate by Family Size')
plt.xlabel('Family Size')
plt.ylabel('Survival Rate (%)')

# Add percentage labels
for i, rate in enumerate(family_survival):
    plt.text(i, rate + 1, f"{rate:.1f}%", ha='center')

plt.tight_layout()
plt.show()

print(f"Survival rates by Family Size:\n{family_survival}")
print(f"\nFamily Size counts:\n{family_counts}")

In [None]:
# Create family group feature
train_data['FamilyGroup'] = pd.cut(train_data['FamilySize'], bins=[0, 1, 4, float('inf')], labels=['Alone', 'Small', 'Large'])

# Analyze survival by family group
family_group_survival = plot_survival_by_category('FamilyGroup', 'Survival Rate by Family Group')
print(f"Survival rates by Family Group:\n{family_group_survival}")

## 8. Title Analysis

In [None]:
# Extract title from name
# Raw string avoids the invalid-escape warning for '\.' on recent Python versions
train_data['Title'] = train_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# View title counts
print("Title counts:")
title_counts = train_data['Title'].value_counts()
print(title_counts)

# Group rare titles
title_mapping = {
    'Mr': 'Mr',
    'Miss': 'Miss',
    'Mrs': 'Mrs',
    'Master': 'Master',
    'Dr': 'Rare',
    'Rev': 'Rare',
    'Col': 'Rare',
    'Major': 'Rare',
    'Mlle': 'Miss',
    'Countess': 'Rare',
    'Ms': 'Miss',
    'Lady': 'Rare',
    'Jonkheer': 'Rare',
    'Don': 'Rare',
    'Dona': 'Rare',
    'Mme': 'Mrs',
    'Capt': 'Rare',
    'Sir': 'Rare'
}

# Titles missing from the mapping fall back to 'Rare' instead of becoming NaN
train_data['Title'] = train_data['Title'].map(title_mapping).fillna('Rare')

# Analyze survival by title
title_survival = plot_survival_by_category('Title', 'Survival Rate by Title')
print(f"\nSurvival rates by Title:\n{title_survival}")
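
The extraction pattern looks for a space, a run of letters, and a period, which matches the `Last, Title. First` layout of the `Name` column. A quick check on a few sample names in that layout, followed by a reduced mapping with a 'Rare' fallback (the short mapping here is a simplification for illustration, not the full one above):

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# First match of " <letters>." in each name
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())

# Anything outside the common titles is grouped as 'Rare'
common = {'Mr': 'Mr', 'Mrs': 'Mrs', 'Miss': 'Miss', 'Master': 'Master'}
grouped = titles.map(common).fillna('Rare')
print(grouped.tolist())
```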

## 9. Cabin Analysis

In [None]:
# Check if Cabin is available
train_data['HasCabin'] = train_data['Cabin'].notna().astype(int)

# Analyze survival by Cabin availability
cabin_survival = plot_survival_by_category('HasCabin', 'Survival Rate by Cabin Availability')
print(f"Survival rates by Cabin Availability (0 = No Cabin, 1 = Has Cabin):\n{cabin_survival}")

In [None]:
# Extract cabin deck (first letter)
train_data['Deck'] = train_data['Cabin'].str[0]

# Print deck counts
deck_counts = train_data['Deck'].value_counts()
print("Deck counts:")
print(deck_counts)

# Filter for decks with more than 10 passengers
common_decks = deck_counts[deck_counts > 10].index
deck_data = train_data[train_data['Deck'].isin(common_decks)].copy()

# Analyze survival by deck
plt.figure(figsize=(12, 6))
deck_survival = deck_data.groupby('Deck')['Survived'].mean().sort_values(ascending=False) * 100

sns.barplot(x=deck_survival.index, y=deck_survival.values, palette='viridis')
plt.title('Survival Rate by Deck')
plt.xlabel('Deck')
plt.ylabel('Survival Rate (%)')

# Add percentage labels
for i, rate in enumerate(deck_survival):
    plt.text(i, rate + 1, f"{rate:.1f}%", ha='center')

plt.tight_layout()
plt.show()

print(f"\nSurvival rates by Deck:\n{deck_survival}")
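
Because survival rates for decks with only a handful of passengers are noisy, it can help to keep the group size next to each rate rather than filtering first and losing it. A sketch using named aggregation on made-up rows (the threshold of 1 is only for this toy data; the cell above uses 10):

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Deck':     ['B', 'B', 'B', 'C', 'C', 'T'],
    'Survived': [1, 1, 0, 1, 0, 0],
})

deck_summary = (
    toy.groupby('Deck')['Survived']
       .agg(passengers='count', survival_rate='mean')
       .assign(survival_rate=lambda d: d['survival_rate'] * 100)
)

# Keep only decks with more than one passenger in this toy example
filtered = deck_summary[deck_summary['passengers'] > 1]
print(filtered.round(1))
```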

## 10. Correlation Analysis

In [None]:
# Prepare data for correlation analysis
corr_data = train_data.copy()

# Convert relevant categorical variables to numeric
corr_data['Sex'] = corr_data['Sex'].map({'male': 0, 'female': 1})
corr_data = pd.get_dummies(corr_data, columns=['Title', 'FamilyGroup', 'Embarked'], drop_first=False)

# Select numeric columns for correlation
numeric_cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 
                'FamilySize', 'HasCabin'] + \
               [col for col in corr_data.columns if col.startswith('Title_') or 
                col.startswith('FamilyGroup_') or col.startswith('Embarked_')]

# Calculate correlation with Survived
correlation_with_survived = corr_data[numeric_cols].corr()['Survived'].sort_values(ascending=False)
print("Correlation with Survived:")
print(correlation_with_survived)
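
Several of these columns (Sex, HasCabin, the dummy columns) are binary, as is Survived. For two binary variables the Pearson correlation computed by `.corr()` equals the phi coefficient of their 2x2 contingency table, which gives a concrete way to interpret the numbers above. A small check on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up 0/1 data for illustration only
toy = pd.DataFrame({
    'Sex':      [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})
pearson = toy['Sex'].corr(toy['Survived'])

# Phi coefficient from the 2x2 table: (n11*n00 - n10*n01) / sqrt(product of margins)
table = pd.crosstab(toy['Sex'], toy['Survived'])
n00, n01 = table.loc[0, 0], table.loc[0, 1]
n10, n11 = table.loc[1, 0], table.loc[1, 1]
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n10 + n11) * (n00 + n01) * (n01 + n11) * (n00 + n10)
)
print(f"Pearson r = {pearson:.4f}, phi = {phi:.4f}")
```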

In [None]:
# Plot correlation matrix of key features
key_features = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'FamilySize', 'HasCabin']
plt.figure(figsize=(12, 10))
correlation_matrix = corr_data[key_features].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Key Features')
plt.tight_layout()
plt.show()

## 11. Key Insights Summary

Based on our exploratory analysis, here are the key insights:

1. **Gender**: Being female was the strongest predictor of survival. Women had a much higher survival rate compared to men.

2. **Class**: Survival rate fell with passenger class (Pclass): first-class passengers had the highest rate, followed by second class, with third class lowest.

3. **Age**: Children had higher survival rates than adults. Age patterns varied by gender and class.

4. **Title**: Titles extracted from names showed clear patterns. "Mrs" and "Miss" had high survival rates, while "Mr" had a low survival rate.

5. **Family Size**: Passengers traveling with small families (2-4 members) had higher survival rates than those traveling alone or with large families.

6. **Fare**: Higher fare was associated with higher survival rate, which is also related to passenger class.

7. **Cabin**: Passengers with cabin information recorded had higher survival rates, suggesting they were higher-status passengers.

8. **Embarkation Port**: Passengers who embarked from Cherbourg (C) had higher survival rates than those from Queenstown (Q) or Southampton (S).

This analysis suggests that social status (class, fare, cabin) and demographic factors (gender, age) were key determinants of survival. These insights will be useful for feature engineering and model selection.

## 12. Next Steps

Based on our EDA, here are the next steps for our modeling process:

1. **Feature Engineering**:
   - Create title features from names
   - Create family size and family group features
   - Create cabin features (deck, has_cabin)
   - Create age groups
   - Create fare groups

2. **Data Preprocessing**:
   - Handle missing values based on our analysis
   - Encode categorical variables
   - Scale numerical features

3. **Feature Selection**:
   - Use the correlation analysis to guide feature selection
   - Consider feature importance from tree-based models

4. **Model Training**:
   - Try different models including logistic regression, random forest, gradient boosting
   - Optimize hyperparameters
   - Evaluate performance using cross-validation

5. **Model Evaluation**:
   - Compare model performance
   - Analyze feature importance
   - Generate predictions for the test set

These steps will help us build accurate predictive models for Titanic survival.
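
As a starting point for the feature engineering step, the family, cabin and title features built during this EDA could be gathered into one function so that the training and test sets receive identical transformations. This is a minimal sketch, not the project's implementation: the function name `add_engineered_features` is hypothetical, the rare-title handling is simplified to "anything other than the four common titles", and the toy rows are made up.

```python
import pandas as pd


def add_engineered_features(df):
    """Return a copy of df with the EDA-derived features added."""
    out = df.copy()

    # Family features
    out['FamilySize'] = out['SibSp'] + out['Parch'] + 1
    out['FamilyGroup'] = pd.cut(
        out['FamilySize'], bins=[0, 1, 4, float('inf')],
        labels=['Alone', 'Small', 'Large'],
    )

    # Cabin features
    out['HasCabin'] = out['Cabin'].notna().astype(int)
    out['Deck'] = out['Cabin'].str[0]

    # Title: normalise French/alternate forms, then group the rest as 'Rare'
    title = out['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    title = title.replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
    out['Title'] = title.where(title.isin(['Mr', 'Miss', 'Mrs', 'Master']), 'Rare')
    return out


# Made-up rows for illustration only
toy = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Doe, Mlle. Jane', 'Roe, Dr. Alex'],
    'SibSp': [1, 0, 4],
    'Parch': [0, 0, 2],
    'Cabin': [None, 'C85', None],
})
features = add_engineered_features(toy)
print(features[['FamilySize', 'FamilyGroup', 'HasCabin', 'Deck', 'Title']])
```

Age and fare groups are left out here because their bin edges (especially the quantile-based fare bins) would need to be fitted on the training data and reused on the test set.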