# Titanic Dataset - Visual and Statistical Exploration

## Objective
Extract insights using visual and statistical exploration of the Titanic dataset to understand passenger survival patterns and relationships between variables.

## Dataset Overview
The Titanic dataset contains information about passengers aboard the RMS Titanic, including demographics, ticket information, and survival status.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

## 1. Data Loading and Initial Exploration

In [None]:
# Load the dataset
df = pd.read_csv('train.csv')

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
df.head()

In [None]:
# Basic information about the dataset
print("Dataset Info:")
df.info()

In [None]:
# Statistical summary
print("Statistical Summary:")
df.describe()

In [None]:
# Check for missing values
print("Missing Values:")
missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100
missing_df = pd.DataFrame({'Missing Count': missing_data, 'Percentage': missing_percent})
missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

## 2. Categorical Variables Analysis

In [None]:
# Survival rate analysis
print("Survival Counts:")
survival_counts = df['Survived'].value_counts()
print(survival_counts)
print(f"\nSurvival Rate: {df['Survived'].mean():.2%}")

# Gender distribution
print("\nGender Distribution:")
print(df['Sex'].value_counts())

# Passenger class distribution
print("\nPassenger Class Distribution:")
print(df['Pclass'].value_counts().sort_index())

# Embarkation port distribution
print("\nEmbarkation Port Distribution:")
print(df['Embarked'].value_counts())
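
`value_counts()` drops missing values by default, so the embarkation counts above do not show passengers with no recorded port. A minimal sketch of a summary that keeps them visible and adds shares follows. The helper name `category_summary` and the toy series are hypothetical stand-ins for `df['Embarked']`.

```python
import numpy as np
import pandas as pd


def category_summary(series: pd.Series) -> pd.DataFrame:
    """Return counts and shares per category, keeping missing values visible."""
    counts = series.value_counts(dropna=False)
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})


# Hypothetical toy column standing in for df['Embarked']
toy_embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"], name="Embarked")
summary = category_summary(toy_embarked)
print(summary)
```

On the real data, the same call would be `category_summary(df['Embarked'])`.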

## 3. Survival Analysis by Categories

In [None]:
# Create subplots for categorical analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Survival by Gender
survival_by_sex = df.groupby('Sex')['Survived'].agg(['count', 'sum', 'mean'])
survival_by_sex['survival_rate'] = survival_by_sex['mean']
axes[0,0].bar(survival_by_sex.index, survival_by_sex['survival_rate'], color=['lightcoral', 'lightblue'])
axes[0,0].set_title('Survival Rate by Gender')
axes[0,0].set_ylabel('Survival Rate')
for i, v in enumerate(survival_by_sex['survival_rate']):
    axes[0,0].text(i, v + 0.02, f'{v:.2%}', ha='center')

# Survival by Class
survival_by_class = df.groupby('Pclass')['Survived'].mean()
axes[0,1].bar(survival_by_class.index, survival_by_class.values, color=['gold', 'silver', 'brown'])
axes[0,1].set_title('Survival Rate by Passenger Class')
axes[0,1].set_xlabel('Passenger Class')
axes[0,1].set_ylabel('Survival Rate')
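axes[0,1].set_xticks(survival_by_class.index)
# Bars are drawn at x = 1, 2, 3 (the Pclass values), so place the labels at those positions
for pclass, v in survival_by_class.items():
    axes[0,1].text(pclass, v + 0.02, f'{v:.2%}', ha='center')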

# Survival by Embarkation Port
survival_by_embarked = df.groupby('Embarked')['Survived'].mean()
axes[1,0].bar(survival_by_embarked.index, survival_by_embarked.values, color=['green', 'orange', 'purple'])
axes[1,0].set_title('Survival Rate by Embarkation Port')
axes[1,0].set_xlabel('Embarkation Port')
axes[1,0].set_ylabel('Survival Rate')
for i, v in enumerate(survival_by_embarked.values):
    axes[1,0].text(i, v + 0.02, f'{v:.2%}', ha='center')

# Age distribution by survival
survived = df[df['Survived'] == 1]['Age'].dropna()
not_survived = df[df['Survived'] == 0]['Age'].dropna()
axes[1,1].hist([survived, not_survived], bins=20, alpha=0.7, label=['Survived', 'Not Survived'], color=['green', 'red'])
axes[1,1].set_title('Age Distribution by Survival')
axes[1,1].set_xlabel('Age')
axes[1,1].set_ylabel('Frequency')
axes[1,1].legend()

plt.tight_layout()
plt.show()

print("Key Observations:")
print(f"- Female survival rate: {survival_by_sex.loc['female', 'survival_rate']:.2%}")
print(f"- Male survival rate: {survival_by_sex.loc['male', 'survival_rate']:.2%}")
print(f"- First class survival rate: {survival_by_class[1]:.2%}")
print(f"- Third class survival rate: {survival_by_class[3]:.2%}")

## 4. Numerical Variables Analysis

In [None]:
# Numerical variables analysis
numerical_cols = ['Age', 'SibSp', 'Parch', 'Fare']

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

for i, col in enumerate(numerical_cols):
    # Box plot by survival status
    df.boxplot(column=col, by='Survived', ax=axes[i])
    axes[i].set_title(f'{col} Distribution by Survival Status')
    axes[i].set_xlabel('Survived')
    axes[i].set_ylabel(col)

fig.suptitle('')  # clear the automatic "Boxplot grouped by Survived" title added by df.boxplot
plt.tight_layout()
plt.show()

# Statistical tests for numerical variables
print("Statistical Tests (t-test) for Numerical Variables:")
for col in numerical_cols:
    survived_data = df[df['Survived'] == 1][col].dropna()
    not_survived_data = df[df['Survived'] == 0][col].dropna()
    
    if len(survived_data) > 0 and len(not_survived_data) > 0:
        t_stat, p_value = stats.ttest_ind(survived_data, not_survived_data)
        print(f"{col}: t-statistic = {t_stat:.3f}, p-value = {p_value:.3g}")  # .3g keeps very small p-values from printing as 0.000
        print(f"  Survived mean: {survived_data.mean():.2f}, Not survived mean: {not_survived_data.mean():.2f}")
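
`stats.ttest_ind` assumes equal variances by default, and variables such as `Fare` are strongly right-skewed. As a robustness check, one could compare Welch's t-test (`equal_var=False`) with the rank-based Mann-Whitney U test. This is a sketch, not part of the original analysis: the two samples below are hypothetical log-normal draws standing in for the fares of survivors and non-survivors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed samples standing in for Fare by survival status
fare_survived = rng.lognormal(mean=3.5, sigma=1.0, size=200)
fare_not_survived = rng.lognormal(mean=3.0, sigma=0.8, size=300)

# Welch's t-test: compares means without assuming equal variances
welch_t, welch_p = stats.ttest_ind(fare_survived, fare_not_survived, equal_var=False)

# Mann-Whitney U: rank-based, so less sensitive to skew and outliers
u_stat, u_p = stats.mannwhitneyu(
    fare_survived, fare_not_survived, alternative="two-sided"
)

print(f"Welch t = {welch_t:.3f}, p = {welch_p:.3g}")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {u_p:.3g}")
```

If both tests agree with the standard t-test, the conclusion does not hinge on the equal-variance or normality assumptions.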

## 5. Correlation Analysis

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))
correlation_cols = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
correlation_matrix = df[correlation_cols].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=0.5)
plt.title('Correlation Matrix of Key Variables')
plt.tight_layout()
plt.show()

print("Key Correlations with Survival:")
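# Drop the self-correlation by label rather than by position, then rank by absolute value
survival_corr = correlation_matrix['Survived'].drop('Survived').sort_values(key=abs, ascending=False)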
for var, corr in survival_corr.items():
    print(f"{var}: {corr:.3f}")
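
The correlation matrix above contains only numeric columns, so `Sex` is not part of it. One way to include it, sketched here on hypothetical toy rows, is to encode it as a 0/1 indicator. The Pearson correlation of a binary indicator with `Survived` is then the point-biserial (phi) coefficient. The column name `IsFemale` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the Titanic data
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1],
    "Sex": ["female", "female", "male", "male", "female", "male", "female", "male"],
    "Pclass": [1, 2, 3, 3, 1, 2, 3, 1],
})

# Encode Sex as 0/1 so that it can enter a Pearson correlation
toy["IsFemale"] = (toy["Sex"] == "female").astype(int)

corr_with_survival = (
    toy[["Survived", "IsFemale", "Pclass"]].corr()["Survived"].drop("Survived")
)
print(corr_with_survival)
```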

## 6. Advanced Visualizations

In [None]:
# Pairplot for key numerical variables
# sns.pairplot creates its own figure, so a separate plt.figure() call would only leave an empty one behind
pairplot_data = df[['Survived', 'Age', 'Fare', 'SibSp', 'Parch']].dropna()
sns.pairplot(pairplot_data, hue='Survived', diag_kind='hist', plot_kws={'alpha': 0.6})
plt.suptitle('Pairplot of Numerical Variables by Survival Status', y=1.02)
plt.show()

In [None]:
# Multi-dimensional analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Survival by Class and Gender
survival_class_sex = df.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack()
survival_class_sex.plot(kind='bar', ax=axes[0,0], color=['lightcoral', 'lightblue'])
axes[0,0].set_title('Survival Rate by Class and Gender')
axes[0,0].set_ylabel('Survival Rate')
axes[0,0].legend(title='Gender')
axes[0,0].tick_params(axis='x', rotation=0)

# Age vs Fare colored by survival
scatter_data = df.dropna(subset=['Age', 'Fare'])
survived_scatter = scatter_data[scatter_data['Survived'] == 1]
not_survived_scatter = scatter_data[scatter_data['Survived'] == 0]
axes[0,1].scatter(not_survived_scatter['Age'], not_survived_scatter['Fare'], 
                 alpha=0.6, c='red', label='Not Survived', s=30)
axes[0,1].scatter(survived_scatter['Age'], survived_scatter['Fare'], 
                 alpha=0.6, c='green', label='Survived', s=30)
axes[0,1].set_xlabel('Age')
axes[0,1].set_ylabel('Fare')
axes[0,1].set_title('Age vs Fare by Survival Status')
axes[0,1].legend()

# Family size analysis
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
family_survival = df.groupby('FamilySize')['Survived'].agg(['count', 'mean'])
axes[1,0].bar(family_survival.index, family_survival['mean'], color='skyblue')
axes[1,0].set_xlabel('Family Size')
axes[1,0].set_ylabel('Survival Rate')
axes[1,0].set_title('Survival Rate by Family Size')
for i, v in enumerate(family_survival['mean']):
    axes[1,0].text(family_survival.index[i], v + 0.02, f'{v:.2f}', ha='center')

# Fare distribution by class
for pclass in [1, 2, 3]:
    class_fares = df[df['Pclass'] == pclass]['Fare'].dropna()
    axes[1,1].hist(class_fares, alpha=0.7, label=f'Class {pclass}', bins=20)
axes[1,1].set_xlabel('Fare')
axes[1,1].set_ylabel('Frequency')
axes[1,1].set_title('Fare Distribution by Passenger Class')
axes[1,1].legend()
axes[1,1].set_xlim(0, 200)  # Limit x-axis for better visualization

plt.tight_layout()
plt.show()

print("Family Size Analysis:")
print(family_survival)
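
Per-size survival rates can be noisy because the larger family sizes contain few passengers. A possible follow-up is to group `FamilySize` into coarser bands (alone, 2-4, more than 4) with `pd.cut`. The sketch below uses hypothetical toy rows, and the column name `FamilyType` is an assumption.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the Titanic data
toy = pd.DataFrame({
    "SibSp":    [0, 1, 0, 3, 4, 0, 1, 5],
    "Parch":    [0, 0, 0, 1, 2, 0, 2, 2],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Bins are right-inclusive: (0, 1], (1, 4], (4, inf]
toy["FamilyType"] = pd.cut(
    toy["FamilySize"],
    bins=[0, 1, 4, float("inf")],
    labels=["Alone", "Small (2-4)", "Large (>4)"],
)

family_type_survival = (
    toy.groupby("FamilyType", observed=False)["Survived"].agg(["count", "mean"])
)
print(family_type_survival)
```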

## 7. Age Group Analysis

In [None]:
# Create age groups
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100], 
                       labels=['Child', 'Teen', 'Young Adult', 'Adult', 'Senior'])

# Age group survival analysis
# observed=False keeps every age group in the result, even one with no passengers
age_group_survival = df.groupby('AgeGroup', observed=False)['Survived'].agg(['count', 'sum', 'mean'])
age_group_survival.columns = ['Total', 'Survived', 'Survival_Rate']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Survival rate by age group
age_group_survival['Survival_Rate'].plot(kind='bar', ax=ax1, color='lightgreen')
ax1.set_title('Survival Rate by Age Group')
ax1.set_ylabel('Survival Rate')
ax1.tick_params(axis='x', rotation=45)
for i, v in enumerate(age_group_survival['Survival_Rate']):
    ax1.text(i, v + 0.02, f'{v:.2%}', ha='center')

# Count by age group
age_group_survival[['Total', 'Survived']].plot(kind='bar', ax=ax2)
ax2.set_title('Passenger Count by Age Group')
ax2.set_ylabel('Count')
ax2.tick_params(axis='x', rotation=45)
ax2.legend(['Total', 'Survived'])

plt.tight_layout()
plt.show()

print("Age Group Analysis:")
print(age_group_survival)
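
`pd.cut` uses right-inclusive intervals by default, so with the bins above a 12-year-old is a `Child` and an 18-year-old is a `Teen`. Passengers with a missing age get no group and drop out of the grouped statistics. The following sketch, run on hypothetical ages chosen to sit on or near the bin edges, shows that behaviour.

```python
import numpy as np
import pandas as pd

bins = [0, 12, 18, 35, 60, 100]
labels = ["Child", "Teen", "Young Adult", "Adult", "Senior"]

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([0.42, 12, 12.5, 18, 35, 60, 80, np.nan])
groups = pd.cut(ages, bins=bins, labels=labels)

print(pd.DataFrame({"Age": ages, "AgeGroup": groups}))
```

Passing `right=False` to `pd.cut` would instead make the intervals left-inclusive, which moves the edge ages into the next group.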

## 8. Summary of Key Findings

In [None]:
# Summary statistics
print("=" * 60)
print("TITANIC DATASET - KEY FINDINGS SUMMARY")
print("=" * 60)

print(f"\n1. OVERALL SURVIVAL:")
print(f"   - Total passengers: {len(df)}")
print(f"   - Survivors: {df['Survived'].sum()} ({df['Survived'].mean():.1%})")
print(f"   - Non-survivors: {len(df) - df['Survived'].sum()} ({1 - df['Survived'].mean():.1%})")

print(f"\n2. GENDER IMPACT:")
female_survival = df[df['Sex'] == 'female']['Survived'].mean()
male_survival = df[df['Sex'] == 'male']['Survived'].mean()
print(f"   - Female survival rate: {female_survival:.1%}")
print(f"   - Male survival rate: {male_survival:.1%}")
print(f"   - Gender survival gap: {female_survival - male_survival:.1%}")

print(f"\n3. CLASS IMPACT:")
for pclass in [1, 2, 3]:
    class_survival = df[df['Pclass'] == pclass]['Survived'].mean()
    print(f"   - Class {pclass} survival rate: {class_survival:.1%}")

print(f"\n4. AGE IMPACT:")
child_survival = df[df['Age'] <= 12]['Survived'].mean()
adult_survival = df[df['Age'] > 12]['Survived'].mean()
print(f"   - Children (≤12) survival rate: {child_survival:.1%}")
print(f"   - Adults (>12) survival rate: {adult_survival:.1%}")

print(f"\n5. FAMILY SIZE IMPACT:")
alone_survival = df[df['FamilySize'] == 1]['Survived'].mean()
small_family_survival = df[(df['FamilySize'] >= 2) & (df['FamilySize'] <= 4)]['Survived'].mean()
large_family_survival = df[df['FamilySize'] > 4]['Survived'].mean()
print(f"   - Traveling alone: {alone_survival:.1%}")
print(f"   - Small family (2-4): {small_family_survival:.1%}")
print(f"   - Large family (>4): {large_family_survival:.1%}")

print(f"\n6. FARE IMPACT:")
high_fare_survival = df[df['Fare'] > df['Fare'].median()]['Survived'].mean()
low_fare_survival = df[df['Fare'] <= df['Fare'].median()]['Survived'].mean()
print(f"   - High fare (above median): {high_fare_survival:.1%}")
print(f"   - Low fare (at or below median): {low_fare_survival:.1%}")

print(f"\n7. DATA QUALITY:")
print(f"   - Missing Age values: {df['Age'].isnull().sum()} ({df['Age'].isnull().mean():.1%})")
print(f"   - Missing Cabin values: {df['Cabin'].isnull().sum()} ({df['Cabin'].isnull().mean():.1%})")
print(f"   - Missing Embarked values: {df['Embarked'].isnull().sum()}")

print("\n" + "=" * 60)

## 9. Conclusions and Insights

### Key Insights from the Analysis:

1. **Gender was the strongest predictor of survival**: Women had a significantly higher survival rate (~74%) compared to men (~19%), reflecting the "women and children first" evacuation protocol.

2. **Passenger class strongly influenced survival**: First-class passengers had the highest survival rate (~63%), followed by second-class (~47%) and third-class (~24%), indicating socioeconomic disparities in survival chances.

3. **Age played a protective role for children**: Children (≤12 years) had higher survival rates than adults, supporting the "women and children first" principle.

4. **Family size had a complex relationship with survival**: Passengers traveling alone or in very large families had lower survival rates compared to those in small to medium-sized families (2-4 members).

5. **Fare correlated with survival**: Higher fare passengers (proxy for wealth/class) had better survival chances, reinforcing the class-based survival pattern.

6. **Embarkation port showed some variation**: Passengers who embarked at Cherbourg (C) had the highest survival rate (~55%, versus ~39% for Queenstown and ~34% for Southampton), possibly due to a higher proportion of first-class passengers.
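
The class-mix explanation for the embarkation effect is a hypothesis that the notebook does not test directly. A row-normalized cross-tabulation of port against class would be one way to check it. The sketch below runs on hypothetical toy rows, so its numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy rows; column names and port codes follow the Titanic data
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "C", "S", "S", "S", "S", "Q", "Q"],
    "Pclass":   [1,   1,   1,   3,   1,   2,   3,   3,   3,   3],
})

# Share of each passenger class within each embarkation port (rows sum to 1)
class_mix = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")
print(class_mix.round(2))
```

On the real data, `pd.crosstab(df['Embarked'], df['Pclass'], normalize='index')` gives the corresponding table.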

### Statistical Significance:
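- In the two-sample t-tests, Age, Parch, and Fare differed significantly between survivors and non-survivors (p < 0.05); SibSp did not
- The correlation matrix shows moderate linear correlations between survival and class (-0.34) and fare (0.26). Gender is not part of that matrix because `Sex` is a non-numeric column; its effect shows up in the survival rates by gender instead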
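
The t-tests cover only the numeric columns. For a categorical variable such as `Sex`, a chi-square test of independence on the contingency table is a common alternative. The following is a minimal sketch on hypothetical toy counts, not a result from the Titanic data.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy rows: 20 women (15 survived) and 30 men (6 survived)
toy = pd.DataFrame({
    "Sex": ["female"] * 20 + ["male"] * 30,
    "Survived": [1] * 15 + [0] * 5 + [1] * 6 + [0] * 24,
})

contingency = pd.crosstab(toy["Sex"], toy["Survived"])

# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

print(contingency)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3g}")
```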

### Data Quality Notes:
- Age data was missing for ~20% of passengers
- Cabin information was missing for ~77% of passengers
- These missing values may introduce bias in age-related analyses
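
Whether the missing ages bias the results depends on whether missingness is related to other variables. A quick check is to compare the share of missing `Age` values across classes and survival outcomes. The sketch below uses hypothetical toy rows; large differences between groups on the real data would suggest that the ages are not missing at random.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names follow the Titanic data
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
    "Age":      [38.0, 35.0, 27.0, np.nan, np.nan, 22.0, np.nan, 4.0],
})

age_missing = toy["Age"].isna()

# The mean of a boolean mask is the share of missing values in each group
missing_by_class = age_missing.groupby(toy["Pclass"]).mean()
missing_by_survival = age_missing.groupby(toy["Survived"]).mean()

print("Share of missing Age by class:")
print(missing_by_class)
print("\nShare of missing Age by survival:")
print(missing_by_survival)
```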

This analysis reveals that survival on the Titanic was not random but was significantly influenced by social factors including gender, class, and age, reflecting the social norms and evacuation procedures of the early 20th century.