
# Titanic Survival Analysis

This project analyzes the Titanic dataset to explore key factors affecting passenger survival.

We will cover:
- Data Loading & Cleaning
- Exploratory Data Analysis (EDA)
- Survival Rates by Various Features
- Key Insights

---


## 1. Data Loading

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read the Excel file; each row is stored as one comma-separated string in the first column
df_raw = pd.read_excel('Titanic.xlsx', header=None)

# Split into columns
df_split = df_raw[0].str.split(',', expand=True)
df_split.columns = df_split.iloc[0]
df_clean = df_split.drop(index=0).reset_index(drop=True)

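# Clean column names
df_clean.columns = df_clean.columns.str.strip().str.lower()

# Empty fields come out of the split as '' rather than NaN; mark them as missing
df_clean = df_clean.replace('', np.nan)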

# Convert relevant columns to numeric
cols_to_numeric = ['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']
for col in cols_to_numeric:
    df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')

# Preview data
df_clean.head()
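
The split-and-promote-header steps above can be tried on a small in-memory frame before running them on the real file. The sketch below is illustrative only: the `raw` and `toy` frames and their four rows are made-up stand-ins for the sheet, not data from `Titanic.xlsx`. It also shows that empty fields come back from the split as empty strings, so they need to be marked as missing explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the single-column sheet (not real file contents)
raw = pd.DataFrame({0: [
    "survived,pclass,sex,age,fare,embark_town",
    "0,3,male,22.0,7.25,Southampton",
    "1,1,female,,71.2833,Cherbourg",
    "1,3,female,26.0,7.925,",
]})

# Split on commas, promote the first row to the header, then drop it from the data
parts = raw[0].str.split(",", expand=True)
parts.columns = parts.iloc[0].str.strip().str.lower()
toy = parts.drop(index=0).reset_index(drop=True)
toy.columns.name = None  # remove the leftover label from the header row

# Empty fields are '' after the split, so mark them as missing
toy = toy.replace("", np.nan)

# Convert the numeric columns; anything unparseable becomes NaN
for col in ["survived", "pclass", "age", "fare"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
print(toy.isnull().sum())
```

Without the `replace` step, the empty `embark_town` field would be counted as a valid (blank) category instead of a missing value.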


## 2. Data Cleaning

In [None]:

# Check for missing values
df_clean.isnull().sum()
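
Raw missing-value counts are easier to judge when shown next to the share of rows they represent. A minimal sketch, assuming a small made-up frame (`toy`) in place of `df_clean`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the notebook this would be df_clean
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": ["C", np.nan, np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 8.05],
})

# Count of missing values and percentage of rows missing, per column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})

# Show only the columns that have gaps, worst first
print(missing[missing["n_missing"] > 0].sort_values("pct_missing", ascending=False))
```

Columns with a large share of missing values may need different handling from columns with only a few gaps.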


## 3. Exploratory Data Analysis (EDA)

### 3.1 Gender Distribution

In [None]:

gender_counts = df_clean['sex'].value_counts()
print(gender_counts)

# Plot gender distribution
gender_counts.plot(kind='bar', color=['skyblue', 'salmon'])
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Number of Passengers')
plt.xticks(rotation=0)
plt.show()


### 3.2 Age Distribution

In [None]:

# Summary statistics for age
print(df_clean['age'].describe())

# Plot age distribution
plt.figure(figsize=(8,5))
sns.histplot(df_clean['age'], bins=20, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()


### 3.3 Embarkation Distribution

In [None]:

embark_counts = df_clean['embark_town'].value_counts()
print(embark_counts)

embark_counts.plot(kind='bar', color='orange')
plt.title('Passengers by Embarkation Town')
plt.xlabel('Embarkation Town')
plt.ylabel('Number of Passengers')
plt.xticks(rotation=15)
plt.show()


### 3.4 Survival Rate by Gender

In [None]:

survival_by_gender = df_clean.groupby('sex')['survived'].mean()
print((survival_by_gender * 100).round(2))

# Visualize
survival_by_gender.plot(kind='bar', color=['salmon', 'skyblue'])
plt.title('Survival Rate by Gender')
plt.xlabel('Gender')
plt.ylabel('Survival Rate')
plt.ylim(0, 1)
plt.xticks(rotation=0)
plt.show()
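
A survival rate is easier to interpret alongside the group size it is based on. The sketch below uses a made-up seven-row frame (`toy`) rather than the real data, and reports passenger count, survivor count, and rate in one table using named aggregation:

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be df_clean
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male"],
    "survived": [1, 1, 0, 0, 0, 1, 0],
})

# One row per group: how many passengers, how many survived, and the rate
summary = toy.groupby("sex")["survived"].agg(
    passengers="count",
    survivors="sum",
    survival_rate="mean",
)
summary["survival_rate"] = (summary["survival_rate"] * 100).round(2)
print(summary)
```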


### 3.5 Survival Rate by Passenger Class

In [None]:

survival_by_class = df_clean.groupby('pclass')['survived'].mean()
print((survival_by_class * 100).round(2))

# Visualize
survival_by_class.plot(kind='bar', color='green')
plt.title('Survival Rate by Class')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
plt.ylim(0, 1)
plt.xticks(rotation=0)
plt.show()
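
Gender and class can also be examined together, since a class-level rate may partly reflect the gender mix within that class. A minimal sketch with a made-up eight-row frame (`toy`), using `pivot_table` to get the survival rate for each class and gender combination:

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be df_clean
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows are classes, columns are genders, cells are mean survival
rates = toy.pivot_table(index="pclass", columns="sex",
                        values="survived", aggfunc="mean")
print((rates * 100).round(1))
```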


### 3.6 Average Fare Paid per Class

In [None]:

fare_by_class = df_clean.groupby('pclass')['fare'].mean()
print(fare_by_class.round(2))

fare_by_class.plot(kind='bar', color='purple')
plt.title('Average Fare per Class')
plt.xlabel('Passenger Class')
plt.ylabel('Average Fare')
plt.xticks(rotation=0)
plt.show()
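
Means are sensitive to a few very large values, so it can help to report the median next to the mean when comparing fares. The sketch below uses made-up fares (`toy`), including one deliberately extreme first-class value, to show how the two can diverge:

```python
import pandas as pd

# Hypothetical toy data with one extreme first-class fare
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "fare":   [30.0, 50.0, 500.0, 13.0, 26.0, 7.25, 7.9, 8.05],
})

# Mean, median, and group size per class
fare_summary = toy.groupby("pclass")["fare"].agg(["mean", "median", "count"]).round(2)
print(fare_summary)
```

In this toy example the first-class mean is pulled well above the median by a single fare, while the two agree for second class.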


### 3.7 Correlation between Fare and Survival

In [None]:

correlation = df_clean['fare'].corr(df_clean['survived'])
print(f"Correlation between fare and survival: {correlation:.2f}")

# Visualize
sns.boxplot(x='survived', y='fare', data=df_clean)
plt.title('Fare vs Survival')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Fare')
plt.show()
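
Because `survived` is binary, the Pearson correlation computed above is the same quantity as the point-biserial correlation: the gap between the mean fares of the two groups, scaled by the spread of fares and the group balance. The sketch below checks this on made-up values (`toy`); the numbers are illustrative, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; in the notebook this would be df_clean
toy = pd.DataFrame({
    "fare":     [7.25, 8.05, 13.0, 26.0, 53.1, 71.28, 7.9, 30.0],
    "survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Pearson correlation, as in the cell above
pearson = toy["fare"].corr(toy["survived"])

# Point-biserial form: (mean1 - mean0) / s * sqrt(p * (1 - p)),
# where s is the population standard deviation (ddof=0) of fare
m1 = toy.loc[toy["survived"] == 1, "fare"].mean()
m0 = toy.loc[toy["survived"] == 0, "fare"].mean()
p = toy["survived"].mean()
s = toy["fare"].std(ddof=0)
point_biserial = (m1 - m0) / s * np.sqrt(p * (1 - p))

print(f"Pearson:        {pearson:.4f}")
print(f"Point-biserial: {point_biserial:.4f}")
```

Read this way, the coefficient summarizes the same group difference that the boxplot shows.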


## 4. Key Findings


- The majority of passengers were male (~65%).
- The median passenger age was around 28 years.
- Most passengers embarked from Southampton.
- Females had a much higher survival rate (~74%) than males (~19%).
- First-class passengers had the highest survival rate (~63%).
- Passengers in higher classes paid substantially higher fares on average.
- A small positive correlation exists between fare paid and survival.

---

This completes the Titanic exploratory data analysis.
