# Titanic Dataset - Exploratory Data Analysis (EDA)

This notebook performs an exploratory data analysis of the Titanic dataset. The steps are data loading, cleaning, visualization, feature engineering, and a summary of insights.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style (set_theme is the current name for sns.set)
sns.set_theme(style='whitegrid')

## 1. Load the Titanic Dataset and Inspect Structure

In [None]:
# Load the dataset
data = pd.read_csv('Titanic-Dataset.csv')

# Display first 5 rows
data.head()

In [None]:
# Get info on the dataset
data.info()

In [None]:
# Summary statistics
data.describe(include='all')

In [None]:
# Check missing values
data.isnull().sum()
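
Raw missing counts are easier to judge as a share of all rows. The sketch below shows one way to compute that share. It runs on a small made-up frame (`toy`, not the real Titanic data), and the name `missing_share` is only illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the values are made up for illustration
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', np.nan],
})

# isnull() yields booleans, so the column mean is the fraction of missing rows
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real frame the same expression would be `data.isnull().mean()`. A column that is mostly missing is usually a candidate for an indicator feature rather than imputation.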

## 2. Data Cleaning and Handling Missing Values

In [None]:
# Fill missing Age values with the median age.
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which is deprecated in recent pandas and may not update the frame.
data['Age'] = data['Age'].fillna(data['Age'].median())

# Cabin has many missing values; create a 'HasCabin' flag (1 = cabin recorded, 0 = missing)
data['HasCabin'] = data['Cabin'].notna().astype(int)

# Fill missing Embarked values with the most frequent port (the mode)
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

# Verify missing values handled
data.isnull().sum()
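
The cleaning steps can be checked on a frame small enough to verify by hand. This is a minimal sketch, assuming the same column names as the dataset. The helper `clean_titanic` and the frame `toy` are hypothetical names introduced here, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Age, two missing Cabin, one missing Embarked
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 38.0],
    'Cabin': [np.nan, 'C85', np.nan, 'E46'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

def clean_titanic(frame):
    """Return a cleaned copy; the input frame is left untouched."""
    out = frame.copy()
    out['Age'] = out['Age'].fillna(out['Age'].median())
    out['HasCabin'] = out['Cabin'].notna().astype(int)
    out['Embarked'] = out['Embarked'].fillna(out['Embarked'].mode()[0])
    return out

cleaned = clean_titanic(toy)
print(cleaned.isnull().sum())
```

Cabin itself is never filled, so it still reports missing values after cleaning. Only Age and Embarked should drop to zero.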

## 3. Exploratory Data Analysis (Visualizations)

In [None]:
# Survived vs Sex count plot
plt.figure(figsize=(8,5))
sns.countplot(x='Sex', hue='Survived', data=data)
plt.title('Survival Count by Sex')
plt.show()

In [None]:
# Survived vs Pclass count plot
plt.figure(figsize=(8,5))
sns.countplot(x='Pclass', hue='Survived', data=data)
plt.title('Survival Count by Passenger Class')
plt.show()

In [None]:
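# Age distribution by Survival status
# Note: missing ages were filled with the median above, so expect a spike near the median
plt.figure(figsize=(10,6))
# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
sns.kdeplot(data.loc[data['Survived'] == 0, 'Age'], label='Did Not Survive', fill=True)
sns.kdeplot(data.loc[data['Survived'] == 1, 'Age'], label='Survived', fill=True)
plt.title('Age Distribution by Survival Status')
plt.xlabel('Age')
plt.legend()  # needed for the labels to appear
plt.show()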

In [None]:
# Fare distribution by Survival status
plt.figure(figsize=(10,6))
# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
sns.kdeplot(data.loc[data['Survived'] == 0, 'Fare'], label='Did Not Survive', fill=True)
sns.kdeplot(data.loc[data['Survived'] == 1, 'Fare'], label='Survived', fill=True)
plt.title('Fare Distribution by Survival Status')
plt.xlabel('Fare')
plt.legend()  # needed for the labels to appear
plt.show()

In [None]:
# Survival count by Embarked port
plt.figure(figsize=(8,5))
sns.countplot(x='Embarked', hue='Survived', data=data)
plt.title('Survival Count by Embarked Port')
plt.show()

## 4. Additional Feature Analysis

In [None]:
# Create FamilySize feature from SibSp and Parch
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1

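# Extract Title from Name: the first word followed by a period (e.g. 'Mr', 'Mrs')
# A raw string avoids the invalid-escape warning for '\.'
data['Title'] = data['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)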

# Display countplot of Survived by Title
plt.figure(figsize=(12,6))
sns.countplot(data=data, x='Title', hue='Survived', order=data['Title'].value_counts().index)
plt.title('Survival Count by Title')
plt.xticks(rotation=45)
plt.show()
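
Title extraction tends to produce a long tail of rare titles that clutters the count plot. One common option, not part of the analysis above, is to collapse infrequent titles into a single bucket. The sketch below illustrates this on made-up names in the dataset's `Surname, Title. Given names` format. The `'Rare'` label and the threshold of 2 are assumptions chosen for this toy example.

```python
import pandas as pd

# Made-up names in the same 'Surname, Title. Given names' format
names = pd.Series([
    'Smith, Mr. John',
    'Smith, Mrs. Jane',
    'Jones, Miss. Anna',
    'Brown, Mr. Tom',
    'Brown, Mrs. Ella',
    'Gray, Miss. Ida',
    'Reed, Don. Luis',
    'Hale, Rev. Paul',
])

# Same pattern as above: the first word that is followed by a period
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)

# Titles seen fewer than 2 times are treated as rare (threshold is arbitrary here)
counts = titles.value_counts()
rare = counts[counts < 2].index

# Keep common titles, replace rare ones with a single 'Rare' label
titles_grouped = titles.where(~titles.isin(rare), 'Rare')
print(titles_grouped.value_counts())
```

On the full dataset a larger threshold would likely be needed. The grouped column could then be passed to `sns.countplot` in place of `Title`.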

## 5. Summary and Insights

Based on the exploratory data analysis:
- Females survived at a higher rate than males.
- First- and second-class passengers survived at higher rates than third-class passengers.
- Younger passengers appeared somewhat more likely to survive.
- Passengers who paid higher fares tended to survive more often.
- Port of embarkation showed some association with survival.
- Title counts differ by survival status; FamilySize was engineered as a feature but was not plotted against survival above.
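
The plots above show counts, while the bullets talk about rates. Because `Survived` is coded 0/1, the mean of `Survived` within a group is that group's survival rate. The sketch below shows this on a made-up frame. The helper `survival_rate` is a hypothetical name, and the toy numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the Titanic data
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 0, 1, 0, 0, 1],
    'Sex': ['female', 'female', 'male', 'male', 'female', 'male', 'female', 'male'],
    'Pclass': [1, 2, 3, 3, 1, 2, 3, 1],
    'SibSp': [1, 0, 0, 3, 0, 0, 1, 0],
    'Parch': [0, 0, 0, 2, 0, 0, 1, 0],
})
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

def survival_rate(frame, column):
    """Mean of the 0/1 Survived flag per group, i.e. the survival rate."""
    return frame.groupby(column)['Survived'].mean()

rate_by_sex = survival_rate(toy, 'Sex')
rate_by_class = survival_rate(toy, 'Pclass')
rate_by_family = survival_rate(toy, 'FamilySize')
print(rate_by_sex, rate_by_class, rate_by_family, sep='\n\n')
```

Applied to the notebook's `data` frame, for example `survival_rate(data, 'FamilySize')`, this would put numbers behind each bullet, including the family-size relationship.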

This analysis provides a base for further modeling or deeper feature engineering.