# Exploratory Data Analysis on Titanic Dataset

In this notebook, we will perform exploratory data analysis (EDA) on the Titanic dataset. The goal is to understand the dataset better, visualize relationships between variables, and extract meaningful insights.

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', None)
sns.set_theme(style='whitegrid')

## Load the Data
Let's load the Titanic dataset to begin our analysis.

In [2]:
# Load the dataset
data = pd.read_csv('../data/raw/titanic.csv')

# Display the first few rows of the dataset
data.head()

## Data Overview
Let's get an overview of the dataset, including its shape and data types.

In [3]:
# Check the shape of the dataset
data.shape

(891, 12)

In [4]:
# Check the data types and null values
data.info()
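`data.info()` lists non-null counts per column, but a compact table of missing values can be easier to compare. The sketch below is illustrative only: `toy` is a small hand-made DataFrame with hypothetical values (not the real Titanic rows), and `missing_summary` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic a few Titanic-style columns (not real data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [None, "C85", None, "C123", None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Show the columns with the most gaps first.
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

On the real dataset the same helper could be called as `missing_summary(data)`.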

## Descriptive Statistics
Let's generate descriptive statistics to understand the distribution of numerical features.

In [5]:
# Descriptive statistics
data.describe()
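By default `describe()` summarizes only the numeric columns. A minimal sketch of the complementary view for text columns follows; it assumes a hand-made DataFrame `toy` with hypothetical values, not the real Titanic rows.

```python
import pandas as pd

# Hypothetical toy rows (not real data): two text columns and one numeric column.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female"],
    "Embarked": ["S", "C", "S", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Excluding numeric dtypes leaves the text columns, for which describe()
# reports count, number of unique values, most frequent value, and its frequency.
cat_stats = toy.describe(exclude="number")
print(cat_stats)
```

Applied to the notebook's DataFrame, this would be `data.describe(exclude="number")`.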

## Data Visualization
Now, let's visualize the data to understand relationships and distributions.

### Distribution of Age
We will first look at the distribution of the 'Age' feature.

In [6]:
# Visualize the age distribution
plt.figure(figsize=(10, 6))
sns.histplot(data['Age'], bins=30, kde=True)
plt.title('Age Distribution of Passengers')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

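A histogram shows outliers only visually. One common way to flag them numerically is the 1.5 × IQR rule, sketched below. This is a swapped-in convention rather than something the notebook itself applies, and `toy_ages` holds hypothetical ages, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ages (not real data), including one missing value.
toy_ages = pd.Series([2, 22, 24, 26, 28, 30, 32, 35, 38, 80, np.nan], name="Age")

# Quartiles ignore missing values by default.
q1 = toy_ages.quantile(0.25)
q3 = toy_ages.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR below Q1 and above Q3.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Comparisons against NaN are False, so missing ages are never flagged.
outliers = toy_ages[(toy_ages < lower) | (toy_ages > upper)]
print(f"fences: [{lower}, {upper}]")
print(outliers)
```

For the notebook's data, `toy_ages` would be replaced by `data['Age']`.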

### Survival Rate by Gender
Next, we will examine the survival rate based on gender.

In [7]:
# Survival rate by gender
plt.figure(figsize=(8, 5))
# errorbar=None requires seaborn >= 0.12; older releases use ci=None instead
sns.barplot(x='Sex', y='Survived', data=data, errorbar=None)
plt.title('Survival Rate by Gender')
plt.ylabel('Survival Rate')
plt.xlabel('Gender')
plt.show()

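The bar heights in a plot like this are group means of the 0/1 `Survived` column, so the same numbers can be read off with a `groupby`. The sketch below assumes a hand-made DataFrame `toy` with hypothetical rows (not the real Titanic values).

```python
import pandas as pd

# Hypothetical toy rows (not real data).
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female", "male", "male"],
    "Survived": [0, 1, 0, 1, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
# The group size is kept alongside so small groups are easy to spot.
rates = toy.groupby("Sex")["Survived"].agg(rate="mean", n="size")
print(rates)
```

On the notebook's DataFrame the equivalent call would be `data.groupby('Sex')['Survived'].agg(rate='mean', n='size')`.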

### Survival Rate by Class
We will now explore the survival rate based on the passenger class (Pclass).

In [8]:
# Survival rate by class
plt.figure(figsize=(8, 5))
# errorbar=None requires seaborn >= 0.12; older releases use ci=None instead
sns.barplot(x='Pclass', y='Survived', data=data, errorbar=None)
plt.title('Survival Rate by Passenger Class')
plt.ylabel('Survival Rate')
plt.xlabel('Passenger Class')
plt.show()

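Class and gender can also be examined together. A minimal sketch using `pd.pivot_table` is shown below; it assumes a hand-made DataFrame `toy` with hypothetical rows, so the resulting rates say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows (not real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "female", "male", "female", "male",
            "male", "female", "female", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 1, 1, 0, 0, 0],
})

# Rows are passenger classes, columns are genders, and each cell is the
# mean of Survived (the survival rate) for that combination.
table = pd.pivot_table(
    toy, values="Survived", index="Pclass", columns="Sex", aggfunc="mean"
)
print(table)
```

Passing `data` instead of `toy` would produce the same layout for the full dataset.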

### Correlation Heatmap
Finally, let's visualize the correlation between numeric features using a heatmap.

In [9]:
# Correlation heatmap
plt.figure(figsize=(12, 8))
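# Keep numeric columns only: pandas >= 2.0 raises an error if text columns such as 'Sex' are passed to corr()
corr = data.select_dtypes(include='number').corr()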
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.title('Correlation Heatmap')
plt.show()

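A full heatmap can be hard to scan when the main interest is a single target column. The sketch below ranks numeric features by the absolute value of their correlation with `Survived`. It assumes a hand-made DataFrame `toy` with hypothetical rows (not real data), and the `Name` values are made up purely to show that text columns are dropped.

```python
import pandas as pd

# Hypothetical toy rows (not real data); "Name" stands in for any text column.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Survived": [0, 0, 1, 1],
    "Pclass": [3, 3, 1, 1],
    "Fare": [7.0, 8.0, 50.0, 80.0],
})

# Correlation is only defined for numeric columns, so select them first.
numeric = toy.select_dtypes(include="number")

# Take the Survived column of the correlation matrix, drop the trivial
# self-correlation, and sort by magnitude so strong negative values rank high too.
ranked = (
    numeric.corr()["Survived"]
    .drop("Survived")
    .sort_values(key=abs, ascending=False)
)
print(ranked)
```

Correlation with a binary column is only a rough screening measure, so a ranking like this is best treated as a starting point for further plots.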

## Key Insights
1. Female passengers had a higher survival rate than male passengers.
2. First-class passengers had the highest survival rate of the passenger classes.
3. Passenger ages are widely spread, with a few outlying values visible in the histogram.

This concludes the exploratory data analysis. Further steps include data preprocessing and model training.