# Module 9: Capstone Data Analysis Project - Titanic Survival Analysis


## 1. Identifying a Real-World Data Problem


The sinking of the Titanic is one of the most infamous shipwrecks in history. In this project, we aim to analyze the factors that might have influenced the survival of the passengers. Questions we might ask include: 
- Were certain passenger classes more likely to survive?
- Did age or gender play a role in survival rates?
- How did the fare paid influence survival?

By answering these questions, we hope to understand which passenger characteristics were associated with survival.


## 2. Data Collection and Cleaning


The Titanic dataset is publicly available and has been used extensively in the data science community. We'll start by loading this dataset and then proceed to clean and preprocess it to ensure its quality and relevance for our analysis.


In [None]:
import pandas as pd

# Load the Titanic dataset
df = pd.read_csv('https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')

# Display the first few rows of the dataset to understand its structure
print("\nOriginal df.head()")
df.head()
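
Before cleaning, it can help to confirm that the columns used later in this notebook are actually present. The sketch below is a minimal, hypothetical helper (`missing_columns` and `EXPECTED_COLUMNS` are names introduced here for illustration); the column names are the ones referenced in the cells of this notebook, and the one-row frame is invented toy data rather than a row from the dataset.

```python
import pandas as pd

# Columns referenced by later cells in this notebook
EXPECTED_COLUMNS = [
    "Survived", "Pclass", "Sex", "Age",
    "Siblings/Spouses Aboard", "Parents/Children Aboard", "Fare",
]

def missing_columns(frame, expected):
    """Return the expected column names that are absent from `frame`."""
    return [name for name in expected if name not in frame.columns]

# Invented one-row frame that deliberately lacks the two family columns
toy_frame = pd.DataFrame(
    {"Survived": [1], "Pclass": [3], "Sex": ["female"], "Age": [26.0], "Fare": [7.925]}
)
absent = missing_columns(toy_frame, EXPECTED_COLUMNS)
print("Missing columns:", absent)
```

With the real data, calling `missing_columns(df, EXPECTED_COLUMNS)` right after loading should return an empty list if the file has the layout this notebook assumes.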

In [None]:
# Display the dataset's basic information (data types, non-null counts, etc.)
print("\nBasic Information:")
df.info()

# Check for missing values
missing_values = df.isnull().sum()
print("Missing values for each column:")
print(missing_values)

# Fill missing 'Age' values with the median age.
# Assigning the result back avoids chained in-place modification, which newer pandas versions warn about.
if 'Age' in df.columns and df['Age'].isnull().any():
    median_age = df['Age'].median()
    df['Age'] = df['Age'].fillna(median_age)

# For categorical columns like 'Sex', fill missing values with the mode (most frequent value)
if 'Sex' in df.columns and df['Sex'].isnull().any():
    mode_sex = df['Sex'].mode()[0]
    df['Sex'] = df['Sex'].fillna(mode_sex)

# Check for duplicates and remove them
if df.duplicated().sum() > 0:
    df.drop_duplicates(inplace=True)
    print("Duplicates removed!")

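# Explore basic statistics of the dataset
# (print is needed here because only the last expression in a cell is displayed automatically)
print(df.describe())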

# Check for outliers in 'Fare' using the IQR method
Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
fare_outliers = df[(df['Fare'] < (Q1 - 1.5 * IQR)) | (df['Fare'] > (Q3 + 1.5 * IQR))]
print("Number of outliers in 'Fare':", len(fare_outliers))

# Handle outliers if necessary, for example:
# df = df[(df['Fare'] >= (Q1 - 1.5 * IQR)) & (df['Fare'] <= (Q3 + 1.5 * IQR))]

# Convert categorical columns to numerical values if needed for modeling (label encoding).
# For example, converting 'Sex' to numerical values: 0 for 'male' and 1 for 'female'
# df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Display the cleaned dataset
print("\nFinal df.head()")
df.head()
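
The IQR rule used for `Fare` above can be wrapped in a small reusable function. This is a minimal sketch, not part of the original notebook: `iqr_bounds` is a hypothetical helper name, and the fares are invented toy values chosen so the arithmetic is easy to follow.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy fares, invented for illustration only (not taken from the dataset)
toy_fares = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 500.0])

lower, upper = iqr_bounds(toy_fares)
outlier_mask = (toy_fares < lower) | (toy_fares > upper)

print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print("Flagged values:", toy_fares[outlier_mask].tolist())
```

A flagged value is not necessarily an error. On a ship with very different ticket classes, high fares may be genuine, which is one reason to count outliers first and decide about removal separately.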

## 3. In-Depth Exploratory Data Analysis


With the cleaned data in hand, we now turn to exploratory data analysis (EDA) to uncover patterns and trends and to understand the data's structure. We start with summary statistics and then examine each variable in turn.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Summary statistics
df.describe()

In [None]:
# Exploring the distribution of passenger ages
plt.hist(df['Age'].dropna(), bins=30)
plt.title('Distribution of Passenger Ages')
plt.xlabel('Age')
plt.ylabel('Number of Passengers')
plt.show()

In [None]:
# Visualizing the distribution of passenger classes
sns.countplot(x='Pclass', data=df)
plt.title('Distribution of Passenger Classes')
plt.xlabel('Passenger Class')
plt.ylabel('Number of Passengers')
plt.show()

In [None]:
# Visualizing the distribution of genders
sns.countplot(x='Sex', data=df)
plt.title('Distribution of Genders')
plt.xlabel('Gender')
plt.ylabel('Number of Passengers')
plt.show()

In [None]:
# Box plot of ages by survival status
sns.boxplot(x='Survived', y='Age', data=df)
plt.title('Age Distribution by Survival Status')
plt.xlabel('Survived')
plt.ylabel('Age')
plt.show()

In [None]:
# Box plot of fares by survival status
sns.boxplot(x='Survived', y='Fare', data=df)
plt.title('Fare Distribution by Survival Status')
plt.xlabel('Survived')
plt.ylabel('Fare')
plt.ylim(0, 150)  # Limiting y-axis to better visualize the majority of data points
plt.show()

In [None]:
# Visualizing the distribution of number of siblings/spouses aboard
sns.countplot(x='Siblings/Spouses Aboard', data=df)
plt.title('Distribution of Number of Siblings/Spouses Aboard')
plt.xlabel('Number of Siblings/Spouses')
plt.ylabel('Number of Passengers')
plt.show()

In [None]:
# Visualizing the distribution of number of parents/children aboard
sns.countplot(x='Parents/Children Aboard', data=df)
plt.title('Distribution of Number of Parents/Children Aboard')
plt.xlabel('Number of Parents/Children')
plt.ylabel('Number of Passengers')
plt.show()

In [None]:
# Scatter plot of age vs. fare
sns.scatterplot(x='Age', y='Fare', hue='Survived', data=df)
plt.title('Scatter Plot of Age vs. Fare')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.show()
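
The plots in this section can be complemented with numeric summaries. Because `Survived` is coded as 0 or 1, the mean of that column within a group equals the group's survival rate. The sketch below illustrates this on a small invented frame (`toy_passengers` is made-up data, not rows from the dataset); the same `groupby` calls should work on `df`, assuming it has the columns used above.

```python
import pandas as pd

# Invented toy data for illustration only
toy_passengers = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "female", "male",
                 "male", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate
rate_by_class = toy_passengers.groupby("Pclass")["Survived"].mean()
rate_by_sex = toy_passengers.groupby("Sex")["Survived"].mean()

print(rate_by_class.round(3))
print(rate_by_sex.round(3))
```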

## 4. Creating Comprehensive Data Visualizations


Visualizations play a crucial role in understanding the data and conveying insights. In this section, we'll create various visualizations to better understand the relationships between different variables and survival rates on the Titanic.


In [None]:
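# Count passengers in each (class, survival) combination
df.groupby(['Pclass', 'Survived']).size().unstack()

In [None]:
# Passenger counts per class, ordered by frequency
df['Pclass'].value_counts()

In [None]:
# The same counts, ordered by class number
df['Pclass'].value_counts().sort_index()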

In [None]:
# Calculate the total number of passengers in each passenger class
total_passengers_by_class = df['Pclass'].value_counts().sort_index()

# Count passengers by class and survival status, then convert each row to percentages of the class total
survival_percentage_by_class = df.groupby(['Pclass', 'Survived']).size().unstack()
survival_percentage_by_class_percentage = survival_percentage_by_class.div(total_passengers_by_class, axis=0) * 100

survival_percentage_by_class_percentage

In [None]:
# Recompute the totals so this cell can run on its own
total_passengers_by_class = df['Pclass'].value_counts().sort_index()

# Count passengers by class and survival status, then convert each row to percentages of the class total
survival_percentage_by_class = df.groupby(['Pclass', 'Survived']).size().unstack()
survival_percentage_by_class_percentage = survival_percentage_by_class.div(total_passengers_by_class, axis=0) * 100

# Create a stacked bar chart to visualize the survival percentages by passenger class
ax = survival_percentage_by_class_percentage.plot(kind='bar', stacked=True, figsize=(8, 6))
plt.title('Survival Percentage by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Percentage')
plt.xticks(rotation=0)
plt.legend(title='Survived', labels=['Not Survived', 'Survived'])

# Add labels for each bar segment
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height:.2f}%', (x + width/2, y + height/2), ha='center', va='center')

plt.show()
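
The count-then-divide steps above can also be written in one call with `pd.crosstab` and `normalize="index"`, which scales each row to sum to 1. The sketch below uses an invented toy frame (`toy_classes` is made-up data) and is an alternative formulation, not the method used in the cell above.

```python
import pandas as pd

# Invented toy data for illustration only
toy_classes = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Row-normalized cross-tabulation: each class row sums to 100 percent
class_pct = pd.crosstab(
    toy_classes["Pclass"], toy_classes["Survived"], normalize="index"
) * 100

print(class_pct.round(1))
```

Combinations that never occur show up as 0 rather than as missing values, which keeps a stacked bar chart built from the result free of gaps.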

In [None]:
# Calculate survival proportions by gender
gender_survival = df.groupby(['Sex', 'Survived']).size().unstack()
gender_survival_percentage = gender_survival.div(gender_survival.sum(axis=1), axis=0) * 100

# Plot a stacked bar chart
ax = gender_survival_percentage.plot(kind='bar', stacked=True, figsize=(8, 6))

# Add labels and title
plt.xlabel('Gender')
plt.ylabel('Percentage')
plt.title('Survival Proportions by Gender')
plt.xticks(rotation=0)

# Annotate the bars with percentage values
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height:.2f}%', (x + width/2, y + height/2), ha='center', va='center')

# Show the plot
plt.show()

In [None]:
import numpy as np

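# Define age bins in 10-year steps; the upper edge is 90 so that an age of exactly 80 is not dropped
age_bins = np.arange(0, 91, 10)

# Create a new column 'AgeGroup'; right=False makes each bin closed on the left, e.g. [0, 10)
df['AgeGroup'] = pd.cut(df['Age'], bins=age_bins, right=False, labels=[f'{start}-{start+9}' for start in age_bins[:-1]])

# Calculate survival proportions by age group
# (observed=True keeps only age groups that occur; fill_value=0 avoids NaN for empty combinations)
age_survival = df.groupby(['AgeGroup', 'Survived'], observed=True).size().unstack(fill_value=0)
age_survival_percentage = age_survival.div(age_survival.sum(axis=1), axis=0) * 100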

# Plot a stacked bar chart
ax = age_survival_percentage.plot(kind='bar', stacked=True, figsize=(10, 6))

# Add labels and title
plt.xlabel('Age Group')
plt.ylabel('Percentage')
plt.title('Survival Proportions by Age Group (10-Year Intervals)')

# Annotate the bars with percentage values
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height:.2f}%', (x + width/2, y + height/2), ha='center', va='center')

# Rotate x-axis labels for better visibility (pandas already labels the ticks with the age groups)
plt.xticks(rotation=45)

# Show the plot
plt.show()
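
One detail worth checking when binning with `pd.cut` is how the edges are treated. With `right=False`, each bin includes its left edge and excludes its right edge, so a value exactly equal to the last edge falls outside every bin and becomes `NaN`. The sketch below shows this on a few invented ages (`toy_ages` is made-up data).

```python
import numpy as np
import pandas as pd

# Invented ages for illustration only, including values that sit exactly on bin edges
toy_ages = pd.Series([0.42, 9.0, 10.0, 29.5, 79.0, 80.0])

# Edges 0, 10, ..., 90 give nine bins: [0, 10), [10, 20), ..., [80, 90)
edges = np.arange(0, 91, 10)
labels = [f"{start}-{start + 9}" for start in edges[:-1]]
groups = pd.cut(toy_ages, bins=edges, right=False, labels=labels)
print(groups.tolist())

# Stopping the edges at 80 leaves the value 80.0 without a bin
short_edges = np.arange(0, 81, 10)
short_groups = pd.cut(toy_ages, bins=short_edges, right=False, labels=labels[:-1])
print(short_groups.isna().tolist())
```

Comparing `df['AgeGroup'].isna().sum()` with `df['Age'].isna().sum()` is a quick way to confirm that no passenger was dropped by the binning.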

In [None]:
# Calculate survival proportions by the number of siblings/spouses aboard
# (fill_value=0 keeps combinations with no passengers as 0 instead of NaN)
sibling_survival = df.groupby(['Siblings/Spouses Aboard', 'Survived']).size().unstack(fill_value=0)
sibling_survival_percentage = sibling_survival.div(sibling_survival.sum(axis=1), axis=0) * 100

# Create a figure with two subplots (side by side)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot the stacked bar chart on the left subplot
ax1 = sibling_survival_percentage.plot(kind='bar', stacked=True, ax=ax1)
ax1.set_xlabel('Number of Siblings/Spouses Aboard')
ax1.set_ylabel('Percentage')
ax1.set_title('Survival Proportions by Number of Siblings/Spouses Aboard')

# Annotate the bars with percentage values
for p in ax1.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax1.annotate(f'{height:.2f}%', (x + width/2, y + height/2), ha='center', va='center')

# Plot a basic bar chart on the right subplot to show frequency
sibling_frequency = df['Siblings/Spouses Aboard'].value_counts().sort_index()
ax2 = sibling_frequency.plot(kind='bar', ax=ax2)
ax2.set_xlabel('Number of Siblings/Spouses Aboard')
ax2.set_ylabel('Frequency')
ax2.set_title('Frequency by Number of Siblings/Spouses Aboard')

# Rotate x-axis labels for better visibility
ax2.tick_params(axis='x', rotation=45)

# Show the combined plot
plt.tight_layout()
plt.show()

In [None]:
# Count passengers by number of parents/children aboard and survival status
# (fill_value=0 keeps combinations with no passengers as 0 instead of NaN)
survival_percentage_by_parch = df.groupby(['Parents/Children Aboard', 'Survived']).size().unstack(fill_value=0)
total_passengers_by_parch = df['Parents/Children Aboard'].value_counts().sort_index()

# Convert the counts to percentages of the total passengers in each category
survival_percentage_by_parch_percentage = survival_percentage_by_parch.div(total_passengers_by_parch, axis=0) * 100

# Create a figure with two subplots (side by side)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot the stacked bar chart on the left subplot
ax1 = survival_percentage_by_parch_percentage.plot(kind='bar', stacked=True, ax=ax1)
ax1.set_title('Survival Rates by Number of Parents/Children Aboard')
ax1.set_xlabel('Number of Parents/Children Aboard')
ax1.set_ylabel('Percentage (%)')
ax1.legend(title='Survived', labels=['Not Survived', 'Survived'], loc='upper right')

# Annotate the bars with percentages
for p in ax1.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax1.annotate(f'{height:.2f}%', (x + width/2, y + height/2), ha='center', va='center')

# Plot a basic bar chart on the right subplot to show frequency
ax2 = total_passengers_by_parch.plot(kind='bar', ax=ax2)
ax2.set_title('Frequency by Number of Parents/Children Aboard')
ax2.set_xlabel('Number of Parents/Children Aboard')
ax2.set_ylabel('Frequency')

# Rotate x-axis labels for better visibility
ax2.tick_params(axis='x', rotation=45)

# Show the combined plot
plt.tight_layout()
plt.show()
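
The two charts above treat siblings/spouses and parents/children separately. One way to compare solo travelers with passengers who had relatives aboard is to add the two counts and group the total into a few bands. The sketch below is an illustration under assumptions: `Relatives` and `TravelGroup` are hypothetical column names, the band boundaries (0, 1 to 3, more than 3) are an arbitrary choice, and the frame is invented toy data that says nothing about real outcomes.

```python
import pandas as pd

# Invented toy data for illustration only
toy_families = pd.DataFrame({
    "Siblings/Spouses Aboard": [0, 1, 0, 4, 1, 0, 0, 2],
    "Parents/Children Aboard": [0, 0, 0, 2, 1, 0, 2, 0],
    "Survived":                [0, 1, 0, 0, 1, 1, 1, 0],
})

# Total number of relatives aboard for each passenger
toy_families["Relatives"] = (
    toy_families["Siblings/Spouses Aboard"] + toy_families["Parents/Children Aboard"]
)

# Band the totals: 0 relatives, 1 to 3 relatives, more than 3 relatives
toy_families["TravelGroup"] = pd.cut(
    toy_families["Relatives"],
    bins=[-1, 0, 3, float("inf")],
    labels=["alone", "small family", "large family"],
)

# Survival rate and group size per band; the size shows how much data backs each rate
family_summary = toy_families.groupby("TravelGroup", observed=False)["Survived"].agg(["mean", "count"])
print(family_summary)
```

Reporting the count next to the rate matters here, because bands with only a handful of passengers can produce extreme percentages by chance.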

## 5. Drawing Data-Driven Insights


After thorough analysis and visualization, we can draw several insights from the data. These insights will help us understand the factors that influenced the survival rates on the Titanic.


- **Insight 1:** First-class passengers had the highest survival rate, and the rate dropped with each lower class.
- **Insight 2:** Women had a far higher survival rate than men, and young children survived more often than most adult age groups, which is consistent with a "women and children first" evacuation policy.
- **Insight 3:** Most passengers aboard the Titanic were in their late teens to early thirties, a relatively young group of travelers.
- **Insight 4:** Passengers with one or two relatives aboard (siblings, spouses, parents, or children) tended to survive at higher rates than those traveling alone, while passengers in large family groups fared worse. The large-group categories contain few passengers, so their percentages should be read with caution.
- **Insight 5:** Passengers who paid higher fares survived at a higher rate, possibly because fare is correlated with passenger class and cabin location.


To conclude our analysis, it's essential to effectively communicate our findings. This can be done through a detailed report, a presentation deck, or even an interactive dashboard. Remember, the key to a successful analysis is not just finding insights but also conveying them in a manner that's easily understandable by both technical and non-technical audiences.
