
# 🧭 Mini Case Study — *“Survival on the Titanic: An Exploratory Data Adventure”*

---

## 🎯 Objective
Suppose you are working as a **Data Analyst** to investigate the historic **Titanic dataset**.  
Your mission: explore, clean, and visualize the data to uncover **who was more likely to survive — and why**.  

By the end of this case study, you will be able to:  
- 🔍 Explore and describe the dataset  
- 🧹 Handle missing values  
- 📊 Analyze categorical and numerical variables  
- 🎨 Visualize patterns and relationships  
- 🧠 Interpret insights through storytelling  

In [25]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
titanic = sns.load_dataset("titanic")
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## 🚢 **Section A — Getting to Know the Ship’s Passengers**

Before diving deep, let’s understand **who boarded the Titanic** and what kind of information we have in the dataset.

---

### 🧩 Tasks

1. Display the **first five rows** of the dataset.  
2. Check the **total number of passengers** and **columns**.  
3. Use `.info()` and `.describe()` to summarize **data types** and **numeric statistics**.  
4. Identify columns with **missing values** — which features might need cleaning?  

---

### 🧾 Observation Summary

After completing the above tasks, **write a short summary (3–4 lines)** discussing your initial understanding of the dataset:  

- How many passengers and features are present?  
- What types of variables (categorical, numerical) do you notice?  
- Are there any missing or incomplete values?  
- What does your first impression of the data tell you about the passengers?
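
As a possible companion to Task 4, the sketch below ranks columns by how much data they are missing. The helper name `missing_summary` and the toy frame are assumptions made for illustration; they are not part of the dataset or of pandas.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one gap, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})
report = missing_summary(toy)
print(report)
```

Calling `missing_summary(titanic)` on the loaded dataset should give the same kind of table, which makes it easier to decide which columns to fill and which to drop.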




In [None]:
titanic.head(5)

In [None]:
# shape returns (number of rows, number of columns)
titanic.shape

In [None]:
titanic.info()

In [None]:
titanic.describe()

In [None]:
titanic.isnull().sum()

The Titanic dataset contains 891 passengers and 15 columns.
It includes numerical variables (such as `age`, `fare`, `sibsp`, and `parch`) and categorical variables (such as `sex`, `embarked`, and `class`); `pclass` is stored as an integer but is really a category.
Four columns have missing values: `deck` (688), `age` (177), and `embarked` and `embark_town` (2 each), so these need cleaning.
Overall, the data provides a mix of demographic, social, and ticket-related information that can help us understand which passengers were more likely to survive.

## 💾 **Section B — Fixing the Passenger Log**

Some parts of the passenger log were **damaged during the disaster**.  
Your task is to **handle missing values** to ensure data accuracy before moving ahead with the analysis.

---

### 🧩 Tasks

1. Count **missing values** in each column.  
2. Replace missing `age` values with the **mean age**.  
3. Fill missing `embarked` entries with the **most frequent port** (*mode*).  
4. Drop the `deck` column if it has **too many missing entries**.  
5. Write one line explaining **why handling missing data** is important before analysis.  

---

### 🧾 Observation Summary

After completing the above tasks, **write a short summary (3–4 lines)** describing what you observed and learned:  

- Which columns had missing values?  
- What strategies did you use to fill or remove them?  
- How did handling missing data change the completeness of your dataset?  
- Why is data cleaning an essential step in EDA?
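
To see why the choice of fill strategy matters, here is a minimal sketch of what mean imputation does to a column. The ages are made-up toy values, not taken from the dataset: the mean stays the same, but the spread shrinks because every filled value sits exactly at the centre.

```python
import numpy as np
import pandas as pd

# Toy ages with gaps (illustrative values, not taken from the dataset)
ages = pd.Series([2.0, 18.0, np.nan, 30.0, np.nan, 40.0, 60.0])

# Same strategy as Task 2: replace each gap with the column mean
filled = ages.fillna(ages.mean())

mean_before, mean_after = ages.mean(), filled.mean()
std_before, std_after = ages.std(), filled.std()

print(f"mean: {mean_before:.2f} -> {mean_after:.2f}")
print(f"std:  {std_before:.2f} -> {std_after:.2f}")
```

The same effect is worth keeping in mind when reading a histogram of `age` after cleaning: a cluster of identical filled values can appear as a spike at the mean.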


In [None]:
mean_age = titanic['age'].mean()
titanic['age'] = titanic['age'].fillna(mean_age)

In [None]:
mode_embarked = titanic['embarked'].mode()[0]
titanic['embarked'] = titanic['embarked'].fillna(mode_embarked)

In [None]:
if titanic['deck'].isnull().mean() > 0.5:
    titanic.drop(columns=['deck'], inplace=True)

In [None]:
print("\n Missing values after cleaning:")
print(titanic.isnull().sum())

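Handling missing data keeps calculations and plots from silently skipping rows, and choosing the fill strategy explicitly makes any bias it introduces visible before analysis and modeling.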

## 👩‍👦 **Section C — Who Were Onboard?**

Let’s explore the **demographics — class, gender, and age** — to understand who boarded the Titanic.  
This section will help us identify how passengers were distributed across different classes and age groups.

---

### 🧩 Tasks

- Identify which **class** had the most and fewest passengers.  
- Explore what **age group** (children, young adults, or older adults) appears most frequently.  
- Think about why there might be **more passengers in one class** than another.  
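
One way to approach the age-group task is to bin ages with `pd.cut` and count each group. The cut-offs below (0–12, 13–35, 36 and over) and the names `AGE_BINS`, `AGE_LABELS`, and `age_group_counts` are assumptions chosen for illustration; the case study does not define the groups.

```python
import pandas as pd

# Hypothetical cut-offs: 0-12 children, 13-35 young adults, 36+ older adults
AGE_BINS = [0, 12, 35, 120]
AGE_LABELS = ["child", "young adult", "older adult"]


def age_group_counts(ages: pd.Series) -> pd.Series:
    """Count how many ages fall into each labelled group."""
    # Bins are right-inclusive: (0, 12], (12, 35], (35, 120]; include_lowest adds 0
    groups = pd.cut(ages, bins=AGE_BINS, labels=AGE_LABELS, include_lowest=True)
    # Reindex so the result follows the label order rather than frequency order
    return groups.value_counts().reindex(AGE_LABELS)


# Toy ages for demonstration only
toy_ages = pd.Series([4, 12, 19, 22, 28, 35, 36, 54, 71])
counts = age_group_counts(toy_ages)
print(counts)
```

Applied to the real data, this would be `age_group_counts(titanic['age'])`; changing the cut-offs changes which group looks largest, so state the bins you used in your summary.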

---

### 💡 Hint: Use the following code
```python
# Visualization: Passenger class distribution
sns.countplot(data=titanic, x='class', palette='pastel')
plt.title("Passenger Class Distribution")
plt.xlabel("Passenger Class")
plt.ylabel("Count")
plt.show()

# Age distribution of passengers
sns.histplot(titanic['age'], bins=30, kde=True, color='skyblue')
plt.title("Distribution of Passenger Ages")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()
```

### 🧾 Observation Summary

After completing the above tasks, write a short summary (3–4 lines) describing your observations:

- Which passenger class had the highest and lowest counts?

- What does the age distribution tell you about who was on board?

- Do you notice any interesting trends between class and age?

In [None]:
# 1 Passenger Class Distribution
sns.countplot(data=titanic, x='class', palette='pastel')
plt.title("Passenger Class Distribution")
plt.xlabel("Passenger Class")
plt.ylabel("Count")
plt.show()

In [None]:
# 2 Age Distribution
sns.histplot(titanic['age'], bins=30, kde=True, color='skyblue')
plt.title("Distribution of Passenger Ages")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

In [None]:
# 3 Optional: Combine both insights
sns.boxplot(data=titanic, x='class', y='age', palette='Set2')
plt.title("Age Distribution Across Passenger Classes")
plt.xlabel("Passenger Class")
plt.ylabel("Age")
plt.show()


Most passengers traveled in Third Class (491), while Second Class had the fewest (184), slightly fewer than First Class (216).
The age distribution shows that the majority of passengers were young adults (around 20–35 years old), with fewer children and elderly people. Part of the tall bar near age 30 is an artifact of cleaning: the 177 missing ages were all filled with the mean (about 29.7).
First Class passengers tended to be older than those in Third Class, possibly reflecting socio-economic differences: wealthier, older travelers could afford higher-class tickets.

## 💰 Section D — Exploring Passenger Wealth

The Titanic had passengers from all walks of life.  
Let’s explore how **ticket prices (fares)** varied among different passenger classes and genders.

---

### 💡 Hint: Use the following code
```python
# Visualize fare distribution across classes
sns.boxplot(data=titanic, x='class', y='fare', palette='Set2')
plt.title("Fare Distribution by Class")
plt.show()

# Compare fares by class and gender
sns.boxplot(data=titanic, x='class', y='fare', hue='sex', palette='coolwarm')
plt.title("Fare by Class and Gender")
plt.show()
```

### 🧩 Questions

- Which passenger class paid the highest fares?

- Do you observe any outliers (unusually high fares)?

- What differences do you notice between male and female fares?

- What might explain these differences (e.g., class, cabin type, group size)?

In [None]:
# 1 Visualize fare distribution across classes
sns.boxplot(data=titanic, x='class', y='fare', palette='Set2')
plt.title("Fare Distribution by Class")
plt.xlabel("Passenger Class")
plt.ylabel("Fare")
plt.show()

In [None]:
# 2 Compare fares by class and gender
sns.boxplot(data=titanic, x='class', y='fare', hue='sex', palette='coolwarm')
plt.title("Fare by Class and Gender")
plt.xlabel("Passenger Class")
plt.ylabel("Fare")
plt.show()

Which passenger class paid the highest fares?

➤ First Class passengers paid the highest fares, while Third Class paid the least.

Do you observe any outliers (unusually high fares)?

➤ Yes — several outliers appear in the First Class boxplot, representing passengers who paid very expensive fares, possibly for private cabins or luxury suites.

What differences do you notice between male and female fares?

➤ Within each class, female passengers often show slightly higher median fares than males, especially in First Class.

What might explain these differences?

➤ The differences likely stem from ticket type, cabin location, or group/family bookings. Wealthier families and women traveling with companions might have purchased better accommodations or shared expensive cabins, increasing the average fare.
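
The outliers mentioned above are the points a boxplot draws beyond its whiskers. Assuming the default whisker rule of 1.5 × IQR, they can also be listed directly. The function name `iqr_outliers` and the toy fares are illustrative assumptions, not values from the dataset.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Toy fares for demonstration only
toy_fares = pd.Series([7.25, 7.90, 8.05, 13.00, 26.00, 30.00, 512.33])
outliers = iqr_outliers(toy_fares)
print(outliers)
```

To check one class at a time on the real data, something like `iqr_outliers(titanic.loc[titanic['class'] == 'First', 'fare'])` should work, since the boxplot computes its whiskers separately for each class.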

## 🧍‍♀️ **Section E — The Fate of the Passengers**

By the end of this section, you will be able to:  
- Analyze **survival rates** based on gender and class.  
- Visualize **patterns of survival** using count and bar plots.  
- Interpret the findings in a **storytelling format** that reflects real-world insights.

---

### 💡 Hint: Use the following code

```python
# Survival count by gender
ax = sns.countplot(data=titanic, x='sex', hue='survived', palette='coolwarm')
plt.title("Survival by Gender")
plt.xlabel("Gender")
plt.ylabel("Count")
# Reuse the plot's own legend handles so each label keeps its matching colour
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=["No", "Yes"], title="Survived")
plt.show()

# Cross-tabulation of class and survival (percentage within each class)
pd.crosstab(titanic['class'], titanic['survived'], normalize='index') * 100

# Grouped bar chart: Survival by class and gender
sns.barplot(data=titanic, x='class', y='survived', hue='sex', palette='viridis')
plt.title("Survival Rate by Class and Gender")
plt.ylabel("Survival Rate")
plt.show()
```


### 🧩 Questions

- Which **gender** had a higher survival rate?  
- Which **class** had the lowest chance of survival?  
- How do **class and gender together** affect survival chances?  
- What might explain these trends **historically or socially** (e.g., “Women and children first” policy)?  
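
For the question about class and gender together, a survival-rate table can be sketched with `groupby`. Because `survived` is coded 0/1, its mean within each group is the survival rate, which is also what `sns.barplot` plots with its default mean estimator. The toy frame below is made up for illustration and does not reproduce the real rates.

```python
import pandas as pd

# Toy passengers for demonstration only (not real records)
toy = pd.DataFrame({
    "class": ["First", "First", "First", "Third", "Third", "Third", "Third", "Third"],
    "sex": ["female", "female", "male", "female", "female", "male", "male", "male"],
    "survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate
rates = toy.groupby(["class", "sex"])["survived"].mean().unstack("sex")
print(rates.round(2))
```

On the seaborn frame, where `class` is a categorical column, passing `observed=True` to `groupby` should avoid a deprecation warning in recent pandas versions.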

---

### 🧾 **Observation Summary**

After answering the above questions, **write a short summary (3–4 lines)** discussing your findings:  

- How does survival differ by **gender**?  
- How does survival differ by **class**?  
- What **social or historical factors** might explain these patterns?

In [None]:
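# 1 Survival count by gender
ax = sns.countplot(data=titanic, x='sex', hue='survived', palette='coolwarm')
plt.title("Survival by Gender")
plt.xlabel("Gender")
plt.ylabel("Count")
# Reuse the plot's own legend handles so each label keeps its matching colour
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=["No", "Yes"], title="Survived")
plt.show()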


In [None]:
# 2 Cross-tabulation of class and survival (in %)
survival_by_class = pd.crosstab(titanic['class'], titanic['survived'], normalize='index') * 100
print(" Survival Percentage by Class:\n")
print(survival_by_class.round(2))

In [None]:
# 3 Grouped bar chart: Survival by class and gender
sns.barplot(data=titanic, x='class', y='survived', hue='sex', palette='viridis')
plt.title("Survival Rate by Class and Gender")
plt.xlabel("Passenger Class")
plt.ylabel("Survival Rate")
plt.show()

Q. Which gender had a higher survival rate?

➤ Females had a much higher survival rate than males (about 74% versus 19%).

Q. Which class had the lowest chance of survival?

➤ Third Class passengers had the lowest survival rate (about 24%), while First Class had the highest (about 63%).

Q. How do class and gender together affect survival chances?

➤ First Class women had the best chance of survival, while Third Class men had the worst.

Q. What might explain these trends historically or socially?

➤ The difference is consistent with the “Women and children first” protocol followed during the evacuation. Wealthier First Class passengers also likely had easier access to lifeboats because of where their cabins were located and the assistance they received from the crew.

### 🧾 Observation Summary

The analysis shows that women were far more likely to survive than men, which is consistent with the “Women and children first” evacuation protocol.
Survival rates were also highest in First Class and lowest in Third Class, suggesting that social status and cabin location affected safety.
Combining both factors, First Class females had the greatest survival advantage.
Together, these trends show how strongly gender and socio-economic class were associated with survival during the Titanic disaster.
