[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DrFranData/PfDA/blob/main/Topic7.ipynb)
 

# Topic 7 - Data Analysis with Python

In this section, we'll perform **exploratory data analysis (EDA)** and **statistical analysis** using Python to gain insights into the Titanic dataset. This stage of the data science workflow helps us better understand the variables, identify trends and relationships, and form hypotheses that can guide future modeling or deeper investigation.

We'll make use of Python libraries such as `pandas`, `matplotlib`, and `seaborn` to:
- Explore data distributions
- Investigate relationships between features
- Answer data-driven questions
- Support our findings with visualizations and summary statistics

## Recap: Data Cleaning

Before diving into analysis, it’s important to confirm that the dataset has been properly cleaned. Here's a brief recap of the data-cleaning steps we performed earlier:

1. **Missing values** were addressed in several columns:
   - `cabin`: Filled with the string `'None'` to indicate missing cabin info.
   - `age`: Filled with the **mean age** of all passengers.
   - `embarked`: Filled with the **most common port** (mode).
   - `fare`: Imputed with the **median fare** for passengers of the same class and embarkation point.

2. **Feature Engineering** included:
   - Extracting the `title` (e.g., Mr, Miss, etc.) from the `name` column.
   - Standardizing rare and variant titles into broader categories.

3. **Sanity checks**:
   - Verified that there are no remaining missing values in the dataset.
   
With these preprocessing steps complete, we are now ready to begin exploring the data in more depth.
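
As a reminder of what those imputation steps look like in pandas, here is a minimal sketch. It is an illustration rather than the exact code used earlier: the helper name `clean_titanic` and the toy rows are hypothetical, and title extraction is left out because its rules are not restated in this recap.

```python
import numpy as np
import pandas as pd

def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the imputation steps from the recap to a copy of df (hypothetical helper)."""
    out = df.copy()
    out['cabin'] = out['cabin'].fillna('None')                        # flag missing cabin info
    out['age'] = out['age'].fillna(out['age'].mean())                 # mean age of all passengers
    out['embarked'] = out['embarked'].fillna(out['embarked'].mode().iloc[0])  # most common port
    # Median fare among passengers with the same class and embarkation point
    group_median = out.groupby(['pclass', 'embarked'])['fare'].transform('median')
    out['fare'] = out['fare'].fillna(group_median)
    return out

# Toy rows invented for illustration; they are not taken from the real dataset
toy = pd.DataFrame({
    'pclass':   [3, 3, 3, 1, 1],
    'cabin':    [np.nan, np.nan, np.nan, 'C85', np.nan],
    'age':      [22.0, np.nan, 30.0, 38.0, 50.0],
    'embarked': ['S', 'S', 'S', 'C', np.nan],
    'fare':     [7.25, np.nan, 8.05, 71.28, 80.0],
})

cleaned = clean_titanic(toy)

# Sanity check: count the missing values that remain
missing_after = int(cleaned.isna().sum().sum())
print(cleaned)
print('Missing values remaining:', missing_after)
```

Note that `embarked` is filled before `fare`, so every row belongs to a class/port group by the time the group medians are computed.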

## Univariate Analysis

In this section, we’ll examine individual features in the Titanic dataset to understand their distributions and characteristics. We’ll use a mix of statistical summaries and visualizations to answer key questions.

### What is the distribution of passengers' ages?

Understanding the age distribution helps us learn about the demographics of Titanic passengers and spot age groups that are over- or under-represented. Because missing ages were filled with the mean during cleaning, expect a spike at that value in the histogram.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set a consistent style (set_theme is the current name for sns.set)
sns.set_theme(style='whitegrid')

# Plot the age distribution
plt.figure(figsize=(10, 6))
sns.histplot(titanic['age'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Passenger Ages')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

In [None]:
# Basic statistics for age
titanic['age'].describe()
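
To see which age groups are over- or under-represented, it can help to bin ages into bands and look at each band's share. The sketch below assumes some illustrative cut points and labels (they are not defined anywhere in this notebook) and runs on a few made-up ages so that it is self-contained; with the real data you would pass `titanic['age']` instead.

```python
import pandas as pd

def age_band_shares(ages: pd.Series) -> pd.Series:
    """Return the percentage of passengers in each age band (hypothetical helper)."""
    bins = [0, 12, 18, 40, 60, 120]  # assumed cut points; adjust to taste
    labels = ['child', 'teen', 'adult', 'middle-aged', 'senior']
    # right=False makes each band include its lower edge, e.g. [18, 40)
    bands = pd.cut(ages, bins=bins, labels=labels, right=False)
    return bands.value_counts(normalize=True, sort=False).mul(100).round(1)

# Made-up ages for illustration only
toy_ages = pd.Series([2, 15, 22, 29, 29, 35, 47, 63], name='age')
shares = age_band_shares(toy_ages)
print(shares)
```

Keep in mind that the band containing the mean age will be inflated by the imputed values.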

### How were passengers distributed by travel class?

The `pclass` feature records the ticket class, which serves as a proxy for a passenger's socio-economic status (1 = upper, 2 = middle, 3 = lower). Let's see how many passengers were in each class.

In [None]:
# Countplot for travel class
plt.figure(figsize=(8, 5))
sns.countplot(x='pclass', data=titanic, palette='Set2')
plt.title('Passenger Count by Class')
plt.xlabel('Passenger Class')
plt.ylabel('Count')
plt.show()

In [None]:
# Percentage breakdown by class
titanic['pclass'].value_counts(normalize=True).mul(100).round(2)

### What were the most common embarkation points?

Passengers boarded the Titanic from three ports: Cherbourg (C), Queenstown (Q), and Southampton (S). Let’s see how the passengers were distributed across these ports.

In [None]:
# Countplot for embarkation ports
plt.figure(figsize=(8, 5))
sns.countplot(x='embarked', data=titanic, palette='pastel')
plt.title('Passenger Count by Embarkation Port')
plt.xlabel('Embarkation Port')
plt.ylabel('Count')
plt.show()

In [None]:
# Relative frequencies
titanic['embarked'].value_counts(normalize=True).mul(100).round(2)

### How is fare distributed among passengers?

The `fare` column indicates how much each passenger paid for their ticket. Let’s look at the distribution and see if there are any extreme values or skewness.

In [None]:
# Fare distribution
plt.figure(figsize=(10, 6))
sns.histplot(titanic['fare'], bins=40, kde=True, color='lightgreen')
plt.title('Distribution of Fare Prices')
plt.xlabel('Fare')
plt.ylabel('Count')
plt.xlim(0, 300)  # Zoom in on fares up to 300; higher fares stay in the data but fall outside the view
plt.show()

In [None]:
# Summary statistics for fare
titanic['fare'].describe()
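
The histogram gives a visual impression of skewness and extreme values; we can also quantify both. The sketch below uses the sample skewness and the common 1.5 × IQR rule. Both the rule and the helper name `skew_and_outliers` are choices made for illustration, and the fares are made up so that the example is self-contained.

```python
import pandas as pd

def skew_and_outliers(values: pd.Series, k: float = 1.5):
    """Return (skewness, upper fence, outliers) using the k * IQR rule."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = values[(values < lower) | (values > upper)]
    return values.skew(), upper, outliers

# Made-up fares for illustration only
toy_fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 71.28, 512.33], name='fare')
skewness, upper_fence, outliers = skew_and_outliers(toy_fares)

print(f'Skewness: {skewness:.2f}')        # positive means a long right tail
print(f'Upper fence: {upper_fence:.2f}')
print(outliers)
```

A positive skewness together with a handful of values above the upper fence is the numeric counterpart of a long right tail in the plot.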

## Multivariate Analysis

So far, we have explored the Titanic dataset using **univariate analysis**, which involves examining each variable individually. While this provides helpful insights, it doesn't show how variables interact with each other.

**Multivariate analysis** allows us to explore relationships **between multiple variables**. This is crucial because variables in real-world data are rarely independent: passenger survival, for instance, may depend not just on one factor (e.g., `sex`) but on several (like `sex`, `pclass`, and `age`).

By looking at how features interact, we can:
- Uncover hidden patterns
- Detect multicollinearity (when features are correlated with each other)
- Improve feature selection for modeling
- Gain a more holistic understanding of the data

Let's begin our multivariate exploration.

### What combinations of variables are associated with higher survival rates?

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

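# Create a heatmap of survival rate by sex and class.
# Assuming survived is coded 0/1, the mean of each cell is its survival rate
# (mean is also pivot_table's default aggregation; it is spelled out here for clarity).
pivot = titanic.pivot_table(index='sex', columns='pclass', values='survived', aggfunc='mean')
sns.heatmap(pivot, annot=True, cmap='YlGnBu', fmt='.2f')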
plt.title('Survival Rate by Sex and Class')
plt.ylabel('Sex')
plt.xlabel('Passenger Class')
plt.show()

We observe clear patterns:

- **Women had much higher survival rates** across all classes.
- **First-class passengers** had higher survival, regardless of sex.

The heatmap suggests that both `sex` and `pclass` are strongly associated with survival, and that looking at them together tells us more than looking at either one alone.
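
A survival rate is only as trustworthy as the number of passengers behind it, so it is worth checking group sizes alongside the rates. Here is a minimal sketch using `groupby` with named aggregations; the helper name `survival_summary` and the toy rows are hypothetical.

```python
import pandas as pd

def survival_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Group size, number of survivors, and survival rate per sex/class cell."""
    return (
        df.groupby(['sex', 'pclass'])['survived']
          .agg(n='count', survivors='sum', rate='mean')
          .round({'rate': 2})
    )

# Toy rows invented for illustration; survived is coded 0/1
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'female', 'male', 'male', 'male', 'male', 'male'],
    'pclass':   [1, 1, 3, 1, 1, 3, 3, 3],
    'survived': [1, 1, 0, 1, 0, 0, 0, 1],
})

summary = survival_summary(toy)
print(summary)
```

With the real data, the same call on `titanic` would show whether any cell of the heatmap rests on only a few passengers.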

### How does age interact with survival and passenger class?

In [None]:
sns.histplot(data=titanic, x='age', hue='survived', multiple='stack', bins=30)
plt.title('Survival Distribution by Age')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

From this histogram:

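- Young children appear to have survived at a higher rate, plausibly because of evacuation priorities.
- Passengers over about 60 appear to have survived at a low rate, although few passengers fall in this range.
- Survivors appear in nearly every age bin, so survival wasn't limited to any single group.

Because the bars are stacked counts, these rates are read from the relative heights of the two colors within each bar rather than from the bar heights themselves.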
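
Since a stacked histogram shows counts rather than rates, a quick way to check these impressions is to compute the survival rate per age band directly. The sketch below is one way to do that; the bin edges and the helper name `survival_rate_by_age` are assumptions, and the rows are made up for illustration.

```python
import pandas as pd

def survival_rate_by_age(df: pd.DataFrame, bins) -> pd.DataFrame:
    """Passenger count and survival rate within each age band."""
    bands = pd.cut(df['age'], bins=bins, right=False)  # bands like [0, 10)
    return df.groupby(bands, observed=True)['survived'].agg(n='count', rate='mean')

# Made-up rows for illustration; survived is coded 0/1
toy = pd.DataFrame({
    'age':      [1, 4, 8, 25, 30, 33, 45, 62, 70],
    'survived': [1, 1, 0, 0, 1, 0, 0, 0, 0],
})

rates = survival_rate_by_age(toy, bins=[0, 10, 40, 60, 100])
print(rates)
```

The `n` column matters here too: a rate computed from two or three passengers should be read with caution.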


In [None]:
sns.boxplot(data=titanic, x='pclass', y='age', hue='survived')
plt.title('Age Distribution by Class and Survival')
plt.xlabel('Passenger Class')
plt.ylabel('Age')
plt.show()

The boxplot reveals:

- In all classes, survivors tended to be younger than non-survivors.
- First class had older passengers overall, but age still played a role in survival likelihood.
- Third class shows a wide spread of ages and accounts for many of the non-survivors.


### Are there correlations among numerical features?

In [None]:
# Compute correlation matrix
corr = titanic[['age', 'fare', 'pclass', 'sibsp', 'parch', 'survived']].corr()

# Plot a heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

The correlation matrix shows:

- **Fare is negatively correlated with class**: because `pclass` is coded 1 = upper, a lower class number goes with a higher fare.
- **Fare has a weak positive correlation with survival**: passengers with more expensive tickets were slightly more likely to survive.
- **SibSp and Parch are positively correlated**, as expected (they both describe family connections).
- **Survival has modest correlations with fare and class**, and only weak ones with `age`, `sibsp`, and `parch`.
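
When a correlation matrix has many columns, it can be easier to list the strongest pairs than to scan the heatmap, which is also a simple first check for multicollinearity. The sketch below is one possible approach: the helper name `strongest_pairs` and the 0.5 threshold are arbitrary choices, and the toy values are invented.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy values invented for illustration only
toy = pd.DataFrame({
    'fare':   [80, 70, 30, 25, 8, 7, 9, 10],
    'pclass': [1, 1, 2, 2, 3, 3, 3, 3],
    'sibsp':  [0, 1, 0, 1, 0, 3, 0, 1],
    'parch':  [0, 1, 0, 1, 0, 2, 0, 1],
})

pairs = strongest_pairs(toy)
print(pairs)
```

On the real data you could pass the same column selection used for the heatmap, keeping in mind that Pearson correlation treats the ordinal `pclass` and the binary `survived` as plain numbers.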

Multivariate analysis reveals subtler trends that are not obvious from univariate views alone.