# Topic 7 - Data Analysis with Python

In this section, we'll perform **exploratory data analysis (EDA)** and **statistical analysis** using Python to gain insights into the Titanic dataset. This stage of the data science workflow helps us better understand the variables, identify trends and relationships, and form hypotheses that can guide future modeling or deeper investigation.

We'll make use of Python libraries such as `pandas`, `matplotlib`, and `seaborn` to:
- Explore data distributions
- Investigate relationships between features
- Answer data-driven questions
- Support our findings with visualizations and summary statistics

## Recap: Data Cleaning

Before diving into analysis, it’s important to confirm that the dataset has been properly cleaned. Here's a brief recap of the data cleaning steps we performed earlier:

1. **Missing values** were addressed in several columns:
   - `cabin`: Filled with the string `'None'` to indicate missing cabin info.
   - `age`: Filled with the **mean age** of all passengers.
   - `embarked`: Filled with the **most common port** (mode).
   - `fare`: Imputed with the **median fare** for passengers of the same class and embarkation point.

2. **Feature Engineering** included:
   - Extracting the `title` (e.g., Mr, Miss, etc.) from the `name` column.
   - Standardizing rare and variant titles into broader categories.

3. **Sanity checks**:
   - Verified that there are no remaining missing values in the dataset.
   
With these preprocessing steps complete, we are now ready to begin exploring the data in more depth.
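
The recap above can be sketched in code roughly as follows. This is a minimal sketch under stated assumptions, not the exact code from the earlier topic: the function name `clean_titanic`, the toy rows, the regular expression, and the title mapping are all made up for illustration, and the column names are taken from the recap.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping of variant titles to broader categories (an assumption,
# not necessarily the mapping used in the earlier topic).
TITLE_MAP = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'}


def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from the recap to a copy of df."""
    out = df.copy()
    # Missing cabin info becomes the literal string 'None'
    out['cabin'] = out['cabin'].fillna('None')
    # Missing ages become the mean age of all passengers
    out['age'] = out['age'].fillna(out['age'].mean())
    # Missing ports become the most common port (mode)
    out['embarked'] = out['embarked'].fillna(out['embarked'].mode()[0])
    # Missing fares become the median fare of the same class and port
    group_median = out.groupby(['pclass', 'embarked'])['fare'].transform('median')
    out['fare'] = out['fare'].fillna(group_median)
    # The title is the text between the comma and the first period in the name
    out['title'] = out['name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
    out['title'] = out['title'].replace(TITLE_MAP)
    return out


# Toy rows invented for illustration; these are not real passenger records
toy = pd.DataFrame({
    'name': ['Doe, Mr. John', 'Roe, Mrs. Jane', 'Poe, Miss. Ann', 'Moe, Mlle. Eve'],
    'pclass': [3, 1, 3, 1],
    'age': [22.0, 38.0, np.nan, 24.0],
    'fare': [7.25, 71.28, np.nan, 69.30],
    'cabin': [None, 'C85', None, None],
    'embarked': ['S', 'C', 'S', None],
})

cleaned = clean_titanic(toy)
print(cleaned.isna().sum())  # sanity check: missing values per column
```

Filling `embarked` before `fare` matters in this sketch, because the fare medians are grouped by embarkation port.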

## Univariate Analysis

In this section, we’ll examine individual features in the Titanic dataset to understand their distributions and characteristics. We’ll use a mix of statistical summaries and visualizations to answer key questions.

### What is the distribution of passengers' ages?

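Understanding the age distribution tells us about the demographics of Titanic passengers and reveals which age groups are over- or under-represented. Because missing ages were filled with the mean during cleaning, expect a spike at the mean age in the histogram; it reflects the imputation, not a real cluster of passengers.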

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set a consistent style
sns.set_theme(style='whitegrid')

# Plot the age distribution
plt.figure(figsize=(10, 6))
sns.histplot(titanic['age'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Passenger Ages')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

In [None]:
# Basic statistics for age
titanic['age'].describe()
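
To check which age groups are over- or under-represented, one option is to bin ages with `pd.cut` and compare the share of each group. The sketch below is only illustrative: the helper name `age_group_shares`, the bin edges, the group labels, and the toy ages are assumptions, not values from the dataset.

```python
import pandas as pd


def age_group_shares(ages: pd.Series,
                     edges=(0, 12, 18, 40, 60, 120),
                     labels=('child', 'teen', 'adult', 'middle-aged', 'senior')) -> pd.Series:
    """Return the percentage of passengers falling in each age group."""
    # right=False makes each bin include its left edge: [0, 12), [12, 18), ...
    groups = pd.cut(ages, bins=list(edges), labels=list(labels), right=False)
    shares = groups.value_counts(normalize=True).mul(100).round(2)
    # Sort by category order so the groups appear from youngest to oldest
    return shares.sort_index()


# Toy ages invented for illustration
toy_ages = pd.Series([2, 15, 22, 30, 35, 45, 61, 70])
shares = age_group_shares(toy_ages)
print(shares)
```

In the notebook, the same helper could be called as `age_group_shares(titanic['age'])`, assuming `titanic` is the cleaned DataFrame from the earlier steps.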

### How were passengers distributed by travel class?

The `pclass` feature records each passenger's ticket class, which serves as a proxy for socio-economic status (1 = upper, 2 = middle, 3 = lower). Let's see how many passengers were in each class.

In [None]:
# Countplot for travel class
plt.figure(figsize=(8, 5))
sns.countplot(x='pclass', data=titanic, palette='Set2')
plt.title('Passenger Count by Class')
plt.xlabel('Passenger Class')
plt.ylabel('Count')
plt.show()

In [None]:
# Percentage breakdown by class
titanic['pclass'].value_counts(normalize=True).mul(100).round(2)
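
Counts and percentages are often easier to read side by side. The following sketch combines both in one table. The helper name `frequency_table` and the toy values are assumptions for illustration. Note that `value_counts` drops missing values by default, which is harmless here because the cleaned data has none.

```python
import pandas as pd


def frequency_table(values: pd.Series) -> pd.DataFrame:
    """Return counts and percentage shares for each distinct value."""
    counts = values.value_counts().sort_index()
    percent = counts.div(counts.sum()).mul(100).round(2)
    return pd.DataFrame({'count': counts, 'percent': percent})


# Toy class labels invented for illustration
toy_classes = pd.Series([3, 1, 3, 2, 3, 1])
table = frequency_table(toy_classes)
print(table)
```

The same helper works for any categorical column, for example `frequency_table(titanic['pclass'])` or `frequency_table(titanic['embarked'])`, assuming `titanic` is the cleaned DataFrame.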

### What were the most common embarkation points?

Passengers boarded the Titanic at one of three ports: Cherbourg (C), Queenstown (Q), and Southampton (S). Let’s see how the passengers were distributed across these ports.

In [None]:
# Countplot for embarkation ports
plt.figure(figsize=(8, 5))
sns.countplot(x='embarked', data=titanic, palette='pastel')
plt.title('Passenger Count by Embarkation Port')
plt.xlabel('Embarkation Port')
plt.ylabel('Count')
plt.show()

In [None]:
# Relative frequencies
titanic['embarked'].value_counts(normalize=True).mul(100).round(2)

### How is fare distributed among passengers?

The `fare` column indicates how much each passenger paid for their ticket. Let’s look at the distribution and see if there are any extreme values or skewness.

In [None]:
# Fare distribution
plt.figure(figsize=(10, 6))
sns.histplot(titanic['fare'], bins=40, kde=True, color='lightgreen')
plt.title('Distribution of Fare Prices')
plt.xlabel('Fare')
plt.ylabel('Count')
plt.xlim(0, 300)  # Zoom in on fares up to 300; higher fares are hidden from view, not removed from the data
plt.show()

In [None]:
# Summary statistics for fare
titanic['fare'].describe()
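
Skewness and extreme values can also be quantified rather than judged by eye. The sketch below computes the sample skewness, the skewness after a `log1p` transform, and the number of values above the common 1.5 × IQR upper fence. The helper name `skew_and_outliers`, the choice of the 1.5 × IQR rule, and the toy fares are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def skew_and_outliers(values: pd.Series) -> dict:
    """Summarize skewness and high outliers using the 1.5 * IQR rule."""
    q1, q3 = values.quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    return {
        'skew': values.skew(),                  # > 0 indicates a long right tail
        'log1p_skew': np.log1p(values).skew(),  # skewness after log(1 + x)
        'upper_fence': upper_fence,
        'n_high_outliers': int((values > upper_fence).sum()),
    }


# Toy fares invented for illustration
toy_fares = pd.Series([7.25, 8.05, 7.90, 13.00, 26.00, 30.00, 512.33])
summary = skew_and_outliers(toy_fares)
print(summary)
```

A clearly positive skew suggests that a log transform or a robust statistic such as the median may describe typical fares better than the mean. In the notebook, the helper could be applied as `skew_and_outliers(titanic['fare'])`, assuming `titanic` is the cleaned DataFrame.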