# Lab 1: Exploratory Data Analysis (EDA) and Feature Engineering
# Case Study: Titanic Survival Analysis

## Author: [Mariano Garralda]
### Academic Year: 2024-2025

---

## Table of Contents
1. [Problem Understanding and Objectives](#1)
2. [Dataset Description](#2)
3. [Exploratory Data Analysis (EDA)](#3)
    - [3.1 Data Loading and Overview](#3.1)
    - [3.2 Data Cleaning](#3.2)
    - [3.3 Univariate Analysis](#3.3)
    - [3.4 Bivariate Analysis](#3.4)
    - [3.5 Correlation Analysis](#3.5)
4. [Feature Engineering](#4)
    - [4.1 Handling Missing Values](#4.1)
    - [4.2 Feature Creation](#4.2)
    - [4.3 Feature Transformation](#4.3)
    - [4.4 Feature Encoding](#4.4)
    - [4.5 Feature Scaling](#4.5)
    - [4.6 Feature Selection](#4.6)
5. [Conclusion](#5)
6. [References](#6)


<a id='1'></a>

## 1. Problem Understanding and Objectives

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, leading to the deaths of 1,502 out of 2,224 passengers and crew.

**Objective:**  
The overall goal of this case study is a predictive model that answers the question: *"What sorts of people were more likely to survive?"* In this lab we apply Exploratory Data Analysis (EDA) and Feature Engineering techniques to prepare the data for that model.


<a id='2'></a>

## 2. Dataset Description

The dataset is provided by Kaggle and contains information about the passengers aboard the Titanic.

**Features:**

- **PassengerId**: Unique ID for each passenger.
- **Survived**: Survival status (0 = No, 1 = Yes).
- **Pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
- **Name**: Passenger's name.
- **Sex**: Passenger's sex.
- **Age**: Passenger's age in years.
- **SibSp**: Number of siblings/spouses aboard.
- **Parch**: Number of parents/children aboard.
- **Ticket**: Ticket number.
- **Fare**: Passenger fare.
- **Cabin**: Cabin number.
- **Embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).


<a id='3'></a>

## 3. Exploratory Data Analysis (EDA)

### 3.1 Data Loading and Overview

In [17]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For displaying plots inline
%matplotlib inline

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
train_df = pd.read_csv('/work/resources/train.csv')
test_df = pd.read_csv('/work/resources/test.csv')

# Display first few rows
train_df.head()

Get basic information about the dataset:

In [18]:
train_df.info()

In [19]:
train_df.describe()

Check for missing values:

In [20]:
train_df.isnull().sum()

**Observation:**

- **Age**: 177 missing values.
- **Cabin**: 687 missing values.
- **Embarked**: 2 missing values.


In [21]:
test_df.info()

In [22]:
test_df.describe()

In [23]:
test_df.isnull().sum()

### 3.2 Data Cleaning

**Handling Missing Values:**

- **Cabin**: About 77% of the values are missing (687 of 891 rows). We will either drop this column or derive a new feature indicating whether a cabin was recorded for the passenger.
- **Age**: 177 values are missing (about 20%). We will impute them in Section 4.1.
- **Embarked**: 2 values are missing. We will impute them with the mode.
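
The percentages quoted above can be computed directly rather than by hand. The following is a minimal sketch on a hypothetical toy frame (`toy_df` and its values are assumptions for illustration, not the Titanic data); the same expression should work on `train_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df (illustrative values only)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() yields booleans, so the column mean is the fraction of missing values
missing_pct = toy_df.isnull().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))
```

Sorting the result makes it easy to see which columns need imputation and which are candidates for dropping.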

### 3.3 Univariate Analysis

#### Survived (class label)

In [24]:
# Plot the distribution of survival
sns.countplot(x='Survived', data=train_df)
plt.title('Distribution of Survival')
plt.show()

**Observation:**

- Approximately 38% survived and 62% did not survive.

#### Age

In [25]:
# Histogram of Age
train_df['Age'].hist(bins=30, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Distribution of Age')
plt.show()

**Observation:**

- Age is roughly bell-shaped but right-skewed, with a peak in the twenties and a longer tail toward older ages.

### Task: Analyze other features

In [26]:
# Get numerical columns
numerical_cols = train_df.select_dtypes(include=[np.number]).columns

# Plot histogram of numerical columns
for col in numerical_cols:
    train_df[col].hist(bins=30, edgecolor='black')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.title('Distribution of ' + col)
    plt.show()

In [27]:
df_categorical = train_df.select_dtypes(include=['object'])
df_categorical

In [28]:
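# Plot a count plot for each low-cardinality categorical column.
# Name, Ticket and Cabin have hundreds of distinct values, so their count plots would be unreadable.
for col in df_categorical.columns:
    if df_categorical[col].nunique() > 10:
        continue
    sns.countplot(x=col, data=train_df)
    plt.title('Count plot for ' + col)
    plt.show()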

### 3.4 Bivariate Analysis

#### Survival by Sex

In [29]:
# Survival rate by Sex
sns.barplot(x='Sex', y='Survived', data=train_df)
plt.title('Survival Rate by Sex')
plt.show()

**Observation:**

- Females have a significantly higher survival rate than males.

#### Survival by Pclass

In [30]:
# Survival rate by Pclass
sns.barplot(x='Pclass', y='Survived', data=train_df)
plt.title('Survival Rate by Passenger Class')
plt.show()

**Observation:**

- Higher survival rate in 1st class, lower in 3rd class.
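
The bar plots above display the mean of `Survived` per group, so the same information can be read as a table. A minimal sketch, assuming a hypothetical toy frame (`toy_df` is illustrative and does not reproduce the real survival rates):

```python
import pandas as pd

# Hypothetical toy data with the same column names as train_df
toy_df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 3, 1, 3, 2],
    "Survived": [1, 1, 0, 1, 0, 1],
})

# Survived is coded 0/1, so its group mean is the survival rate of that group
rate_by_sex = toy_df.groupby("Sex")["Survived"].mean()

# Reporting the group size next to the rate shows how much data backs each estimate
rate_by_class = toy_df.groupby("Pclass")["Survived"].agg(["mean", "count"])

print(rate_by_sex)
print(rate_by_class)
```

Keeping the `count` column next to the rate helps avoid over-interpreting groups with very few passengers.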

### Task: Analyze other features

#### Age Distribution by Survival

In [31]:
# KDE plot of the Age distribution by Survival
# Age is mostly recorded in whole years, but in practice it is treated as a continuous variable,
# so a kernel density estimate (KDE) can approximate its probability density function (PDF).
sns.kdeplot(train_df.loc[train_df['Survived'] == 0, 'Age'], label='Did Not Survive')
sns.kdeplot(train_df.loc[train_df['Survived'] == 1, 'Age'], label='Survived')
plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Age Distribution by Survival')
plt.legend()
plt.show()

### Task: Analyze other features


### 3.5 Correlation Analysis 

#### 3.5.1 Pearson Correlation Coefficient (linear pairwise relationship)


In [32]:
# Compute the correlation matrix
corr_matrix = train_df.corr(method='pearson')
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

### Task: Why does the cell above raise an error?
Fix it.

In [24]:
# Select only numerical columns
numerical_df = train_df.select_dtypes(include=[np.number])

# Compute the correlation matrix
corr_matrix = numerical_df.corr(method='pearson')

# Plot the heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

### Task: Implement the Spearman correlation coefficient
What is the main difference from the Pearson correlation coefficient?

What observations can you make from the correlation matrix?

In [25]:
numerical_df = train_df.select_dtypes(include=[np.number])
corr_matrix = numerical_df.corr(method='spearman')
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Spearman Correlation Matrix')
plt.show()

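Pearson correlation measures the strength of a linear relationship between two variables. Spearman correlation is computed on the ranks of the values, so it measures how well the relationship is described by any monotonic function, whether linear or not.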
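
A small synthetic example may make the difference concrete. The sketch below uses made-up data (`demo`, with `y = exp(x)`), chosen because the relationship is perfectly monotonic but far from linear; none of it comes from the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: y always increases with x, but not along a straight line
x = np.arange(1, 11)
demo = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson works on the raw values, Spearman on their ranks
pearson_r = demo["x"].corr(demo["y"], method="pearson")
spearman_r = demo["x"].corr(demo["y"], method="spearman")

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Because the ranks of `x` and `y` are identical, Spearman reports a perfect association, while Pearson is noticeably lower since the points do not lie on a line.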


In [26]:
# Plot correlation between Pclass and Fare
sns.scatterplot(x='Pclass', y='Fare', data=train_df)

### Homework task: Apply a statistical test to analyze whether two categorical variables are independent. Check the class material for this unit.

In [33]:
from scipy.stats import chi2_contingency
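
One possible approach to this task is the chi-square test of independence on a contingency table. The sketch below is an illustration on hypothetical toy data (`toy_df` is an assumption, not the Titanic data); on the real dataset the table would be built from two columns of `train_df`, for example `Sex` and `Survived`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 10 women (8 survived) and 10 men (2 survived)
toy_df = pd.DataFrame({
    "Sex": ["female"] * 10 + ["male"] * 10,
    "Survived": [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8,
})

# Contingency table of observed counts for the two categorical variables
contingency = pd.crosstab(toy_df["Sex"], toy_df["Survived"])

# Null hypothesis: the two variables are independent.
# For 2x2 tables SciPy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(contingency)

print(contingency)
print(f"chi2={chi2:.3f}, p-value={p_value:.4f}, dof={dof}")
```

A p-value below the chosen significance level (commonly 0.05) is evidence against independence. The test assumes reasonably large expected counts in every cell, which is worth checking in the `expected` array.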

<a id='4'></a>

## 4. Feature Engineering

### 4.1 Handling Missing Values

#### Age Imputation

In [39]:
# Impute a missing age with the median age of the passenger's (Pclass, Sex) group.
# Relies on `median_ages`, which is computed below before the function is applied.
def impute_age(row):
    if pd.isnull(row['Age']):
        return median_ages.loc[(row['Pclass'], row['Sex'])]
    return row['Age']

# Convert 'Sex' to numeric
train_df['Sex'] = train_df['Sex'].map({'female': 1, 'male': 0})
test_df['Sex'] = test_df['Sex'].map({'female': 1, 'male': 0})

# Calculate median ages
median_ages = train_df.groupby(['Pclass', 'Sex'])['Age'].median()

# Impute Age
train_df['Age'] = train_df.apply(impute_age, axis=1)
test_df['Age'] = test_df.apply(impute_age, axis=1)

print("Train", train_df['Age'].isnull().sum())
print("Test", test_df['Age'].isnull().sum())
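
Row-wise `apply` is easy to read but slow on larger tables. As an alternative, the same group-median imputation can be written with a join. This is a sketch on hypothetical toy frames (`toy_train`, `toy_test`, the helper `impute_with_group_median` and the temporary column `GroupMedianAge` are all names invented for this illustration); as in the cell above, the medians come from the training data only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with the same column names as the notebook
toy_train = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})
toy_test = pd.DataFrame({"Pclass": [1, 3], "Sex": [0, 1], "Age": [np.nan, 18.0]})

# Median age per (Pclass, Sex) group, computed on the training data only
group_medians = (
    toy_train.groupby(["Pclass", "Sex"])["Age"].median().rename("GroupMedianAge")
)

def impute_with_group_median(df, medians):
    """Return a copy of df whose missing Age values are filled with the group median."""
    out = df.join(medians, on=["Pclass", "Sex"])
    out["Age"] = out["Age"].fillna(out["GroupMedianAge"])
    return out.drop(columns="GroupMedianAge")

filled_train = impute_with_group_median(toy_train, group_medians)
filled_test = impute_with_group_median(toy_test, group_medians)
print(filled_train)
print(filled_test)
```

Reusing the training medians for the test set keeps both sets on the same imputation rule and avoids using test data to derive statistics.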

#### Embarked Imputation

Impute missing **Embarked** values with the mode.

In [38]:
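# Fill missing Embarked with the mode of the training set.
# Assigning the result back avoids the chained `inplace=True` pattern, which newer pandas versions warn about.
embarked_mode = train_df['Embarked'].mode()[0]
train_df['Embarked'] = train_df['Embarked'].fillna(embarked_mode)
test_df['Embarked'] = test_df['Embarked'].fillna(embarked_mode)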

print("Train", train_df['Embarked'].isnull().sum())
print("Test", test_df['Embarked'].isnull().sum())

#### Cabin Feature

Create a new boolean feature **HasCabin** indicating whether a cabin was recorded for the passenger (True or False).

In [45]:
# Create HasCabin feature with the same boolean type in both sets
train_df['HasCabin'] = train_df['Cabin'].notnull()
test_df['HasCabin'] = test_df['Cabin'].notnull()


# Drop Cabin column
train_df.drop('Cabin', axis=1, inplace=True)
test_df.drop('Cabin', axis=1, inplace=True)

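### Task: After creating the new feature, check whether the type of the new HasCabin feature is correct.

In [44]:
# Both columns should report the same dtype
print("Train", train_df['HasCabin'].dtype)
print("Test", test_df['HasCabin'].dtype)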

### 4.2 Feature Creation

#### Title Extraction

In [None]:
# Extract titles
# A raw string keeps `\.` as a literal dot in the regex and avoids an invalid-escape warning
train_df['Title'] = train_df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test_df['Title'] = test_df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Map titles
title_mapping = {
    "Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 5,
    "Col": 5, "Major": 5, "Mlle": 2, "Countess": 5, "Ms": 2, "Lady": 5,
    "Jonkheer": 5, "Don": 5, "Dona": 5, "Mme": 3, "Capt": 5, "Sir": 5
}

train_df['Title'] = train_df['Title'].map(title_mapping)
test_df['Title'] = test_df['Title'].map(title_mapping)

# Titles that are not in the mapping become NaN; encode them as 0
train_df['Title'] = train_df['Title'].fillna(0)
test_df['Title'] = test_df['Title'].fillna(0)
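
To see what the regular expression captures, it can help to run it on a handful of names first. The sketch below uses a few sample name strings in the dataset's `Surname, Title. Given names` layout; the last entry and the reduced mapping `small_mapping` are hypothetical and only serve to show how an unmapped title ends up as 0.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout (last one is made up)
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Palsson, Master. Gosta Leonard",
    "Doe, Prof. Jane",
])

# Capture the first run of letters that follows a space and ends with a dot
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Reduced, hypothetical mapping; titles absent from it become NaN and then 0
small_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4}
title_codes = titles.map(small_mapping).fillna(0).astype(int)

print(pd.DataFrame({"Title": titles, "Code": title_codes}))
```

Running `value_counts()` on the extracted titles of the real data is a quick way to check whether the mapping covers every title that occurs.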

#### Family Size

Create a **FamilySize** feature by combining **SibSp** and **Parch**.

In [None]:
# Create FamilySize feature
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch'] + 1

#### IsAlone

Create an **IsAlone** feature.

In [None]:
# Create IsAlone feature
train_df['IsAlone'] = 0
train_df.loc[train_df['FamilySize'] == 1, 'IsAlone'] = 1

test_df['IsAlone'] = 0
test_df.loc[test_df['FamilySize'] == 1, 'IsAlone'] = 1

### 4.3 Feature Transformation

#### Fare Binning (discretization)

Binning, also known as discretization, is a data preprocessing technique where continuous numerical variables are converted into categorical variables by dividing them into intervals, or "bins."
This process simplifies the data and can help improve the performance of certain machine learning algorithms, especially those that are sensitive to the scale or distribution of the data.

#### When to Use Binning
- **Dealing with outliers**: when the data contains outliers that could skew the analysis.
- **Preparing data for certain algorithms**: some machine learning models work better with categorical variables.
- **Simplifying complex data**: to make complex data more understandable and interpretable.

In [None]:
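# Fill missing Fare in the test dataset with the median fare of the training set.
# Assigning the result back avoids the chained `inplace=True` pattern.
test_df['Fare'] = test_df['Fare'].fillna(train_df['Fare'].median())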

# Fare Binning
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4, labels=[1, 2, 3, 4])
test_df['FareBand'] = pd.qcut(test_df['Fare'], 4, labels=[1, 2, 3, 4])

#### Age Binning

In [None]:
# Age Binning
train_df['AgeBand'] = pd.cut(train_df['Age'], 5, labels=[1, 2, 3, 4, 5])
test_df['AgeBand'] = pd.cut(test_df['Age'], 5, labels=[1, 2, 3, 4, 5])
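
Note that calling `pd.qcut` or `pd.cut` separately on the training and test sets derives different bin edges for each, so the same band label can mean different value ranges in the two sets. One way to keep them aligned is to learn the edges on the training data and reuse them for the test data. The sketch below uses hypothetical toy fares (`toy_train_fare`, `toy_test_fare`); opening the outer edges to infinity is a design choice made here so that unseen values outside the training range still receive a band.

```python
import numpy as np
import pandas as pd

# Hypothetical toy fares; the test set contains values outside the training range
toy_train_fare = pd.Series([5.0, 7.0, 8.0, 10.0, 15.0, 30.0, 60.0, 200.0])
toy_test_fare = pd.Series([3.0, 9.0, 500.0])

labels = [1, 2, 3, 4]

# Learn quartile edges on the training data and keep them (retbins=True)
train_band, edges = pd.qcut(toy_train_fare, 4, labels=labels, retbins=True)

# Open the outer edges so values below or above the training range still fall in a bin
edges[0], edges[-1] = -np.inf, np.inf

# Apply the training edges to the test data
test_band = pd.cut(toy_test_fare, bins=edges, labels=labels)

print(train_band.tolist())
print(test_band.tolist())
```

The same pattern applies to `AgeBand`: use `pd.cut(..., retbins=True)` on the training ages and pass the returned edges as `bins` for the test ages.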

### 4.4 Feature Encoding

Drop unnecessary columns and encode categorical variables.

#### Drop Unnecessary Columns

In [None]:
# Drop unnecessary columns
train_df.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Age'], axis=1, inplace=True)
test_df.drop(['Name', 'Ticket', 'Fare', 'Age'], axis=1, inplace=True)

#### Convert Categorical Variables

In [None]:
# Encode Embarked
train_df['Embarked'] = train_df['Embarked'].map({'S': 1, 'C': 2, 'Q': 3})
test_df['Embarked'] = test_df['Embarked'].map({'S': 1, 'C': 2, 'Q': 3})

# Fill any remaining missing values with 1 (Southampton, the most frequent port)
train_df['Embarked'] = train_df['Embarked'].fillna(1)
test_df['Embarked'] = test_df['Embarked'].fillna(1)
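
Mapping the ports to 1, 2 and 3 implies an order (S < C < Q) that the ports do not actually have, and linear models such as logistic regression will treat the codes as magnitudes. One-hot encoding is a common alternative for nominal variables. A minimal sketch on a hypothetical toy frame (`toy_df` is illustrative only):

```python
import pandas as pd

# Hypothetical toy frame with the raw port letters
toy_df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(toy_df, columns=["Embarked"], prefix="Embarked", dtype=int)

print(one_hot)
```

Each row has exactly one indicator set to 1. With linear models, passing `drop_first=True` to `pd.get_dummies` removes one redundant column.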

### 4.5 Feature Scaling

Scaling is not required here: after binning and encoding, every remaining feature is either a categorical code or a small integer within a fixed range.
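
If the continuous **Age** and **Fare** columns had been kept instead of their binned versions, scaling would matter for models that are sensitive to feature magnitude. The sketch below shows standardization with scikit-learn on hypothetical toy values (the arrays `toy_train` and `toy_test` are assumptions for illustration); the scaler is fitted on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values: columns are [Age, Fare]
toy_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.93], [35.0, 53.10]])
toy_test = np.array([[30.0, 8.05]])

scaler = StandardScaler()

# Learn mean and standard deviation on the training data only
train_scaled = scaler.fit_transform(toy_train)

# Reuse the training statistics for the test data
test_scaled = scaler.transform(toy_test)

print(train_scaled.round(2))
print(test_scaled.round(2))
```

After the transformation each training column has zero mean and unit variance, so Age and Fare contribute on a comparable scale.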

### 4.6 Feature Selection

#### Correlation with Survived

In [None]:
# Compute correlation with Survived
corr = train_df.corr()
corr['Survived'].sort_values(ascending=False)

#### Recursive Feature Elimination (RFE)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

# Separate features and target variable
X = train_df.drop('Survived', axis=1)
y = train_df['Survived']

# Initialize logistic regression model
logreg = LogisticRegression(max_iter=200)

# Initialize RFE
selector = RFE(logreg, n_features_to_select=8)

# Fit RFE
selector = selector.fit(X, y)

# Get selected features
selected_features = X.columns[selector.support_]
print('Selected features:', selected_features)


**Observation:**

- The selected features will be used for modeling.
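
Besides the boolean mask in `support_`, a fitted `RFE` object exposes `ranking_`, which shows the order in which features were eliminated. The sketch below runs on synthetic data generated with `make_classification` (the names `X_demo`, `y_demo` and `demo_selector` are assumptions for this illustration, not part of the notebook's pipeline).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification problem with 6 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# RFE repeatedly fits the model and drops the weakest feature until 3 remain
demo_selector = RFE(LogisticRegression(max_iter=200), n_features_to_select=3)
demo_selector.fit(X_demo, y_demo)

# Rank 1 marks a selected feature; higher ranks were eliminated earlier
print("Selected mask:", demo_selector.support_)
print("Ranking:", demo_selector.ranking_)
```

Inspecting the ranking can help judge whether a dropped feature was a close call, which is useful when choosing `n_features_to_select`.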

<a id='5'></a>

## 5. Conclusion

In this case study, we performed an extensive Exploratory Data Analysis (EDA) and Feature Engineering on the Titanic dataset. We:

- **Analyzed** the distribution of variables and their relationship with survival.
- **Handled missing values** using appropriate imputation techniques.
- **Created new features** such as **Title**, **FamilySize**, and **IsAlone** to capture additional information.
- **Transformed features** by binning continuous variables.
- **Encoded categorical variables** into numerical format.
- **Selected significant features** using correlation analysis and Recursive Feature Elimination (RFE).

These steps prepared the data for modeling, which would involve training machine learning algorithms to predict passenger survival.

<a id='6'></a>

## 6. References

1. Kaggle Titanic Competition: [https://www.kaggle.com/c/titanic](https://www.kaggle.com/c/titanic)
2. Géron, A. (2019). *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*. O'Reilly Media.
3. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning*. Springer.
4. McKinney, W. (2017). *Python for Data Analysis*. O'Reilly Media.
