# Titanic Survival Prediction: 01 - Initial data exploration
*Date: 2025-06-25*
*Author: Jonas Lilletvedt*

--- 

## 1. Objective

In this notebook we perform an initial exploratory data analysis (EDA) on the training data. We want to:
*   Understand the data structure, variables, and statistical properties of the dataset.
*   Identify and quantify data quality problems, including NaNs, outliers, and implausible values.
*   Lay out a clear path for data cleaning and feature engineering in the following notebooks.

## 2. Data source

The data used in this notebook is the `train.csv` file from the [Kaggle "Titanic - Machine Learning from Disaster" competition](https://www.kaggle.com/competitions/titanic).

## 3. Plan for the initial exploration

We will proceed in the following steps:
1.  **Setup:** Import the necessary libraries and load the raw data.
2.  **Initial inspection:** A high-level overview using built-in pandas functions such as `.info()`, `.head()` and `.describe()`.
3.  **Data visualization (EDA):**
    *   Analyze the target variable (`Survived`).
    *   Analyze the individual features (univariate analysis).
    *   Analyze the relationship between features and the target (bivariate analysis).
4.  **Summary and next steps:** Document key findings and define a clear plan for further notebooks.


## 1: Setup and loading

In [23]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [24]:
# Load the data
df_train = pd.read_csv('../data/01_raw/train.csv')

# Show the five first rows
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## 2: Initial Data Inspection

We start with a high-level overview of the training data to understand its structure, identify quality issues, and review basic statistics.

### 2.1 Data structure and Null values (`.info()`)

We first use the `.info()` method to get a short summary of the training dataset. It shows the column names and data types, as well as the non-null count for each column, from which the number of missing values follows.

In [25]:
# Display data types and non-null counts for the columns
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


**Observations from `.info()`:**

An initial review of the 891 entries using `.info()` reveals three columns with data quality issues that require cleaning.
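
Since `.info()` reports non-null counts, the missing-value counts have to be derived. A minimal sketch of how this could be done with `isna()` is shown below. The helper name `missing_summary` and the toy frame are assumptions made for illustration; in the notebook the function would be applied to `df_train`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train (values are illustrative, not the real data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(1),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


print(missing_summary(toy))
```

Applied to `df_train`, this should reproduce the counts listed below (for example 891 - 714 = 177 missing `Age` values).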

--- 

**`Age` (Numeric)**
*   **Issue**: 177 missing values.
*   **Possible strategy**: **Impute with median** as a baseline. We will explore if more complex methods would provide a better result during model iteration.

**`Cabin` (Categorical)**
*   **Issue**: Severe data loss, with 687 of 891 values missing (about 77%).
*   **Possible strategy**: Impute using k-nearest neighbours, or drop the whole column if it does not add any value to the prediction of the target. `Pclass` could be a plausible indicator for cabin.

**`Embarked` (Categorical)**
*   **Issue**: Two missing values.
*   **Possible strategy**: **Impute with mode**.
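
The baseline strategies for `Age` and `Embarked` can be sketched as follows. This is only an illustration on toy data, not the final cleaning code; the function name `impute_baseline` is an assumption. `Cabin` is left out because its strategy is still open.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train (values are illustrative, not the real data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 40.0],
    "Embarked": ["S", "C", np.nan, "S", "S"],
})


def impute_baseline(df: pd.DataFrame) -> pd.DataFrame:
    """Median-impute Age and mode-impute Embarked on a copy of df."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # mode() returns a Series (there can be ties); take the first entry
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    return out


imputed = impute_baseline(toy)
print(imputed)
```

When a separate test set is involved, the median and mode would normally be computed on the training data only and then reused, so that no information from the test set leaks into the features.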

### 2.2 Statistical Summary for Numerical and Categorical Features (`.describe()` and `.describe(include=['object'])`)

The next step is to examine descriptive statistics for the numerical and categorical columns.

In [26]:
df_train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


**Observations from Numerical Feature summary (`.describe()`)**

A statistical summary of numerical variables in the dataset.

---

*   **Survival rate**: The mean of the `Survived` column is 0.38, with no missing values, so 38% of passengers in the training data survived.
*   **Fare outliers**: The 75th percentile is far below the maximum `Fare` value (31.0 compared to 512.33). This indicates strong outliers that we will need to handle, possibly with an imputation technique.
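
To make the gap between the upper quartile and the maximum `Fare` concrete, one common option is Tukey's IQR rule. The sketch below is an assumption about how outliers could be flagged, not a method chosen in this notebook. It uses the quartiles reported by `.describe()` above; the helper name `iqr_upper_fence` and the toy fares are illustrative.

```python
import pandas as pd


def iqr_upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Upper Tukey fence: values above q3 + k * IQR are flagged as outliers."""
    return q3 + k * (q3 - q1)


# Quartiles of Fare as reported by .describe() above
fence = iqr_upper_fence(7.9104, 31.0)
print(round(fence, 4))  # 65.6344

# Toy fares (illustrative subset) to show flagging and capping
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 512.3292])
is_outlier = fares > fence
capped = fares.clip(upper=fence)  # one possible treatment: cap at the fence
print(is_outlier.sum(), capped.max())
```

Under this rule every fare above roughly 65.6 would count as an outlier, so the maximum of 512.33 lies far outside the fence.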



In [27]:
df_train.describe(include=['object'])

| | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Dooley, Mr. Patrick | male | 347082 | G6 | S |
| freq | 1 | 577 | 7 | 4 | 644 |


**Observations from Categorical Feature summary (`.describe(include=['object'])`)**

A summary of the categorical variables reveals key insights that dictate our feature engineering strategy.

--- 

*   **`Name`**
    *   **Observation**: As expected, `Name` has a **high cardinality**, with all 891 entries being unique. In its raw form the feature is unsuitable for direct use, as it would likely only introduce noise.
    *   **Hypothesis**: Despite its high cardinality, the `Name` feature is rich in implicit information that can be extracted from two parts of the string:
        *   **Titles**: Prefixes like *Mr.*, *Miss.*, *Mrs.* and *Master.* are strong proxies for sex, age and marital status. Other titles like *Dr.* or *Rev.* can indicate socioeconomic status and profession.
        *   **Surname**: The surname can be used to identify family groups that might not be fully captured by the `SibSp` and `Parch` features. We hypothesize that family members tended to stay together during the crisis and therefore had correlated survival outcomes.
    *   **Strategy**: Engineer two new features by decomposing the `Name` column.
        *   `title`: A categorical feature extracted from the name prefix. To limit over-fitting and noise, rare titles will be consolidated into a single 'Rare' category.
        *   `family_survival_rate`: A numerical feature representing the survival rate of a passenger's family unit.
            *   **Family identification**: A robust `Family_ID` will be created by combining the passenger's `surname` and `ticket_prefix`. This helps avoid merging unrelated families that share a surname but travel on different ticket types.
            *   **Leakage-proof calculation**: To avoid data leakage, the survival rate will be calculated for each passenger's family excluding the passenger themselves.
            *   **Imputation for solo travelers**: For passengers with no identifiable family members, this feature will be imputed with the overall survival rate of the dataset.

*   **`Sex` distribution**
    *   **Observation**: The dataset is imbalanced, containing **577 males (65%) and 314 females (35%)**.
    *   **Hypothesis**: Because women likely had a higher rescue priority, `Sex` is expected to be a primary indicator of `Survived`. We will validate this with visualizations.

*   **`Ticket`**
    *   **Issue**: The column has a very **high cardinality** (681 unique values), making the raw feature unsuitable for direct use.
    *   **Hypothesis**: Passengers with closely related ticket numbers might travel together or share other properties that could predict survival.
    *   **Strategy**: We will engineer new features based on the `Ticket` string and validate them against `Survived`:
        *   `ticket_group_size`: A numerical feature counting passengers on the same ticket.
        *   `ticket_prefix`: A categorical feature for any text-based prefix (possibly indicating economic class).

*   **`Cabin`**
    *   **Issue**: In addition to ~77% missing values, the column has a high cardinality (147 unique values among 204 entries). This poses the same obstacle as for `Ticket`.
    *   **Hypothesis**: A passenger's physical cabin location on the ship, encoded in the `Cabin` string, is a strong predictor of survival.
    *   **Strategy**: Deconstruct the `Cabin` string into vertical and horizontal location:
        *   `cabin_prefix`: A categorical feature for the text-based prefix (the deck, i.e. the vertical location).
        *   `cabin_zone`: A categorical feature from binned cabin numbers, representing the horizontal location on the deck.
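
The strategies for `Name`, `Ticket` and `Cabin` above can be sketched in a single helper. This is a rough illustration under several assumptions, not the final feature engineering code: the regular expressions, the `COMMON_TITLES` set, the `"NONE"` placeholder, the bin edges and zone labels, and the function name `engineer_features` are all hypothetical, and the toy rows use invented names.

```python
import pandas as pd

# Toy rows shaped like train.csv; names and values are invented for illustration
toy = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Doe, Master. Jim",
             "Roe, Miss. Ann", "Poe, Dr. Sam"],
    "Ticket": ["A/5 1001", "A/5 1001", "A/5 1002", "STON/O2. 2002", "3003"],
    "Cabin": [None, None, None, "C85", "E123"],
    "Survived": [0, 1, 1, 1, 0],
})

COMMON_TITLES = {"Mr", "Mrs", "Miss", "Master"}  # assumption: all others become "Rare"


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)

    # Name has the form "Surname, Title. Given names"
    title = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    out["title"] = title.where(title.isin(COMMON_TITLES), "Rare")
    surname = df["Name"].str.split(",").str[0].str.strip()

    # Ticket: optional text prefix followed by a number
    prefix = df["Ticket"].str.extract(r"^(.*)\s+\d+$", expand=False)
    out["ticket_prefix"] = prefix.fillna("NONE")
    out["ticket_group_size"] = df["Ticket"].map(df["Ticket"].value_counts())

    # Cabin: deck letter plus the first cabin number, binned into coarse zones
    out["cabin_prefix"] = df["Cabin"].str[0]
    cabin_number = pd.to_numeric(df["Cabin"].str.extract(r"(\d+)", expand=False))
    out["cabin_zone"] = pd.cut(cabin_number, bins=[0, 50, 100, float("inf")],
                               labels=["low", "mid", "high"])  # hypothetical bins

    # Leave-one-out family survival rate: exclude the passenger's own outcome
    out["Family_ID"] = surname + "_" + out["ticket_prefix"]
    grouped = df.groupby(out["Family_ID"])["Survived"]
    others = grouped.transform("count") - 1
    survived_others = grouped.transform("sum") - df["Survived"]
    rate = survived_others / others.where(others > 0)  # NaN for solo travelers
    out["family_survival_rate"] = rate.fillna(df["Survived"].mean())
    return out


features = engineer_features(toy)
print(features)
```

In the toy data, the three "Doe" passengers share a `Family_ID`, so each one's rate is computed from the other two, while the two solo travelers fall back to the overall survival rate. For unseen test data the rates would have to come from training labels only.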

## 3: Data Visualization