## **Predictive Analysis Plan for the Titanic Dataset**  

1. Define the Problem Statement
2. Data Preparation for Modeling
3. Model Selection
4. Model Training & Evaluation
5. Hyperparameter Tuning
6. Model Comparison & Selection
7. Final Model & Predictions
8. Model Interpretation & Explainability
9. Deployment (Optional)


### **1. Define the Problem Statement**  
- Goal: Predict passenger survival (`Survived` column) using available features.  
- Type of problem: Binary Classification (0 = Not Survived, 1 = Survived).  

The goal is to build a machine learning model that predicts whether a passenger survived the Titanic disaster based on various features such as age, sex, passenger class, and fare.

* Objective:
  - Input: Passenger details (e.g., Age, Sex, Pclass, etc.)
  - Output: Binary prediction, 0 (Did Not Survive) or 1 (Survived)


* Key Challenges to Address
  - Imbalanced classes (more non-survivors than survivors).
  - Presence of missing values (e.g., Cabin, Embarked).
  - Potentially correlated features (e.g., Pclass and Fare).
  - Non-linear relationships that might require feature engineering.

* Success Criteria (Metrics)
  - Primary Metric: F1 Score (balances precision & recall).
  - Secondary Metrics: Accuracy, ROC-AUC score, and Confusion Matrix for detailed insights.
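
The metrics listed above can be computed with scikit-learn once a model produces predictions. The following is a minimal sketch on made-up values; the arrays `y_true`, `y_pred`, and `y_score` are hypothetical placeholders, not results from this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    roc_auc_score,
)

# Hypothetical labels and model outputs, for illustration only.
y_true = [0, 0, 0, 1, 1, 1]               # actual Survived values
y_pred = [0, 0, 1, 1, 1, 0]               # hard class predictions
y_score = [0.1, 0.3, 0.6, 0.8, 0.7, 0.4]  # predicted probability of class 1

f1 = f1_score(y_true, y_pred)             # primary metric
accuracy = accuracy_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_score)  # uses scores, not hard predictions
cm = confusion_matrix(y_true, y_pred)     # rows: actual class, columns: predicted class

print(f"F1: {f1:.3f}  Accuracy: {accuracy:.3f}  ROC-AUC: {roc_auc:.3f}")
print(cm)
```

Note that ROC-AUC is computed from scores or probabilities, while F1, accuracy, and the confusion matrix need hard class predictions.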

In [1]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import time

In [2]:
test_df = pd.read_csv("../../data/cleaned/1-dropna/cleaned_testing.csv")
df = pd.read_csv("../../data/cleaned/1-dropna/cleaned_training.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [3]:
# check data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Name         712 non-null    object 
 4   Sex          712 non-null    object 
 5   Age          712 non-null    float64
 6   SibSp        712 non-null    int64  
 7   Parch        712 non-null    int64  
 8   Ticket       712 non-null    object 
 9   Fare         712 non-null    float64
 10  Embarked     712 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 61.3+ KB
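
Two of the challenges named in section 1, class imbalance and correlated features, can be checked with a couple of pandas calls. The sketch below uses a tiny hypothetical frame (`toy`) that only borrows the column names of the training data; the same calls should work on `df`.

```python
import pandas as pd

# Hypothetical miniature frame with the same column names as the training data.
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1],
    "Pclass": [3, 3, 2, 1, 1],
    "Fare": [7.25, 8.05, 13.0, 53.1, 71.28],
})

# Share of each class; a strong skew argues for stratified splits and F1 over accuracy.
class_share = toy["Survived"].value_counts(normalize=True)

# Rank correlation copes with the ordinal Pclass and the skewed Fare.
pclass_fare_corr = toy[["Pclass", "Fare"]].corr(method="spearman").loc["Pclass", "Fare"]

print(class_share)
print(f"Spearman correlation (Pclass, Fare): {pclass_fare_corr:.2f}")
```

A strongly negative value would indicate that `Pclass` and `Fare` carry overlapping information, which matters mostly for linear models.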


### **2. Data Preparation for Modeling**
- **Feature Selection & Engineering**
  - Drop irrelevant features (`Name`, `Ticket` as they may not contribute).
  - Encode categorical variables (`Sex`, `Embarked`, `Deck`).
  - Convert `Fare` and `Age` into binned categories (if needed).
  - Create new features (e.g., `FamilySize = SibSp + Parch + 1`).


In [4]:
test_passenger_id = test_df.PassengerId
test_df.drop(['Name', 'Ticket'], axis=1, inplace=True)
train_passenger_id = df.PassengerId
df.drop(['Name', 'Ticket'], axis=1, inplace=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,male,22.0,1,0,7.25,S
1,2,1,1,female,38.0,1,0,71.2833,C
2,3,1,3,female,26.0,0,0,7.925,S
3,4,1,1,female,35.0,1,0,53.1,S
4,5,0,3,male,35.0,0,0,8.05,S


We'll encode the object-typed columns. `Sex` and `Embarked` are both nominal, so one-hot encoding is appropriate.
Because `Sex` is binary in this dataset, we could label-encode it instead, but that would turn it into a numeric column. One-hot encoding yields a boolean column, which better reflects that the values have no order.

In [5]:
# One-hot-encoding
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
test_df = pd.get_dummies(test_df, columns=['Sex', 'Embarked'], drop_first=True)
# drop_first reduces multicollinearity by dropping one column from each encoded category.

In [6]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,892,3,34.5,0,0,7.8292,True,True,False
1,893,3,47.0,1,0,7.0,False,False,True
2,894,2,62.0,0,0,9.6875,True,True,False
3,895,3,27.0,0,0,8.6625,True,False,True
4,896,3,22.0,1,1,12.2875,False,False,True
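
Calling `pd.get_dummies` separately on each file gives matching columns here, but only because every category happens to appear in both files. If the test file lacked one port, for example, `drop_first=True` would drop a different column and the encodings would silently disagree. One way to guard against this, sketched below under the assumption that the category lists in `CATEGORIES` are complete, is to fix the categories before encoding. The helper `one_hot` and the toy frames are hypothetical.

```python
import pandas as pd

# Assumed full category lists; adjust if the data contains other values.
CATEGORIES = {"Sex": ["female", "male"], "Embarked": ["C", "Q", "S"]}


def one_hot(frame: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode with fixed categories so the output columns never vary."""
    frame = frame.copy()
    for column, values in CATEGORIES.items():
        frame[column] = pd.Categorical(frame[column], categories=values)
    return pd.get_dummies(frame, columns=list(CATEGORIES), drop_first=True)


# Hypothetical frames: the second one contains no 'female', 'C', or 'Q' rows.
train_toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})
test_toy = pd.DataFrame({
    "Pclass": [3, 2],
    "Sex": ["male", "male"],
    "Embarked": ["S", "S"],
})

train_encoded = one_hot(train_toy)
test_encoded = one_hot(test_toy)
print(list(test_encoded.columns))
```

Because the categories are declared up front, both frames get the same dummy columns in the same order, even for values that never occur in one of them.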


In [7]:
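# label encoding (alternative to one-hot encoding, kept for reference; not used)
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# df['Sex'] = le.fit_transform(df['Sex_male'])  # False (female) → 0, True (male) → 1
# test_df['Sex'] = le.transform(test_df['Sex_male'])  # reuse the encoder fitted on the training data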

In [8]:
test_df.head(3)

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,892,3,34.5,0,0,7.8292,True,True,False
1,893,3,47.0,1,0,7.0,False,False,True
2,894,2,62.0,0,0,9.6875,True,True,False


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Age          712 non-null    float64
 4   SibSp        712 non-null    int64  
 5   Parch        712 non-null    int64  
 6   Fare         712 non-null    float64
 7   Sex_male     712 non-null    bool   
 8   Embarked_Q   712 non-null    bool   
 9   Embarked_S   712 non-null    bool   
dtypes: bool(3), float64(2), int64(5)
memory usage: 41.2 KB


Let's check correlation of fare and deck to see if we can group deck values, segregate them into classes.

Binning `Age` and `Fare`.
Binning can reduce noise, improve interpretability, and capture non-linear or threshold effects. It can also make model outputs more actionable, provided the relationships and data distributions call for it.

In [10]:
fare_bins = [0, 10, 50, df['Fare'].max()]
fare_labels = [0, 1, 2]  # Low, Medium, High
test_df['Fare_Bin'] = pd.cut(test_df['Fare'], bins=fare_bins, labels=fare_labels, include_lowest=True)
df['Fare_Bin'] = pd.cut(df['Fare'], bins=fare_bins, labels=fare_labels, include_lowest=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S,Fare_Bin
0,1,0,3,22.0,1,0,7.25,True,False,True,0
1,2,1,1,38.0,1,0,71.2833,False,False,False,2
2,3,1,3,26.0,0,0,7.925,False,False,True,0
3,4,1,1,35.0,1,0,53.1,False,False,True,2
4,5,0,3,35.0,0,0,8.05,True,False,True,0


In [11]:
age_bins = [0, 12, 19, 59, df['Age'].max()]
age_labels = [0, 1, 2, 3]  # Child, Teen, Adult, Senior
test_df['Age_Bin'] = pd.cut(test_df['Age'], bins=age_bins, labels=age_labels)
df['Age_Bin'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S,Fare_Bin,Age_Bin
0,1,0,3,22.0,1,0,7.25,True,False,True,0,2
1,2,1,1,38.0,1,0,71.2833,False,False,False,2,2
2,3,1,3,26.0,0,0,7.925,False,False,True,0,2
3,4,1,1,35.0,1,0,53.1,False,False,True,2,2
4,5,0,3,35.0,0,0,8.05,True,False,True,0,2
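
One caveat with the bin edges above: the last edge is the training maximum, so any value in another dataset that exceeds it falls outside every bin and `pd.cut` returns `NaN` for it. The test file used here is unaffected (its binned columns have no nulls), but an open-ended last bin avoids the dependency. A minimal sketch with hypothetical fares:

```python
import numpy as np
import pandas as pd

labels = [0, 1, 2]  # Low, Medium, High
train_fares = pd.Series([7.25, 30.0, 263.0])  # hypothetical training fares
new_fares = pd.Series([5.0, 30.0, 512.33])    # hypothetical unseen fares; the last exceeds the training max

# Edges capped at the training maximum: the largest unseen fare falls outside every bin.
capped_bins = [0, 10, 50, train_fares.max()]
capped = pd.cut(new_fares, bins=capped_bins, labels=labels, include_lowest=True)

# Open-ended last bin: every non-negative fare receives a label.
open_bins = [0, 10, 50, np.inf]
open_ended = pd.cut(new_fares, bins=open_bins, labels=labels, include_lowest=True)

print(capped.isna().sum(), open_ended.isna().sum())
```

The same idea applies to the age bins: replacing `df['Age'].max()` with `np.inf` keeps the "Senior" bin open-ended.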


In [12]:
df.Fare_Bin.value_counts()

Fare_Bin
1    340
0    236
2    136
Name: count, dtype: int64

In [13]:
test_df.drop(['Age', 'Fare'], inplace=True, axis=1)
df.drop(['Age', 'Fare'], inplace=True, axis=1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,SibSp,Parch,Sex_male,Embarked_Q,Embarked_S,Fare_Bin,Age_Bin
0,1,0,3,1,0,True,False,True,0,2
1,2,1,1,1,0,False,False,False,2,2
2,3,1,3,0,0,False,False,True,0,2
3,4,1,1,1,0,False,False,True,2,2
4,5,0,3,0,0,True,False,True,0,2


Some feature engineering on `SibSp` and `Parch`.

In [14]:
df['Family_Size'] = df['SibSp'] + df['Parch'] + 1  # Add 1 to include the passenger themselves
test_df['Family_Size'] = test_df['SibSp'] + test_df['Parch'] + 1

# Categorize family size using the same bin edges for both sets.
# np.inf keeps the last bin open-ended, so the edges do not depend on either set's maximum.
family_bins = [0, 1, 4, 7, np.inf]
family_labels = [0, 1, 2, 3]  # Solo, Small, Medium, Large

df['Family_Group'] = pd.cut(df['Family_Size'], bins=family_bins, labels=family_labels)
test_df['Family_Group'] = pd.cut(test_df['Family_Size'], bins=family_bins, labels=family_labels)


test_df.drop(['SibSp', 'Parch', 'Family_Size'], inplace=True, axis=1)
df.drop(['SibSp', 'Parch', 'Family_Size'], inplace=True, axis=1)
test_df.head()

Unnamed: 0,PassengerId,Pclass,Sex_male,Embarked_Q,Embarked_S,Fare_Bin,Age_Bin,Family_Group
0,892,3,True,True,False,0,2,0
1,893,3,False,False,True,0,2,1
2,894,2,True,True,False,0,3,0
3,895,3,True,False,True,0,2,0
4,896,3,False,False,True,1,2,1


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   PassengerId   712 non-null    int64   
 1   Survived      712 non-null    int64   
 2   Pclass        712 non-null    int64   
 3   Sex_male      712 non-null    bool    
 4   Embarked_Q    712 non-null    bool    
 5   Embarked_S    712 non-null    bool    
 6   Fare_Bin      712 non-null    category
 7   Age_Bin       712 non-null    category
 8   Family_Group  712 non-null    category
dtypes: bool(3), category(3), int64(3)
memory usage: 21.5 KB


In [16]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331 entries, 0 to 330
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   PassengerId   331 non-null    int64   
 1   Pclass        331 non-null    int64   
 2   Sex_male      331 non-null    bool    
 3   Embarked_Q    331 non-null    bool    
 4   Embarked_S    331 non-null    bool    
 5   Fare_Bin      331 non-null    category
 6   Age_Bin       331 non-null    category
 7   Family_Group  331 non-null    category
dtypes: bool(3), category(3), int64(2)
memory usage: 7.8 KB


In [18]:
# saving modelling ready data
df.to_csv("../../data/processed/1-dropna/training_model_ready.csv", index=False)
test_df.to_csv("../../data/processed/1-dropna/testing_model_ready.csv", index=False) 
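
CSV files do not store dtypes, so the `category` columns created above come back as plain integers when the saved files are reloaded. If the categorical dtype matters for a later step, it can be restored after loading. The sketch below demonstrates the round trip with an in-memory buffer and a hypothetical three-row frame instead of the real files.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical frame with one binned column, built the same way as Fare_Bin.
frame = pd.DataFrame({
    "Fare_Bin": pd.cut([5.0, 30.0, 80.0], bins=[0, 10, 50, 100], labels=[0, 1, 2]),
})

# Round trip through CSV (an in-memory buffer stands in for a file on disk).
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)  # Fare_Bin is now int64

# Restore the ordered categorical dtype that pd.cut produced.
bin_dtype = CategoricalDtype(categories=[0, 1, 2], ordered=True)
restored = reloaded.astype({"Fare_Bin": bin_dtype})

print(reloaded["Fare_Bin"].dtype, "->", restored["Fare_Bin"].dtype)
```

Many tree-based models work fine with the integer codes, so this step is only needed when downstream code relies on the `category` dtype.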