# 🧭 **Titanic: EDA, Processing & Experimentation**

### 🚀 **Setup**

1. [Import Dependencies](#1-import-dependencies)
2. [Set Input Directory](#2-set-input-directory)

### 📥 **Data Loading & Quick Checks**

3. [Load Training Data](#3-load-training-data)
4. [Peek & Missing Values](#4-peek--missing-values)
5. [Categorical Value Counts (Embarked)](#5-categorical-value-counts)

### 🧹 **Preprocessing & Feature Engineering**

6. [Impute & Encode](#6-impute--encode)
7. [Derived Features](#7-derived-features)
8. [Define X/y + Class Balance](#8-define-xy--class-balance)

### ⚖️ **Resampling & Split**

9. [SMOTE Oversampling](#9-smote-oversampling)
10. [Train/Test Split](#10-traintest-split)

### 🧠 **Modelling & Evaluation**

11. [Random Forest + Randomised Search](#11-random-forest--randomised-search)
12. [Evaluate Best Model](#12-evaluate-best-model)

## 🚀 **Setup**

### 🧩 **1. Import Dependencies <a id="1-import-dependencies"></a>**

In [1]:
# -------------------------------------------------------------------
# 📦 Core libraries
# -------------------------------------------------------------------
import os
import sys
import pandas as pd
import numpy as np

# -------------------------------------------------------------------
# 🧮 Modelling & evaluation
# -------------------------------------------------------------------
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# -------------------------------------------------------------------
# ⚖️ Imbalance handling
# -------------------------------------------------------------------
from imblearn.over_sampling import SMOTE

# -------------------------------------------------------------------
# 📊 Optional display settings (nicer tables in notebooks)
# -------------------------------------------------------------------
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

print("✅ Imports loaded.")

✅ Imports loaded.


### 📁 **2. Set Input Directory <a id="2-set-input-directory"></a>**

In [2]:
# -------------------------------------------------------------------
# 🗂️ Resolve project root safely in a notebook (simulate __file__ behavior)
# -------------------------------------------------------------------
# Notebooks have no __file__, so this assumes the kernel's working directory is "notebook/"
# and goes one level up to reach the repo root.
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))

# -------------------------------------------------------------------
# 📥 Path to training CSV inside artifacts/raw/
# -------------------------------------------------------------------
DATA_PATH = os.path.join(PROJECT_ROOT, "artifacts", "raw", "titanic_train.csv")

# -------------------------------------------------------------------
# 🧭 Sanity Check
# -------------------------------------------------------------------
print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_PATH   :", DATA_PATH)
print("✅ File exists:", os.path.exists(DATA_PATH))

PROJECT_ROOT: c:\Users\HP\OneDrive\Documents\Projects\MLOps\MLOps-Titanic-Survival
DATA_PATH   : c:\Users\HP\OneDrive\Documents\Projects\MLOps\MLOps-Titanic-Survival\artifacts\raw\titanic_train.csv
✅ File exists: True
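
The `..` step above only works when the kernel starts inside `notebook/`. A minimal sketch of a more defensive lookup follows, assuming the repo root is the nearest ancestor that contains an `artifacts/` folder. `find_project_root` is a hypothetical helper introduced for illustration, not part of the original project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "artifacts") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains '{marker}/'")


# Hypothetical usage from inside the notebook:
# PROJECT_ROOT = find_project_root(Path.cwd())
# DATA_PATH = PROJECT_ROOT / "artifacts" / "raw" / "titanic_train.csv"
```

Raising an error when nothing is found surfaces a wrong working directory immediately, instead of at `pd.read_csv` time.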


## 📥 **Data Loading & Quick Checks**

### 🧾 **3. Load Training Data <a id="3-load-training-data"></a>**

In [3]:
# -------------------------------------------------------------------
# 📥 Load the Titanic training dataset
# -------------------------------------------------------------------
titanic = pd.read_csv(DATA_PATH)

# Quick shape check
print(f"✅ Loaded titanic dataset: {titanic.shape[0]:,} rows x {titanic.shape[1]} columns")

✅ Loaded titanic dataset: 712 rows x 12 columns


### 👀 **4. Peek & Missing Values <a id="4-peek--missing-values"></a>**

In [4]:
# -------------------------------------------------------------------
# 🧭 Quick peek at the head and missing values snapshot
# -------------------------------------------------------------------
display(titanic.head())
display(titanic.isnull().sum().sort_values(ascending=False).to_frame("missing_count").T)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 332 | 0 | 1 | Partner, Mr. Austen | male | 45.5 | 0 | 0 | 113043 | 28.5 | C124 | S |
| 1 | 734 | 0 | 2 | Berriman, Mr. William John | male | 23.0 | 0 | 0 | 28425 | 13.0 | NaN | S |
| 2 | 383 | 0 | 3 | Tikkanen, Mr. Juho | male | 32.0 | 0 | 0 | STON/O 2. 3101293 | 7.925 | NaN | S |
| 3 | 705 | 0 | 3 | Hansen, Mr. Henrik Juul | male | 26.0 | 1 | 0 | 350025 | 7.8542 | NaN | S |
| 4 | 814 | 0 | 3 | Andersson, Miss. Ebba Iris Alfrida | female | 6.0 | 4 | 2 | 347082 | 31.275 | NaN | S |


|   | Cabin | Age | Embarked | PassengerId | Survived | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| missing_count | 553 | 140 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
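
Raw counts are easier to judge as shares of the 712 rows. The sketch below recomputes those shares from the numbers printed above; in the notebook itself, `titanic.isnull().mean()` should give the same fractions directly.

```python
# Counts copied from the missing-value table above; 712 is the row count from step 3.
N_ROWS = 712
missing_counts = {"Cabin": 553, "Age": 140, "Embarked": 2}

# Percentage of rows with a missing value, per column.
missing_pct = {col: round(100 * n / N_ROWS, 1) for col, n in missing_counts.items()}
print(missing_pct)  # {'Cabin': 77.7, 'Age': 19.7, 'Embarked': 0.3}
```

With roughly three quarters of `Cabin` missing, a presence flag (as in step 7) is a more defensible use of the column than imputing a value.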


### 🧮 **5. Categorical Value Counts (Embarked) <a id="5-categorical-value-counts"></a>**

In [5]:
# -------------------------------------------------------------------
# 🧭 Inspect 'Embarked' distribution (pre-imputation)
# -------------------------------------------------------------------
print("Embarked value counts (before imputation):")
print(titanic["Embarked"].value_counts(dropna=False))

Embarked value counts (before imputation):
Embarked
S      525
C      125
Q       60
NaN      2
Name: count, dtype: int64


### 🧼 **6. Impute & Encode <a id="6-impute--encode"></a>**

In [6]:
# -------------------------------------------------------------------
# 🧼 Impute numeric and categorical features
# -------------------------------------------------------------------
# Fill Age with median
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())

# Fill Embarked with mode
titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])

# Fill Fare with median
titanic['Fare'] = titanic['Fare'].fillna(titanic['Fare'].median())

# -------------------------------------------------------------------
# 🔢 Encode categorical features
# -------------------------------------------------------------------
# Sex: male=0, female=1
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})

# Embarked as integer codes (categories sort alphabetically: C=0, Q=1, S=2; exact numbers not important for trees)
titanic['Embarked'] = titanic['Embarked'].astype('category').cat.codes

# Post-imputation check
print("Embarked value counts (after imputation & encoding):")
print(titanic["Embarked"].value_counts())

Embarked value counts (after imputation & encoding):
Embarked
2    527
0    125
1     60
Name: count, dtype: int64
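
The codes in the output follow from how pandas builds categories: distinct labels are sorted, so `S` receives the highest code. A small sketch on a toy series illustrates this, along with an explicit mapping as an alternative. `EMBARKED_CODES` is an assumed name chosen for this example.

```python
import pandas as pd

ports = pd.Series(["S", "C", "Q", "S"])

# astype('category') sorts the distinct labels, so codes follow alphabetical order.
as_category = ports.astype("category")
print(list(as_category.cat.categories))  # ['C', 'Q', 'S']
print(as_category.cat.codes.tolist())    # [2, 0, 1, 2]

# An explicit mapping yields the same codes and stays stable even if a port
# is absent from a later file (assumed mapping, chosen to match the codes above).
EMBARKED_CODES = {"C": 0, "Q": 1, "S": 2}
explicit = ports.map(EMBARKED_CODES)
print(explicit.tolist())  # [2, 0, 1, 2]
```

The explicit dictionary matters mostly at inference time: category codes derived from a file that lacks one port would shift the remaining codes.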


### 🧩 **7. Derived Features <a id="7-derived-features"></a>**

In [7]:
# -------------------------------------------------------------------
# 👨‍👩‍👧 Family-based features
# -------------------------------------------------------------------
titanic['Familysize'] = titanic['SibSp'] + titanic['Parch'] + 1
titanic['Isalone'] = (titanic['Familysize'] == 1).astype(int)

# -------------------------------------------------------------------
# 🛳️ Cabin indicator (presence/absence)
# -------------------------------------------------------------------
titanic['HasCabin'] = titanic['Cabin'].notnull().astype(int)

# -------------------------------------------------------------------
# 🏷️ Title extraction (Mr/Miss/Mrs/Master/Rare)
# -------------------------------------------------------------------
title_map = {'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3}
titanic['Title'] = (
    titanic['Name']
    .str.extract(r' ([A-Za-z]+)\.', expand=False)
    .replace({
        # Normalise variant spellings (Mlle/Ms → Miss, Mme → Mrs), then group less common honorifics as 'Rare'
        'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs',
        'Lady': 'Rare', 'Countess': 'Rare', 'Sir': 'Rare',
        'Jonkheer': 'Rare', 'Don': 'Rare', 'Dona': 'Rare',
        'Col': 'Rare', 'Major': 'Rare', 'Capt': 'Rare',
        'Dr': 'Rare', 'Rev': 'Rare'
    })
    .map({**title_map, 'Rare': 4})
    .fillna(4)  # Unknown → Rare
    .astype(int)
)

# -------------------------------------------------------------------
# ➗ Simple interaction features (often useful for trees)
# -------------------------------------------------------------------
titanic['Pclass_Fare'] = titanic['Pclass'] * titanic['Fare']
titanic['Age_Fare'] = titanic['Age'] * titanic['Fare']

# Preview engineered columns
display(titanic[['Survived','Pclass','Sex','Age','Fare','Embarked',
                 'Familysize','Isalone','HasCabin','Title','Pclass_Fare','Age_Fare']].head(3))

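|   | Survived | Pclass | Sex | Age | Fare | Embarked | Familysize | Isalone | HasCabin | Title | Pclass_Fare | Age_Fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 45.5 | 28.5 | 2 | 1 | 1 | 1 | 0 | 28.5 | 1296.75 |
| 1 | 0 | 2 | 0 | 23.0 | 13.0 | 2 | 1 | 1 | 0 | 0 | 26.0 | 299.0 |
| 2 | 0 | 3 | 0 | 32.0 | 7.925 | 2 | 1 | 1 | 0 | 0 | 23.775 | 253.6 |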
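
To make the title step easier to follow, here is a plain-Python sketch of the same logic applied one name at a time. The regex and the codes are taken from the cell above. `title_code` is a hypothetical helper, and the last two names are made up for illustration. Because every title outside the four main groups ends up as 4, the explicit 'Rare' list is folded into the default here.

```python
import re

TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")  # same pattern as the cell above

ALIASES = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
TITLE_CODES = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3}
RARE_CODE = 4  # also used when no title can be extracted


def title_code(name: str) -> int:
    """Return the integer title code for a passenger name."""
    match = TITLE_PATTERN.search(name)
    if match is None:
        return RARE_CODE
    title = ALIASES.get(match.group(1), match.group(1))
    return TITLE_CODES.get(title, RARE_CODE)


names = [
    "Partner, Mr. Austen",
    "Andersson, Miss. Ebba Iris Alfrida",
    "Doe, Dr. Jane",    # made-up name
    "Roe, Mlle. Anne",  # made-up name
]
print([title_code(n) for n in names])  # [0, 1, 4, 1]
```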


### 🎯 **8. Define X/y + Class Balance <a id="8-define-xy--class-balance"></a>**

In [8]:
# -------------------------------------------------------------------
# 🎯 Feature matrix (X) and target vector (y)
# -------------------------------------------------------------------
FEATURES = [
    'Pclass', 'Sex', 'Age', 'Fare', 'Embarked',
    'Familysize', 'Isalone', 'HasCabin', 'Title',
    'Pclass_Fare', 'Age_Fare'
]
X = titanic[FEATURES].copy()
y = titanic['Survived'].astype(int).copy()

print("X shape:", X.shape, "| y shape:", y.shape)
print("\nClass balance (original):")
print(y.value_counts().rename({0: "Not Survived", 1: "Survived"}))

X shape: (712, 11) | y shape: (712,)

Class balance (original):
Survived
Not Survived    444
Survived        268
Name: count, dtype: int64


## ⚖️ **Resampling & Split**

### 🧪 **9. SMOTE Oversampling <a id="9-smote-oversampling"></a>**

In [9]:
# -------------------------------------------------------------------
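# ⚖️ Balance classes using SMOTE.
# Caution: resampling is applied before the train/test split here, so synthetic rows
# can be interpolated from rows that later land in the test set (and vice versa).
# Fitting SMOTE on the training split only avoids that leakage.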
# -------------------------------------------------------------------
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("After SMOTE:")
print(pd.Series(y_resampled).value_counts().rename({0: "Not Survived", 1: "Survived"}))
print("X_resampled shape:", X_resampled.shape, "| y_resampled shape:", y_resampled.shape)

After SMOTE:
Survived
Not Survived    444
Survived        444
Name: count, dtype: int64
X_resampled shape: (888, 11) | y_resampled shape: (888,)


### ✂️ **10. Train/Test Split <a id="10-traintest-split"></a>**

In [10]:
# -------------------------------------------------------------------
# ✂️ Stratified 80/20 hold-out split of the resampled data (the test set therefore includes SMOTE-generated rows)
# -------------------------------------------------------------------

X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled,
    test_size=0.20,
    random_state=42,
    stratify=y_resampled
)

print("Train shapes:", X_train.shape, y_train.shape)
print("Test  shapes:", X_test.shape,  y_test.shape)

Train shapes: (710, 11) (710,)
Test  shapes: (178, 11) (178,)


## 🧠 **Modelling & Evaluation**

### 🌲 **11. Random Forest + Randomised Search <a id="11-random-forest--randomised-search"></a>**

In [11]:
# -------------------------------------------------------------------
# 🌲 Random Forest with a compact RandomizedSearchCV
# -------------------------------------------------------------------
param_distributions = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

rf = RandomForestClassifier(random_state=42, n_jobs=-1)

random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
    verbose=0
)

random_search.fit(X_train, y_train)

best_rf = random_search.best_estimator_
print("✅ Best Params:", random_search.best_params_)

✅ Best Params: {'n_estimators': 300, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_depth': 30}
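
To put `n_iter=10` in context, the sketch below counts how many combinations the grid above contains and how many model fits the search performs. It only does the arithmetic and does not train anything.

```python
from itertools import product

# Same grid as the cell above.
param_distributions = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, 30],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

all_combos = list(product(*param_distributions.values()))
n_iter, cv = 10, 3
total_cv_fits = n_iter * cv

print(len(all_combos))                      # 36
print(total_cv_fits)                        # 30 (plus one final refit of the best candidate)
print(round(n_iter / len(all_combos), 2))   # 0.28
```

Because every parameter is given as a list, scikit-learn samples candidates without replacement, so the search covers 10 distinct combinations out of 36. A full grid search over this space would cost 108 cross-validation fits.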


### 📈 **12. Evaluate Best Model <a id="12-evaluate-best-model"></a>**

In [12]:
# -------------------------------------------------------------------
# 📈 Predict, score, and print simple diagnostics
# -------------------------------------------------------------------
y_pred = best_rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred)

print(f"🎯 Random Forest Accuracy: {rf_accuracy:.3f}\n")

print("Classification Report:")
print(classification_report(y_test, y_pred, digits=3))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Optional: top feature importances (quick look)
fi = pd.Series(best_rf.feature_importances_, index=FEATURES).sort_values(ascending=False)
display(fi.to_frame("importance").T)

🎯 Random Forest Accuracy: 0.865

Classification Report:
              precision    recall  f1-score   support

           0      0.822     0.933     0.874        89
           1      0.922     0.798     0.855        89

    accuracy                          0.865       178
   macro avg      0.872     0.865     0.865       178
weighted avg      0.872     0.865     0.865       178

Confusion Matrix:
[[83  6]
 [18 71]]


|   | Title | Age_Fare | Fare | Sex | Pclass_Fare | Age | Pclass | Familysize | Embarked | Isalone | HasCabin |
|---|---|---|---|---|---|---|---|---|---|---|---|
| importance | 0.199293 | 0.126601 | 0.125866 | 0.121126 | 0.114641 | 0.099181 | 0.0806 | 0.053799 | 0.039968 | 0.024753 | 0.014171 |
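
The figures in the classification report can be reproduced by hand from the confusion matrix, which makes it clearer where the errors sit. A short sketch using the matrix printed above (scikit-learn's convention: rows are true classes, columns are predicted classes):

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
cm = [[83, 6],
      [18, 71]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def prf(cls):
    """Return (precision, recall, f1) for class index `cls`, rounded to 3 decimals."""
    tp = cm[cls][cls]
    predicted = cm[0][cls] + cm[1][cls]  # column sum: everything predicted as `cls`
    actual = sum(cm[cls])                # row sum: everything that truly is `cls`
    precision, recall = tp / predicted, tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 3), round(recall, 3), round(f1, 3)


print(round(accuracy, 3))  # 0.865
print(prf(0))              # (0.822, 0.933, 0.874)
print(prf(1))              # (0.922, 0.798, 0.855)
```

Most of the errors are survivors predicted as non-survivors: 18 of 89, against 6 of 89 in the other direction.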
