# Miller Building a Classifier
**Author:** Dan Miller

**Date:** November 2nd, 2025

**Objective:** Build and evaluate three classifiers on the Titanic dataset, then compare how well they predict survival across different feature sets.

## Introduction
This project compares the performance of three classifiers: Decision Tree, Support Vector Machine, and Neural Network. Each classifier is trained on the Titanic dataset to predict the target 'survived'. The data is first explored and prepared, including some feature engineering. Each classifier is then trained on three separate feature sets. Once all three classifiers have been trained and compared, a summary at the end discusses the findings.

## Section 1. Import and Inspect the Data

### 1.1 Import the necessary libraries

In [61]:
# Imports

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, plot_tree

### 1.2 Load the dataset and display a few records

In [62]:
# Load Titanic dataset
titanic = sns.load_dataset("titanic")

titanic.head()

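| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |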


## Section 2. Data Exploration and Preparation

### 2.1 Handle Missing Values and Clean Data

The Titanic dataset was already explored thoroughly in ml02_miller.ipynb, so the required cleaning steps are known in advance.

In [63]:
# Impute missing values for age using the median

median_age = titanic["age"].median()
titanic["age"] = titanic["age"].fillna(median_age)

In [64]:
# Fill in missing embark_town values using the mode

mode_embark = titanic["embark_town"].mode()[0]
titanic["embark_town"] = titanic["embark_town"].fillna(mode_embark)
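
A quick check after imputation can confirm that no gaps remain in the two columns. The sketch below is a minimal illustration on a small hypothetical frame (`toy`), not on the real dataset, so the values are made up; only the imputation strategy mirrors the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (not real records).
toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 26.0, 35.0, np.nan],
        "embark_town": ["Southampton", "Cherbourg", None, "Southampton", "Queenstown"],
    }
)

# Same strategy as above: median for the numeric column, mode for the categorical one.
toy["age"] = toy["age"].fillna(toy["age"].median())
toy["embark_town"] = toy["embark_town"].fillna(toy["embark_town"].mode()[0])

# Count the remaining missing values per column; both counts should be zero.
missing_after = toy[["age", "embark_town"]].isna().sum()
print(missing_after)
```

Running the same `isna().sum()` check on `titanic[["age", "embark_town"]]` would serve the same purpose in the notebook itself.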

### 2.2 Feature Engineering

In [65]:
# Create new feature: family_size
titanic["family_size"] = titanic["sibsp"] + titanic["parch"] + 1

# Map categories to numeric values

titanic["sex"] = titanic["sex"].map({"male": 0, "female": 1})
titanic["embarked"] = titanic["embarked"].map({"C": 0, "Q": 1, "S": 2})

# Convert 'alone' to numeric binary
titanic["alone"] = titanic["alone"].astype(int)
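
One point worth checking here: `Series.map` with a dictionary returns NaN for any value that is missing or not a key in the dictionary. Section 2.1 imputes `embark_town` but not `embarked`, so any missing `embarked` entries would stay NaN after the mapping. The sketch below shows the effect on a hypothetical toy frame; the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows (not real passengers); one 'embarked' value is missing.
toy = pd.DataFrame(
    {
        "sibsp": [1, 0, 0, 3],
        "parch": [0, 0, 2, 1],
        "sex": ["male", "female", "female", "male"],
        "embarked": ["S", "C", None, "Q"],
    }
)

# Same engineering steps as the cell above.
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})
toy["embarked"] = toy["embarked"].map({"C": 0, "Q": 1, "S": 2})

# Values that were missing (or not in the dictionary) come back as NaN.
unmapped = int(toy["embarked"].isna().sum())
print(toy)
print("Unmapped embarked values:", unmapped)
```

None of the three feature sets used below includes `embarked`, so this does not affect the results in this notebook, but it would matter if the column were added as a feature later.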

## Section 3. Feature Selection and Justification

### 3.1 Choose features and target

While 'survived' will always be the target, there will be three different input cases:

Case 1:
- input feature: alone
- target: survived

Case 2:
- input feature: age
- target: survived

Case 3:
- input features: age, family_size
- target: survived

### 3.2 Define X and y

In [66]:
# Case 1: Feature = alone

X1 = titanic[["alone"]]

y1 = titanic["survived"]

In [67]:
# Case 2: Feature = age (dropna() is only a safeguard; age was already imputed in 2.1)

X2 = titanic[["age"]].dropna()

y2 = titanic.loc[X2.index, "survived"]

In [68]:
# Case 3: Features = age & family_size (dropna() is only a safeguard; age was already imputed in 2.1)

X3 = titanic[["age", "family_size"]].dropna()

y3 = titanic.loc[X3.index, "survived"]
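
The three cells above follow the same pattern, so they could also be written once as a small helper. The sketch below is one possible way to do that; `make_xy`, `feature_sets`, and the toy frame are hypothetical names and data introduced only for illustration.

```python
import pandas as pd


def make_xy(df, features, target="survived"):
    """Return (X, y), dropping rows where any selected feature is missing."""
    X = df[features].dropna()
    # Align the target with the rows that survived the dropna().
    y = df.loc[X.index, target]
    return X, y


# Hypothetical toy frame with one missing age to show the alignment.
toy = pd.DataFrame(
    {
        "alone": [0, 1, 1, 0],
        "age": [22.0, None, 26.0, 35.0],
        "family_size": [2, 1, 1, 4],
        "survived": [0, 1, 1, 0],
    }
)

feature_sets = {
    "Case 1": ["alone"],
    "Case 2": ["age"],
    "Case 3": ["age", "family_size"],
}

# Build every (X, y) pair in one pass.
cases = {name: make_xy(toy, cols) for name, cols in feature_sets.items()}

for name, (X, y) in cases.items():
    print(name, list(X.columns), "rows:", len(X))
```

Indexing the target with `X.index` keeps `X` and `y` aligned even when `dropna()` removes rows, which is the same idea as `titanic.loc[X2.index, "survived"]` above.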

### Reflection 3:

1) Why are these features selected? **Age, being alone, and family size are related to one another and can interact. These relationships can give the ML models signals for separating survivors from non-survivors. For instance, a child who is not alone most likely has a better chance of survival than an adult who is alone.**

2) Are there features that are likely to be highly predictive of survival? **Of the features chosen, age seems the most likely to be highly predictive of survival, since children had a much higher survival rate.**
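
Expectations like these can be checked before modelling by comparing survival rates across groups. The sketch below is a minimal illustration on hypothetical toy data; the age bands (`child`, `adult`, `senior`) and their cut points are arbitrary assumptions, not something defined in this notebook.

```python
import pandas as pd

# Hypothetical toy data (invented values, not real passengers).
toy = pd.DataFrame(
    {
        "age": [4, 9, 30, 45, 62, 28, 7, 51],
        "alone": [0, 0, 1, 1, 1, 0, 0, 1],
        "survived": [1, 1, 0, 0, 0, 1, 0, 1],
    }
)

# Arbitrary illustrative bands: (0, 12], (12, 60], (60, 120].
toy["age_band"] = pd.cut(
    toy["age"], bins=[0, 12, 60, 120], labels=["child", "adult", "senior"]
)

# The mean of a 0/1 target within a group is that group's survival rate.
rate_by_band = toy.groupby("age_band", observed=True)["survived"].mean()
rate_by_alone = toy.groupby("alone")["survived"].mean()

print(rate_by_band)
print(rate_by_alone)
```

Applied to the real `titanic` frame, the same two `groupby` calls would show whether the data supports the reflections above.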

## Section 4. Train a Classification Model (Decision Tree)

### 4.1 Split the Data

In [69]:
# Case 1: Feature = alone

splitter1 = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=123)
for train_idx1, test_idx1 in splitter1.split(X1, y1):
    X1_train, X1_test = X1.iloc[train_idx1], X1.iloc[test_idx1]
    y1_train, y1_test = y1.iloc[train_idx1], y1.iloc[test_idx1]

print("Case 1 - Alone:")
print("Train size:", len(X1_train), "| Test size:", len(X1_test))

Case 1 - Alone:
Train size: 712 | Test size: 179


In [70]:
# Case 2: Feature = age

splitter2 = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=123)
for train_idx2, test_idx2 in splitter2.split(X2, y2):
    X2_train, X2_test = X2.iloc[train_idx2], X2.iloc[test_idx2]
    y2_train, y2_test = y2.iloc[train_idx2], y2.iloc[test_idx2]

print("Case 2 - Age:")
print("Train size:", len(X2_train), "| Test size:", len(X2_test))

Case 2 - Age:
Train size: 712 | Test size: 179


In [71]:
# Case 3: Features = age & family_size

splitter3 = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=123)
for train_idx3, test_idx3 in splitter3.split(X3, y3):
    X3_train, X3_test = X3.iloc[train_idx3], X3.iloc[test_idx3]
    y3_train, y3_test = y3.iloc[train_idx3], y3.iloc[test_idx3]

print("Case 3 - Age & Family Size:")
print("Train size:", len(X3_train), "| Test size:", len(X3_test))

Case 3 - Age & Family Size:
Train size: 712 | Test size: 179
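
The same split logic is repeated for each case, so it could be wrapped in a helper, and the stratification can be verified by comparing class proportions in the two partitions. The sketch below assumes a hypothetical helper name (`stratified_split`) and randomly generated stand-in data with the same number of rows as the Titanic dataset (891).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit


def stratified_split(X, y, test_size=0.2, random_state=123):
    """Single stratified train/test split; returns X_train, X_test, y_train, y_test."""
    splitter = StratifiedShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    # n_splits=1, so the generator yields exactly one pair of index arrays.
    train_idx, test_idx = next(splitter.split(X, y))
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]


# Randomly generated stand-in data (not the Titanic records).
rng = np.random.default_rng(0)
n_rows = 891
X_toy = pd.DataFrame({"alone": rng.integers(0, 2, size=n_rows)})
y_toy = pd.Series(rng.integers(0, 2, size=n_rows), name="survived")

X_tr, X_te, y_tr, y_te = stratified_split(X_toy, y_toy)

print("Train size:", len(X_tr), "| Test size:", len(X_te))
# With stratification, the positive-class share should be nearly identical.
print("Positive share (train):", round(y_tr.mean(), 3))
print("Positive share (test): ", round(y_te.mean(), 3))
```

With 891 rows and `test_size=0.2`, this produces the same 712/179 sizes reported above. Using `next(...)` instead of a `for` loop is a stylistic choice; both give the same single split.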


### 4.2 Create and Train Model (Decision Tree)

In [72]:
# Case 1: Decision Tree using alone

tree_model1 = DecisionTreeClassifier()
tree_model1.fit(X1_train, y1_train)

| Parameter | Value |
|---|---|
| criterion | 'gini' |
| splitter | 'best' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | None |
| random_state | None |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |


In [73]:
# Case 2: Decision Tree using age

tree_model2 = DecisionTreeClassifier()
tree_model2.fit(X2_train, y2_train)

| Parameter | Value |
|---|---|
| criterion | 'gini' |
| splitter | 'best' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | None |
| random_state | None |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |


In [74]:
# Case 3: Decision Tree using age & family_size

tree_model3 = DecisionTreeClassifier()
tree_model3.fit(X3_train, y3_train)

| Parameter | Value |
|---|---|
| criterion | 'gini' |
| splitter | 'best' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | None |
| random_state | None |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
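
The default settings shown above place no limit on tree depth (`max_depth` is unset), so a tree on continuous features such as age can grow until it fits the training data very closely. The sketch below illustrates, on randomly generated stand-in data, how the size of a fitted tree can be inspected and how a depth cap changes it. The `max_depth=2` value and the `random_state=0` seed are arbitrary assumptions for illustration, not settings used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Randomly generated stand-in data (labels are pure noise, not Titanic records).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    {
        "age": rng.uniform(1, 80, size=200).round(),
        "family_size": rng.integers(1, 8, size=200),
    }
)
y_toy = pd.Series(rng.integers(0, 2, size=200), name="survived")

# Unrestricted tree (as in the cells above) versus a depth-capped tree.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

print("Unrestricted: depth", full_tree.get_depth(), "| leaves", full_tree.get_n_leaves())
print("Capped:       depth", small_tree.get_depth(), "| leaves", small_tree.get_n_leaves())

# Training accuracy: the unrestricted tree fits noise far more closely.
print("Train accuracy (unrestricted):", round(full_tree.score(X_toy, y_toy), 3))
print("Train accuracy (capped):      ", round(small_tree.score(X_toy, y_toy), 3))

# Text rendering of the capped tree's decision rules.
rules = export_text(small_tree, feature_names=list(X_toy.columns))
print(rules)
```

Because the labels here are random, any high training accuracy from the unrestricted tree reflects memorisation rather than signal, which is the behaviour to watch for when training and test scores diverge.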


### 4.3 Predict and Evaluate Model Performance

In [75]:
# Case 1
# Predict on training data
y1_pred = tree_model1.predict(X1_train)
print("Results for Decision Tree on training data (Case 1 - alone):")
print(classification_report(y1_train, y1_pred))

# Predict on test data
y1_test_pred = tree_model1.predict(X1_test)
print("Results for Decision Tree on test data (Case 1 - alone):")
print(classification_report(y1_test, y1_test_pred))

Results for Decision Tree on training data (Case 1 - alone):
              precision    recall  f1-score   support

           0       0.69      0.69      0.69       439
           1       0.50      0.51      0.51       273

    accuracy                           0.62       712
   macro avg       0.60      0.60      0.60       712
weighted avg       0.62      0.62      0.62       712

Results for Decision Tree on test data (Case 1 - alone):
              precision    recall  f1-score   support

           0       0.71      0.65      0.68       110
           1       0.51      0.58      0.54        69

    accuracy                           0.63       179
   macro avg       0.61      0.62      0.61       179
weighted avg       0.64      0.63      0.63       179



In [76]:
# Case 2
# Predict on training data
y2_pred = tree_model2.predict(X2_train)
print("Results for Decision Tree on training data (Case 2 - age):")
print(classification_report(y2_train, y2_pred))

# Predict on test data
y2_test_pred = tree_model2.predict(X2_test)
print("Results for Decision Tree on test data (Case 2 - age):")
print(classification_report(y2_test, y2_test_pred))

Results for Decision Tree on training data (Case 2 - age):
              precision    recall  f1-score   support

           0       0.68      0.92      0.78       439
           1       0.69      0.29      0.41       273

    accuracy                           0.68       712
   macro avg       0.68      0.61      0.60       712
weighted avg       0.68      0.68      0.64       712

Results for Decision Tree on test data (Case 2 - age):
              precision    recall  f1-score   support

           0       0.63      0.89      0.74       110
           1       0.50      0.17      0.26        69

    accuracy                           0.61       179
   macro avg       0.57      0.53      0.50       179
weighted avg       0.58      0.61      0.55       179



In [77]:
# Case 3
# Predict on training data
y3_pred = tree_model3.predict(X3_train)
print("Results for Decision Tree on training data (Case 3 - age & family_size):")
print(classification_report(y3_train, y3_pred))

# Predict on test data
y3_test_pred = tree_model3.predict(X3_test)
print("Results for Decision Tree on test data (Case 3 - age & family_size):")
print(classification_report(y3_test, y3_test_pred))

Results for Decision Tree on training data (Case 3 - age & family_size):
              precision    recall  f1-score   support

           0       0.77      0.90      0.83       439
           1       0.77      0.56      0.65       273

    accuracy                           0.77       712
   macro avg       0.77      0.73      0.74       712
weighted avg       0.77      0.77      0.76       712

Results for Decision Tree on test data (Case 3 - age & family_size):
              precision    recall  f1-score   support

           0       0.65      0.75      0.69       110
           1       0.46      0.35      0.40        69

    accuracy                           0.59       179
   macro avg       0.55      0.55      0.54       179
weighted avg       0.57      0.59      0.58       179
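
In the test reports above, recall for class 1 is well below recall for class 0 (for example, 0.17 versus 0.89 in Case 2), meaning most actual survivors are predicted as non-survivors. A confusion matrix makes the underlying counts explicit. The sketch below shows how precision, recall, and accuracy follow from those counts; the label arrays are hypothetical and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (illustrative only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Metrics for class 1 (survived), derived directly from the counts.
precision_1 = tp / (tp + fp)  # of predicted survivors, how many were correct
recall_1 = tp / (tp + fn)     # of actual survivors, how many were found
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Precision (1):", round(precision_1, 2))
print("Recall (1):   ", round(recall_1, 2))
print("Accuracy:     ", accuracy)
```

In the notebook, the same call would be `confusion_matrix(y2_test, y2_test_pred)` (and likewise for the other cases), using the variables defined in the cells above.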

