# Dowdle's Titanic Survival Prediction
**Author:** Brittany Dowdle  
**Date:** March 26, 2025  
**Objective:** Use the data you inspected, explored, and cleaned previously. Use 3 models to predict survival on the Titanic from various input features. Compare model performance.


## Introduction
This project uses the Titanic dataset to predict survival based on features such as class, sex, and family size. We will train multiple models, evaluate performance using key metrics, and create visualizations to interpret the results. We use three common classification models in this lab: Decision Tree Classifier (DT), Support Vector Machine (SVM), and Neural Network (NN).
****

## Imports
In the code cell below, import the necessary Python libraries for this notebook. All imports should be at the top of the notebook. 

In [2]:
# Import pandas for data manipulation and analysis (we might want to do more with it)
import pandas as pd

# Import numpy for numerical operations and array handling
import numpy as np

from pandas.plotting import scatter_matrix

# Import matplotlib for creating static visualizations
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# Import seaborn for statistical data visualization (built on matplotlib)
import seaborn as sns

# Import train_test_split for splitting data into training and test sets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

# Import LinearRegression for building a linear regression model
from sklearn.linear_model import LinearRegression

# Import performance metrics for model evaluation
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, r2_score

****
## Section 1. Import and Inspect the Data

We skip inspection here because the data was already inspected and explored in earlier work.

In [3]:
# Load the data
titanic = sns.load_dataset('titanic')

****

## Section 2. Data Exploration and Preparation
We might need to clean it or do some feature engineering. Learning to figure out what you need is a key skill.

### 2.1 Handle Missing Values and Clean Data

- Impute missing values for age using the median.
- Fill in missing values for embark_town using the mode.

In [8]:
# Impute missing values for age using the median 
titanic.fillna({'age': titanic['age'].median()}, inplace=True)

# Fill missing values for embark_town using the mode
titanic['embark_town'] = titanic['embark_town'].fillna(titanic['embark_town'].mode()[0])
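
As a quick check of the imputation above, the sketch below applies the same two fills to a small made-up frame and confirms that no missing values remain. The frame `toy` and its rows are assumptions for illustration, not Titanic records.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    'age': [22.0, np.nan, 38.0, 30.0],
    'embark_town': ['Southampton', 'Cherbourg', None, 'Southampton'],
})

# Same approach as above: median for the numeric column, mode for the categorical one
toy['age'] = toy['age'].fillna(toy['age'].median())
toy['embark_town'] = toy['embark_town'].fillna(toy['embark_town'].mode()[0])

# Count remaining missing values per column
missing_after = toy.isna().sum()
print(missing_after)
```

Running `titanic[['age', 'embark_town']].isna().sum()` after the cell above would be the equivalent check on the real data.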

### 2.2 Feature Engineering

- Add family_size - number of family members on board.
- Convert categorical "sex" to numeric.
- Convert categorical "embarked" to numeric.
- Binary feature - convert "alone" to numeric.

In [9]:
# Create family_size
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1

# Convert categorical to numeric
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})

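# Convert categorical to numeric (fill missing 'embarked' values with the mode first,
# otherwise map() leaves them as NaN)
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0]).map({'C': 0, 'Q': 1, 'S': 2})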

# Convert binary to numeric
titanic['alone'] = titanic['alone'].astype(int)
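
One thing worth checking after `map`: any value that is not a key in the dictionary, including a missing value, becomes NaN. The sketch below shows this on made-up port codes (the series `ports` is hypothetical, not taken from the dataset) and one possible way to handle it.

```python
import numpy as np
import pandas as pd

# Made-up port codes for illustration; the last entry is missing
ports = pd.Series(['S', 'C', 'S', 'Q', np.nan])
port_codes = {'C': 0, 'Q': 1, 'S': 2}

# Values absent from the dictionary (including NaN) map to NaN
encoded = ports.map(port_codes)
print(encoded.isna().sum())

# Filling with the mode first keeps every row encodable
encoded_filled = ports.fillna(ports.mode()[0]).map(port_codes)
print(encoded_filled.tolist())
```

Whether to fill or drop such rows is a modeling choice. The mode is used here only to mirror the `embark_town` step.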

****

## Section 3. Feature Selection and Justification

### 3.1 Choose features and target
For classification you need a categorical target variable (e.g., gender, species). Select two or more input features.

>Target: survived
>
>Input features: age, fare, pclass, sex, family_size

### 3.2 Define X and y

- Assign input features to X
- Assign target variable to y (as applicable)

In [9]:
X = titanic[['age', 'fare', 'pclass', 'sex', 'family_size']]
y = titanic['survived']

### Reflection 3:

1) Why are these features selected? **Age captures the survival advantage of younger passengers. Fare and pclass reflect ticket class: first class cost more and had the highest survival rate. Sex and family size relate to lifeboat prioritization.**
2) Are there any features that are likely to be highly predictive of survival? **Yes. Sex is likely the most predictive, since women were prioritized for lifeboats and survived at a much higher rate than men. Pclass is next, with first-class passengers surviving at much higher rates than those in third class.**
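
One way to test the claims in this reflection is to compute the survival rate per group. The sketch below defines a hypothetical helper, `survival_rate_by`, and runs it on made-up rows (not real Titanic records), with `sex` already encoded as 0 = male, 1 = female.

```python
import pandas as pd

def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the mean of 'survived' (the survival rate) for each value of `column`."""
    return df.groupby(column)['survived'].mean()

# Made-up rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    'sex': [0, 0, 0, 1, 1, 1],
    'pclass': [3, 3, 1, 1, 2, 3],
    'survived': [0, 0, 1, 1, 1, 0],
})

rates_by_sex = survival_rate_by(toy, 'sex')
rates_by_class = survival_rate_by(toy, 'pclass')
print(rates_by_sex)
print(rates_by_class)
```

On the notebook's data, `survival_rate_by(titanic, 'sex')` and `survival_rate_by(titanic, 'pclass')` would give the rates to compare.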

****

## Section 4. Splitting
Split the data into training and test sets using train_test_split first and StratifiedShuffleSplit second. Compare.

### 4.1 Basic Train/Test split

In [10]:
# Split data into a training set and a test set
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, test_size=0.2, random_state=123)

# Show set sizes
print('Train size:', len(X_train_b))
print('Test size:', len(X_test_b))

Train size: 712
Test size: 179


### 4.2 Stratified Train/Test split

In [11]:
# Define how many splits, % of data for testing, and ensure reproducibility
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=123)

# Split data into a training set and a test set (n_splits=1, so take the single split)
train_indices, test_indices = next(splitter.split(X, y))
X_train_s, X_test_s = X.iloc[train_indices], X.iloc[test_indices]
y_train_s, y_test_s = y.iloc[train_indices], y.iloc[test_indices]

# Show set sizes
print('Train size:', len(X_train_s))
print('Test size:', len(X_test_s))

Train size: 712
Test size: 179
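
When only one stratified split is needed, `train_test_split` also accepts a `stratify` argument, which can be a shorter route than `StratifiedShuffleSplit`. This is a minimal sketch on synthetic data. The names `X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's `X` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 40% positive class (not the Titanic data)
X_demo = pd.DataFrame({'feature': np.arange(100)})
y_demo = pd.Series([1] * 40 + [0] * 60, name='target')

# stratify=y_demo asks train_test_split to preserve the class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# Both subsets should show the same share of the positive class as y_demo
print(y_tr.mean(), y_te.mean())
```

`StratifiedShuffleSplit` remains the better fit when several stratified splits are wanted, for example with `n_splits` greater than 1.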


### 4.3 Compare Results

In [12]:
print("\nOriginal Class Distribution:\n", y.value_counts(normalize=True))
print("\nBasic Split Distribution - Train Set:\n", y_train_b.value_counts(normalize=True))
print("\nBasic Split Distribution - Test Set:\n", y_test_b.value_counts(normalize=True))
print("\nStratified Split Distribution - Train Set:\n", y_train_s.value_counts(normalize=True))
print("\nStratified Split Distribution - Test Set:\n", y_test_s.value_counts(normalize=True))


Original Class Distribution:
 survived
0    0.616162
1    0.383838
Name: proportion, dtype: float64

Basic Split Distribution - Train Set:
 survived
0    0.610955
1    0.389045
Name: proportion, dtype: float64

Basic Split Distribution - Test Set:
 survived
0    0.636872
1    0.363128
Name: proportion, dtype: float64

Stratified Split Distribution - Train Set:
 survived
0    0.616573
1    0.383427
Name: proportion, dtype: float64

Stratified Split Distribution - Test Set:
 survived
0    0.614525
1    0.385475
Name: proportion, dtype: float64


### Reflection 4:

1) Why might stratification improve model performance? **The target classes are imbalanced (about 62% did not survive and 38% survived), and stratification ensures that both the training and test sets keep the same class distribution as the original dataset. The model then trains on a more representative sample, and the test score becomes a more reliable estimate of performance.**
2) How close are the training and test distributions to the original dataset? **Stratified: both the training and test sets are within about 0.2 percentage points of the original proportions. Basic split: the training set is within about 0.5 points, but the test set deviates by about 2 points, with class 0 slightly overrepresented (63.7% vs. 61.6%).**
3) Which split method produced better class balance? **Stratified Split produced a better class balance because it preserved the original class proportions more accurately in both the training and test sets.**
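
The comparison in this reflection can be reduced to a single number per split: the largest absolute difference between a subset's class proportions and the original ones. The helper `max_proportion_gap` below is a hypothetical sketch, shown on made-up labels.

```python
import pandas as pd

def max_proportion_gap(y_full: pd.Series, y_subset: pd.Series) -> float:
    """Largest absolute difference in class proportions between a subset and the full target."""
    full = y_full.value_counts(normalize=True)
    # Align on the full set's classes; a class absent from the subset counts as 0
    subset = y_subset.value_counts(normalize=True).reindex(full.index, fill_value=0.0)
    return float((full - subset).abs().max())

# Made-up labels for illustration only
y_all = pd.Series([0] * 6 + [1] * 4)
y_part = pd.Series([0] * 4 + [1] * 1)

gap = max_proportion_gap(y_all, y_part)
print(round(gap, 3))
```

Based on the distributions printed in 4.3, `max_proportion_gap(y, y_test_b)` would come out near 0.021 and `max_proportion_gap(y, y_test_s)` near 0.002.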