# Lab 3 Project (Titanic)
Jason Ballard
31 March 2025

Import the external Python libraries used (e.g., pandas, numpy, matplotlib, seaborn, sklearn and more).

# Section 1. Import and Inspect the Data

In [1]:
# all imports get moved to the top - import each only once

import seaborn as sns
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np

In [2]:
# Load Titanic dataset
df = sns.load_dataset('titanic')

features = list(df.columns)
print(features)
print(len(features))

['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']
15


In [3]:
# The DataFrame is named df in this notebook (not titanic)
# df.info()
# print(df.head(10))
# df.isnull().sum()
# print(df.describe())
# print(df.corr(numeric_only=True))

# Section 2. Data Exploration and Preparation

In [4]:
# attributes = ['age', 'fare', 'pclass']
# scatter_matrix(df[attributes], figsize=(10, 10))

In [5]:
# plt.scatter(df['age'], df['fare'], c=df['sex'].apply(lambda x: 0 if x == 'male' else 1))
# plt.xlabel('Age')
# plt.ylabel('Fare')
# plt.title('Age vs Fare by Gender')
# plt.show()

In [6]:
# #Create a histogram of age:

# sns.histplot(df['age'], kde=True)
# plt.title('Age Distribution')
# plt.show()

In [7]:
# #Create a count plot for class and survival:

# sns.countplot(x='class', hue='survived', data=df)
# plt.title('Class Distribution by Survival')
# plt.show()

<!-- ### Reflection 2.1:

1. What patterns or anomalies do you notice? Most passengers were young to middle-aged, and the majority traveled in third class.
2. Do any features stand out as potential predictors? The deck location and the fare price.
3. Are there any visible class imbalances? Yes, there are large class imbalances. The majority of the passengers were younger families traveling to the USA. -->

## 2.1 Handle Missing Values and Clean Data

In [8]:
# Fill missing 'embark_town' values with the most common value (mode)
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])  # Fill with mode (most frequent)

# Fill missing 'sex' values with 'unknown' and then map to numeric values
df['sex'] = df['sex'].fillna('unknown')  # Fill missing 'sex' with 'unknown'
df['sex'] = df['sex'].map({'male': 0, 'female': 1, 'unknown': -1})  # Map 'male'/'female'/'unknown' to 0/1/-1
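
The cell above fills `embark_town` but leaves `age` as it is. As a minimal sketch of how a numeric column could be handled alongside a categorical one, the example below uses a small made-up frame (the values and the choice of median imputation are assumptions for illustration, not part of the lab):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two Titanic columns
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton", "Queenstown"],
})

# Count missing values per column before cleaning
missing_before = toy.isnull().sum()

# Numeric column: fill with the median of the observed values
toy["age"] = toy["age"].fillna(toy["age"].median())

# Categorical column: fill with the mode (most frequent value)
toy["embark_town"] = toy["embark_town"].fillna(toy["embark_town"].mode()[0])

# Count missing values again to confirm nothing is left
missing_after = toy.isnull().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
```

Whether the median is the right fill value for `age` depends on the analysis; dropping the affected rows is another option.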




## 2.2 Feature Engineering

In [9]:
# Create a new feature 'family_size' (sum of siblings/spouses and parents/children aboard)
df['family_size'] = df['sibsp'] + df['parch'] + 1  # +1 to include the passenger themselves

# Display the updated DataFrame
print(df)  # Or use df.head() to display the first few rows

     survived  pclass  sex   age  sibsp  parch     fare embarked   class  \
0           0       3    0  22.0      1      0   7.2500        S   Third   
1           1       1    1  38.0      1      0  71.2833        C   First   
2           1       3    1  26.0      0      0   7.9250        S   Third   
3           1       1    1  35.0      1      0  53.1000        S   First   
4           0       3    0  35.0      0      0   8.0500        S   Third   
..        ...     ...  ...   ...    ...    ...      ...      ...     ...   
886         0       2    0  27.0      0      0  13.0000        S  Second   
887         1       1    1  19.0      0      0  30.0000        S   First   
888         0       3    1   NaN      1      2  23.4500        S   Third   
889         1       1    0  26.0      0      0  30.0000        C   First   
890         0       3    0  32.0      0      0   7.7500        Q   Third   

       who  adult_male deck  embark_town alive  alone  family_size  
0      man        

### 2.2a Dynamically Encode embark_town

In [10]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the 'embark_town' column
df['embark_town_encoded'] = label_encoder.fit_transform(df['embark_town'].fillna('Unknown'))  # Handling NaN as 'Unknown'

# Display the unique encoded values
print(df['embark_town_encoded'].unique())

# Display the updated DataFrame
print(df)  # Or use df.head() to display the first few rows


[2 0 1]
     survived  pclass  sex   age  sibsp  parch     fare embarked   class  \
0           0       3    0  22.0      1      0   7.2500        S   Third   
1           1       1    1  38.0      1      0  71.2833        C   First   
2           1       3    1  26.0      0      0   7.9250        S   Third   
3           1       1    1  35.0      1      0  53.1000        S   First   
4           0       3    0  35.0      0      0   8.0500        S   Third   
..        ...     ...  ...   ...    ...    ...      ...      ...     ...   
886         0       2    0  27.0      0      0  13.0000        S  Second   
887         1       1    1  19.0      0      0  30.0000        S   First   
888         0       3    1   NaN      1      2  23.4500        S   Third   
889         1       1    0  26.0      0      0  30.0000        C   First   
890         0       3    0  32.0      0      0   7.7500        Q   Third   

       who  adult_male deck  embark_town alive  alone  family_size  \
0      ma
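
`LabelEncoder` assigns integer codes in sorted label order, so the codes printed above are not arbitrary. A minimal sketch on a short made-up Series (the variable names `towns`, `encoder`, and `mapping` are illustrative) shows how to inspect the mapping and reverse it:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of embarkation towns
towns = pd.Series(["Southampton", "Cherbourg", "Southampton", "Queenstown"])

encoder = LabelEncoder()
codes = encoder.fit_transform(towns)

# classes_ holds the labels in sorted order; the index of each label is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because the codes are plain integers, a model may treat them as ordered. For a nominal feature such as a port name, one-hot encoding is a common alternative.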

<!-- ### Reflection 2.3

1. Why might family size be a useful feature for predicting survival? Family size is a good predictor of survivability for the women and younger children in families.
2. Why convert categorical data to numeric? The conversion allows computations to be run on the data. -->

# Section 3. Feature Selection and Justification

- Select two or more input features (numerical for regression, numerical and/or categorical for classification)
- Use 'Survived' as the target

First:
- input features: alone
- target: survived

Second:
- input features: embark_town (encoded)
- target: survived

Third:
- input features: age, family_size, and embark_town (encoded)
- target: survived
- Justify your selection with reasoning.

## 3.1 Choose features and target

In [11]:
# Select relevant features for classification
features = ['alone', 'age', 'embark_town_encoded', 'sex', 'family_size']
target = 'survived'

# Extract relevant columns; .copy() avoids modifying a view of df
titanic_classification = df[features + [target]].copy()

# 'embark_town_encoded' is already numeric (LabelEncoder output from 2.2a),
# so no further mapping is needed. Mapping the integer codes through
# {'C': 0, 'Q': 1, 'S': 2} would turn every value into NaN.

# Check for missing values before dropping rows (if any)
print(titanic_classification.isnull().sum())  # Check for remaining missing values

# Drop rows with missing target (survived) values if necessary
titanic_classification = titanic_classification.dropna(subset=[target])

# Ensure 'alone' column is an integer (True/False -> 1/0)
titanic_classification['alone'] = titanic_classification['alone'].fillna(0).astype(int)  # 'alone' is binary

# Display the processed dataset
print(titanic_classification.head())

alone                    0
age                    177
embark_town_encoded      0
sex                      0
family_size              0
survived                 0
dtype: int64
   alone   age  embark_town_encoded  sex  family_size  survived
0      0  22.0                    2    0            2         0
1      0  38.0                    0    1            2         1
2      1  26.0                    2    1            1         1
3      0  35.0                    2    1            2         1
4      1  35.0                    2    0            1         0



## 3.2 Define X (features) and y (target)
- Assign the input features to X, a pandas DataFrame with one or more input features.
- Assign the target variable to y (as applicable), a pandas Series with a single target feature.
- Again, use comments to run a single case at a time.

- The following starts with only the statements needed for case 1.
- Double brackets [[ ]] make a 2D DataFrame.
- Single brackets [ ] make a 1D Series.
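
The bracket rule can be checked directly. The sketch below uses a three-row made-up frame (the data is hypothetical) and prints the type and shape that each form of indexing returns:

```python
import pandas as pd

# Hypothetical three-row frame with one feature and one target
toy = pd.DataFrame({"alone": [True, False, True], "survived": [0, 1, 1]})

X = toy[["alone"]]   # double brackets: a 2D DataFrame with one column
y = toy["survived"]  # single brackets: a 1D Series

print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```

scikit-learn estimators expect X to be 2D, which is why the double brackets are needed even for a single feature.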

In [12]:
# Case 1 Assign input features to X = (alone)
X = df[['alone']]
# Assign target variable to y (as applicable)
y = df['survived']

# Check the shapes of X and y
print(X.shape)  # Should be (n_samples, 1)
print(y.shape)  # Should be (n_samples,)

(891, 1)
(891,)


In [13]:
# Case 2 Assign input features to X = (embark_town, encoded)
X = df[['embark_town_encoded']]  # use the numeric codes; sklearn models cannot fit the raw strings
# Assign target variable to y (as applicable)   
y = df['survived']

# Check the shapes of X and y
print(X.shape)  # Should be (n_samples, 1)
print(y.shape)  # Should be (n_samples,)

(891, 1)
(891,)


In [14]:
# Case 3 Assign input features to X = (age, embark_town_encoded, family_size)
X = df[['age', 'embark_town_encoded', 'family_size']]
# Assign target variable to y (as applicable)
y = df['survived']

# Check the shapes of X and y
print(X.shape)  # Should be (n_samples, 3)
print(y.shape)  # Should be (n_samples,)

(891, 3)
(891,)


### Reflection 3:

1. Why are these features selected? **The selected features are the ones that tell the most about survivability.**
2. Are there any features that are likely to be highly predictive of survival? **Yes, age and class.**

# Section 4. Train a Classification Model (Decision Tree)

## 4.1 Train/Test Split (Stratified)

In [15]:
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=123)

for train_indices, test_indices in splitter.split(X, y):
    X_train = X.iloc[train_indices]
    X_test = X.iloc[test_indices]
    y_train = y.iloc[train_indices]
    y_test = y.iloc[test_indices]

print('Train size: ', len(X_train), 'Test size: ', len(X_test))

Train size:  712 Test size:  179


## 4.2 Train the Decision Tree

In [16]:
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
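
Once a tree is fitted, it can be scored on the held-out test set with the metrics imported at the top of the notebook. The following is a minimal sketch on synthetic data from `make_classification` (the data, the `_demo` names, and the `random_state` values are assumptions, so the numbers it prints say nothing about the Titanic results):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=123
)

# Fit the tree on the training portion only
tree = DecisionTreeClassifier(random_state=123)
tree.fit(X_tr, y_tr)

# Predict on the unseen test portion and summarize the results
y_pred = tree.predict(X_te)
acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)

print("Accuracy:", acc)
print("Confusion matrix:\n", cm)
print(classification_report(y_te, y_pred))
```

Scoring on the training set as well makes overfitting visible, since an unpruned tree often fits its training data almost perfectly.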

## 4.3 Compare Results


In [17]:
# Compare the class distributions
print("Original Class Distribution:\n", y.value_counts(normalize=True))
print("Train Set Class Distribution:\n", y_train.value_counts(normalize=True))
print("Test Set Class Distribution:\n", y_test.value_counts(normalize=True))

Original Class Distribution:
 survived
0    0.616162
1    0.383838
Name: proportion, dtype: float64
Train Set Class Distribution:
 survived
0    0.616573
1    0.383427
Name: proportion, dtype: float64
Test Set Class Distribution:
 survived
0    0.614525
1    0.385475
Name: proportion, dtype: float64


### Reflection 4:

1. Why might stratification improve model performance? **Stratification ensures that each class is represented in the train and test sets in the same proportions as in the whole dataset.**
2. How close are the training and test distributions to the original dataset? **Nearly identical. Both sets are within about 0.2 percentage points of the original 61.6% / 38.4% split.**
3. Which split method produced better class balance? **I am not sure, because the numbers are so close.**
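
To compare the two split methods side by side, both can be run on the same target. The sketch below uses a made-up target with a 60/40 class mix (the data and the `_demo`, `_basic`, and `_strat` names are hypothetical) and `train_test_split` with and without `stratify`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 60 zeros and 40 ones
y_demo = pd.Series([0] * 60 + [1] * 40)
X_demo = pd.DataFrame({"x": np.arange(len(y_demo))})

# Basic split: rows are shuffled without regard to the class mix
_, _, _, y_test_basic = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# Stratified split: the class mix of y_demo is preserved in each subset
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

print("Original share of class 1:  ", y_demo.mean())
print("Basic test share of class 1:", y_test_basic.mean())
print("Strat test share of class 1:", y_test_strat.mean())
```

The stratified test set reproduces the original share by construction, while the basic split can drift from it, and the drift tends to be larger on small datasets.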

# Section 5. Compare Alternative Models (SVC, NN)

In a Support Vector Machine, the kernel function defines how the algorithm transforms data to find a hyperplane that separates the classes. If the data is not linearly separable, changing the kernel can help the model find a better decision boundary.

SVC Kernel: Common Types

- RBF (Radial Basis Function) – Most commonly used; handles non-linear data well (default)
- Linear – Best for linearly separable data (straight line separation)
- Polynomial – Useful when the data follows a curved pattern
- Sigmoid – Similar to a neural network activation function; less common

Commenting the options in and out in the code can be helpful. The analyst decides which to use based on their understanding of the results.
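
As an alternative to commenting kernels in and out, the four options can be fitted in a loop and their test accuracy collected in one place. This is a minimal sketch on synthetic `make_moons` data (the data, the `_demo` names, and the `scores` dictionary are assumptions for illustration); which kernel does best depends on the dataset:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data that is not linearly separable
X_demo, y_demo = make_moons(n_samples=200, noise=0.2, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=123
)

# Fit one SVC per kernel and record its accuracy on the test set
scores = {}
for kernel in ["rbf", "linear", "poly", "sigmoid"]:
    model = SVC(kernel=kernel)
    model.fit(X_tr, y_tr)
    scores[kernel] = model.score(X_te, y_te)

for kernel, acc in scores.items():
    print(f"{kernel:8s} accuracy: {acc:.3f}")
```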

In [18]:
# RBF Kernel (default) - same as calling SVC()
svc_model = SVC(kernel='rbf')
svc_model.fit(X_train, y_train)

# Linear Kernel
svc_model = SVC(kernel='linear')
svc_model.fit(X_train, y_train)

# Polynomial Kernel (e.g., with degree=3)
svc_model = SVC(kernel='poly', degree=3)
svc_model.fit(X_train, y_train)

# Sigmoid Kernel
svc_model = SVC(kernel='sigmoid')
svc_model.fit(X_train, y_train)

ValueError: Input X contains NaN.
SVC does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
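
The error is raised because `age` still contains missing values when X reaches `SVC`. One way to handle this, following the hint in the message, is to put an imputer in front of the classifier. The sketch below is a minimal example on a small made-up frame; `SimpleImputer` is not among the notebook's imports, and the median strategy and the toy values are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy features with missing ages, mimicking the Titanic columns
toy_X = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0],
    "family_size": [2, 2, 1, 2, 1, 1, 5, 3],
})
toy_y = pd.Series([0, 1, 1, 1, 0, 0, 0, 1])

# Impute missing values, scale the features, then fit the SVC
svc_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    SVC(kernel="rbf"),
)
svc_pipeline.fit(toy_X, toy_y)

# The pipeline applies the same imputation and scaling at prediction time
preds = svc_pipeline.predict(toy_X)
print(preds)
```

Keeping the imputer inside the pipeline means the fill value is learned from the training data only, which avoids leaking information from the test set.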