# Titanic Survival Prediction

**Name:** Elen  
**Date:** March 18, 2025

## Introduction
In this project, we build machine learning models to predict whether a passenger on the Titanic survived. Using the Titanic dataset from Seaborn, we train three classification models (a Decision Tree Classifier, a Support Vector Machine (SVM), and a Neural Network (NN)), evaluate their performance, and interpret the results. The target variable is `survived`; the input features vary between experiments.

The steps involve data cleaning, feature engineering, model training, performance evaluation, and comparisons. We will explore different feature combinations to observe how they affect the accuracy of the models.

## Importing Libraries

In this section, we will import the necessary Python libraries to perform data manipulation, model training, and evaluation. These libraries will help us load the Titanic dataset, handle missing values, perform machine learning tasks, and visualize results.


In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

## Section 1: Import and Inspect the Data

In this section, we will load the Titanic dataset using the `seaborn` library, which provides easy access to the dataset. We'll perform a quick inspection of the data to understand its structure, including the number of rows and columns, data types, and any missing values.

### Load Titanic Dataset

We will use the `seaborn` library to load the Titanic dataset. This dataset includes information about passengers on the Titanic, including features like age, sex, class, and whether they survived.

In [3]:
# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Display the first few rows of the dataset
titanic.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
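
Beyond `head()`, the structure checks mentioned at the start of this section (row and column counts, data types, missing values) can be sketched as follows. The small `sample` frame is a hypothetical stand-in so the snippet runs without downloading anything; in the notebook the same calls would be applied to `titanic`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few Titanic-like rows (not the real data)
sample = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton"],
})

n_rows, n_cols = sample.shape      # number of rows and columns
dtypes = sample.dtypes             # data type of each column
missing = sample.isnull().sum()    # missing values per column

print(f"{n_rows} rows, {n_cols} columns")
print(dtypes)
print(missing)
```

Columns with a non-zero count in `missing` are the ones that need imputation before modelling.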


## Section 2: Data Exploration and Preparation

### 2.1 Handle Missing Values

In this step, we handle the missing values in the dataset. We impute `age` with the column median, `embark_town` with the mode (the most frequent value), and `deck` with a new `'Unknown'` category, since most passengers have no recorded deck.

In [18]:
# Fill missing 'age' values with the median (since it's a numerical column)
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# Fill missing 'embark_town' values with the mode (most frequent value)
titanic['embark_town'] = titanic['embark_town'].fillna(titanic['embark_town'].mode()[0])

# Handle 'deck' by filling with 'Unknown'
# First, check if 'deck' is a categorical column and set the categories before filling
if titanic['deck'].dtype.name == 'category':
    # Add 'Unknown' as a valid category
    titanic['deck'] = titanic['deck'].cat.add_categories(['Unknown'])

# Fill missing 'deck' values with 'Unknown'
titanic['deck'] = titanic['deck'].fillna('Unknown')

# After imputation, check if there are any remaining missing values
print("\nMissing values after all imputations:")
print(titanic.isnull().sum())


Missing values after all imputations:
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64
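
The three imputation strategies can be checked on a small example. This is a minimal sketch on a hypothetical `toy` frame, not the real dataset. It assumes `deck` is stored as a pandas categorical column, in which case `'Unknown'` has to be registered as a category first; filling with a value outside the existing categories raises an error.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per strategy
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 40.0],
    "embark_town": ["Southampton", None, "Southampton", "Cherbourg"],
    "deck": pd.Categorical([None, "C", None, "E"]),
})

# Numerical column: fill with the median of the observed values
toy["age"] = toy["age"].fillna(toy["age"].median())

# Text column: fill with the most frequent value
toy["embark_town"] = toy["embark_town"].fillna(toy["embark_town"].mode()[0])

# Categorical column: register the new category, then fill
toy["deck"] = toy["deck"].cat.add_categories(["Unknown"]).fillna("Unknown")

print(toy)
print(toy.isnull().sum())
```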


### 2.2 Feature Engineering

In this step, we create new features from the existing columns.

#### Family Size
The `family_size` feature adds `sibsp` (siblings/spouses aboard) and `parch` (parents/children aboard), plus 1 to count the passenger themselves.

In [24]:
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1
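
As a worked example of the arithmetic, the sketch below applies the same formula to three hypothetical passengers. The `is_alone` flag is an illustrative helper that is not part of the original notebook.

```python
import pandas as pd

# Hypothetical passengers: (sibsp, parch) = (1, 0), (0, 0), (3, 2)
toy = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 0, 2]})

# Relatives aboard plus the passenger themselves
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

# Illustrative helper: a family of one means the passenger travelled alone
toy["is_alone"] = toy["family_size"] == 1

print(toy)
```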

#### Age Binning (Optional)
We can group passengers into age categories: child, teenager, adult, and senior. A categorical age feature lets a model treat each group separately, which may capture effects that are not linear in raw age.

In [25]:
# Bin edges are right-inclusive: (0, 12], (12, 18], (18, 60], (60, 100]
bins = [0, 12, 18, 60, 100]
labels = ['child', 'teenager', 'adult', 'senior']

# Create a new column 'age_group' based on the defined bins and labels
titanic['age_group'] = pd.cut(titanic['age'], bins=bins, labels=labels)

In [26]:
print(titanic[['age', 'age_group']].head())

    age age_group
0  22.0     adult
1  38.0     adult
2  26.0     adult
3  35.0     adult
4  35.0     adult
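
The first rows all fall into the `adult` bin, so they do not show how boundary ages are handled. The sketch below runs the same bins on a few hypothetical ages that sit on or near the edges. With the default `right=True`, each bin includes its upper edge and excludes its lower edge.

```python
import pandas as pd

bins = [0, 12, 18, 60, 100]
labels = ['child', 'teenager', 'adult', 'senior']

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([0.42, 12, 12.5, 18, 60, 80])
groups = pd.cut(ages, bins=bins, labels=labels)

for age, group in zip(ages, groups):
    print(f"{age:>5} -> {group}")
```

An age of exactly 12 lands in `child` and 18 in `teenager`. An age of exactly 0 would fall outside the first bin and become `NaN` unless `include_lowest=True` is passed.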


#### Combining Pclass and Embarked (Optional)

We can create a new feature, `class_embarked`, to capture interactions between `pclass` (passenger class) and `embarked` (embarkation port):


In [31]:
# Replace NaN values in the 'embarked' column with a placeholder (e.g., 'Unknown')
titanic['embarked'] = titanic['embarked'].fillna('Unknown')

# Create a new feature by combining Pclass and Embarked
titanic['class_embarked'] = titanic['pclass'].astype(str) + "_" + titanic['embarked'].astype(str)

# Check the result
print(titanic[['pclass', 'embarked', 'class_embarked']].head())

   pclass embarked class_embarked
0       3  Unknown      3_Unknown
1       1  Unknown      1_Unknown
2       3  Unknown      3_Unknown
3       1  Unknown      1_Unknown
4       3  Unknown      3_Unknown
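
Counting the combined labels is a quick way to confirm that the placeholder appears only where the port was actually missing. The sketch below does this on a hypothetical `toy` frame with a single missing port.

```python
import pandas as pd

# Hypothetical rows; only the third one has no embarkation port
toy = pd.DataFrame({
    "pclass": [3, 1, 3, 1],
    "embarked": ["S", "C", None, "S"],
})

toy["embarked"] = toy["embarked"].fillna("Unknown")
toy["class_embarked"] = toy["pclass"].astype(str) + "_" + toy["embarked"].astype(str)

# Frequency of each class/port combination
counts = toy["class_embarked"].value_counts()
print(counts)
```

If `Unknown` dominates the counts on the real data, the `embarked` column is worth re-checking before the feature is used.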


## Section 3: Model Training and Evaluation

### 3.1 Split Data into Training and Test Sets
We split the dataset into training and test sets so that the model is trained on one portion and evaluated on data it has not seen. A held-out test set does not prevent overfitting by itself, but it exposes it: a model that scores much better on the training set than on the test set has overfit.


#### Check the Columns Available in the DataFrame

Before training, we inspect the columns in the DataFrame to make sure that the feature selection and the target variable refer to columns that actually exist:

In [34]:
print(titanic.columns)

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone', 'family_size', 'age_group', 'class_embarked'],
      dtype='object')
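
Several of these columns hold text, categories, or booleans, which scikit-learn estimators generally cannot consume without encoding. A minimal sketch of separating numeric from non-numeric columns with `select_dtypes` follows. The `toy` frame is a hypothetical stand-in with one column of each kind.

```python
import pandas as pd

# Hypothetical frame with integer, float, text, categorical, and boolean columns
toy = pd.DataFrame({
    "pclass": [3, 1],
    "age": [22.0, 38.0],
    "sex": ["male", "female"],
    "deck": pd.Categorical(["Unknown", "C"]),
    "alone": [False, True],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Needs encoding or conversion:", other_cols)
```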


#### Select Features and Target

The column listing shows that the dataset has no `name` column, so the only column we drop when building the feature matrix is `survived`, the target variable. Note that `alive` records the same outcome as text ("yes"/"no"); it must also be excluded from the features before any model is fitted, otherwise the model would be given the answer.

#### Script for Splitting Data

In [38]:
from sklearn.model_selection import train_test_split

# Features and target variable
X = titanic.drop(['survived'], axis=1)  # Excluding the target column
y = titanic['survived']

# Split into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of the resulting sets to confirm the split
print(f"Training data shape (X_train, y_train): {X_train.shape}, {y_train.shape}")
print(f"Test data shape (X_test, y_test): {X_test.shape}, {y_test.shape}")

Training data shape (X_train, y_train): (712, 17), (712,)
Test data shape (X_test, y_test): (179, 17), (179,)
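
A variation on this split is sketched below, under two assumptions that are not part of the original script. First, `alive` is dropped together with `survived`, because it encodes the same outcome. Second, `stratify=y` is passed so that both sets keep the same share of survivors. The `toy` frame is hypothetical and only serves to make the snippet runnable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 20 rows, half survived
toy = pd.DataFrame({
    "survived": [0, 1] * 10,
    "alive": ["no", "yes"] * 10,   # text copy of the target
    "pclass": [1, 2, 3, 1, 2] * 4,
    "fare": [float(i) for i in range(20)],
})

# Drop the target and its text duplicate from the features
X = toy.drop(columns=["survived", "alive"])
y = toy["survived"]

# stratify=y keeps the class proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Stratification matters most when the classes are imbalanced, because a purely random split can then leave the test set with a noticeably different survival rate from the training set.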
