# Titanic Survival Prediction Using Naive Bayes

Prepared by: ADRIAN BAHTIAR (000509890)

In this notebook, I use the Naive Bayes algorithm to predict Titanic passenger survival. I cover data preprocessing, model training, and evaluation, focusing on accuracy and other key metrics.

## Import Libraries

I start by importing the libraries needed for data manipulation, visualization, and modeling.

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
import warnings

warnings.filterwarnings('ignore')
plt.style.use('ggplot')


## Load the Data

Next, I'll load the Titanic dataset into a Pandas DataFrame for analysis.

In [2]:

# Load the data
titanic_data = pd.read_csv('titanic-dataset.csv')


## Initial Data Exploration

Before diving into preprocessing, I want to understand the structure, data types, and basic statistics of the dataset.

In [3]:

# Information about the dataset
# (info() prints its report and returns None, hence the leading None in the output)
data_info = titanic_data.info()

# Statistical summary
data_stats = titanic_data.describe()

# Check for missing values
missing_values = titanic_data.isnull().sum()

data_info, data_stats, missing_values


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Name         891 non-null    object 
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
 11  Survived     891 non-null    int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


(None,
        PassengerId      Pclass         Age       SibSp       Parch  \
 count   891.000000  891.000000  714.000000  891.000000  891.000000   
 mean    446.000000    2.308642   29.699118    0.523008    0.381594   
 std     257.353842    0.836071   14.526497    1.102743    0.806057   
 min       1.000000    1.000000    0.420000    0.000000    0.000000   
 25%     223.500000    2.000000   20.125000    0.000000    0.000000   
 50%     446.000000    3.000000   28.000000    0.000000    0.000000   
 75%     668.500000    3.000000   38.000000    1.000000    0.000000   
 max     891.000000    3.000000   80.000000    8.000000    6.000000   
 
              Fare    Survived  
 count  891.000000  891.000000  
 mean    32.204208    0.383838  
 std     49.693429    0.486592  
 min      0.000000    0.000000  
 25%      7.910400    0.000000  
 50%     14.454200    0.000000  
 75%     31.000000    1.000000  
 max    512.329200    1.000000  ,
 PassengerId      0
 Name             0
 Pclass       

## Data Preprocessing

This section covers dropping unused columns, imputing missing values, one-hot encoding the categorical features, and scaling the numeric features.

### Drop Insignificant Columns

I'll remove columns that are not likely to contribute to the predictive model. These include 'PassengerId', 'Name', 'Ticket', and 'Cabin'.

In [4]:

# Drop columns that are not significant for the analysis
# (columns= already selects the axis, so axis=1 is not needed)
titanic_data.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)


### Handle Missing Values

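Missing values have to be handled before modeling, because scikit-learn's GaussianNB does not accept NaN inputs. I'll impute the missing 'Age' values (177 of 891 rows) with the median and the two missing 'Embarked' values with the most frequent value.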

In [5]:

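# Impute missing 'Age' with the median
# (assign back instead of inplace=True on a column selection, which recent pandas versions warn about)
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())

# Impute missing 'Embarked' with the mode
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0])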

# Check for any more missing values
titanic_data.isnull().sum()


Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
Survived    0
dtype: int64
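
As a quick illustration of median and mode imputation, here is a minimal sketch on a small made-up DataFrame (the `toy` data and its values are hypothetical, not taken from the Titanic file). It assigns the filled column back to the DataFrame rather than filling in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing Age and one missing Embarked value.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Embarked": ["S", "C", None, "S"],
})

# median() skips NaN, so it is computed over the observed ages only.
age_median = toy["Age"].median()
# mode() returns a Series of the most frequent values; take the first one.
embarked_mode = toy["Embarked"].mode()[0]

# Assign the filled columns back to the DataFrame.
toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)

print(toy)
print(toy.isnull().sum())
```

The median is less sensitive to extreme ages than the mean, which is one reason it is a common choice for a skewed numeric column.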

### One-Hot Encoding

I'll convert the categorical variables 'Pclass', 'Sex', and 'Embarked' into numeric indicator columns using one-hot encoding, dropping the first level of each variable so that no indicator column is redundant.

In [6]:

# Perform one-hot encoding on categorical variables
titanic_data = pd.get_dummies(titanic_data, columns=['Pclass', 'Sex', 'Embarked'], drop_first=True)
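
To make the effect of `drop_first=True` concrete, here is a small sketch on made-up rows (the `toy` DataFrame is hypothetical). Each categorical variable with k levels becomes k-1 indicator columns, and the dropped level is implied when all of its indicators are zero.

```python
import pandas as pd

# Hypothetical toy rows covering every level of each categorical variable.
toy = pd.DataFrame({
    "Pclass": [1, 2, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, 38.0, 26.0],
})

# The first level in sorted order (Pclass 1, Sex 'female', Embarked 'C') is dropped.
encoded = pd.get_dummies(toy, columns=["Pclass", "Sex", "Embarked"], drop_first=True)

print(sorted(encoded.columns))
```

On this toy data the result keeps 'Age' and adds 'Pclass_2', 'Pclass_3', 'Sex_male', 'Embarked_Q', and 'Embarked_S'.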


### Feature Scaling

I'll standardize the 'Age' and 'Fare' features so that each has zero mean and unit variance.

In [7]:

from sklearn.preprocessing import StandardScaler

# Initialize the Standard Scaler
scaler = StandardScaler()

# Columns to scale
scale_cols = ['Age', 'Fare']

# Perform feature scaling
titanic_data[scale_cols] = scaler.fit_transform(titanic_data[scale_cols])
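
Note that the cell above fits the scaler on the full dataset, before the train/test split happens later in the notebook, so test-set statistics influence the scaling. A common alternative is to fit the scaler on the training rows only and reuse it on the test rows. The sketch below shows that pattern on synthetic data; the arrays `X_demo` and `y_demo` and the name `demo_scaler` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two numeric features (illustrative values only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[30.0, 32.0], scale=[14.0, 50.0], size=(200, 2))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows play no part in fitting the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean and std from training rows
X_te_scaled = demo_scaler.transform(X_te)      # apply the same mean and std to test rows

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is expected.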


## Model Training and Evaluation

Finally, I'll train a Gaussian Naive Bayes classifier to predict survival, then evaluate its performance using accuracy, the confusion matrix, and the classification report.

I use accuracy as my primary evaluation metric. Accuracy is the ratio of correctly predicted instances to the total number of instances in the test set. It's a good starting point for assessing the model, but accuracy alone may not give the full picture, especially when the classes are imbalanced (about 38% of the passengers in this dataset survived). That's why I also look at the confusion matrix and the classification report.
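
To see why accuracy can mislead on imbalanced classes, consider a small made-up example (the labels below are hypothetical, not Titanic data). A predictor that always outputs the majority class scores high accuracy while never finding a single positive case.

```python
import numpy as np

# Hypothetical labels: nine negatives and one positive.
y_true_demo = np.array([0] * 9 + [1])
# A trivial "model" that always predicts the majority class.
y_pred_demo = np.zeros(10, dtype=int)

# Accuracy: fraction of predictions that match the true labels.
demo_accuracy = (y_true_demo == y_pred_demo).mean()

# Recall for the positive class: fraction of true positives that were found.
true_pos = ((y_true_demo == 1) & (y_pred_demo == 1)).sum()
demo_recall = true_pos / (y_true_demo == 1).sum()

print("accuracy:", demo_accuracy)
print("recall (class 1):", demo_recall)
```

Here the accuracy is 0.9 but the recall for the positive class is 0.0, which is the kind of gap the confusion matrix and classification report expose.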

In [11]:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Splitting the data
X = titanic_data.drop('Survived', axis=1)
y = titanic_data['Survived']

# 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Naive Bayes Classifier
gnb = GaussianNB()

# Fit the model
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)



In [16]:

print('accuracy:', accuracy)


accuracy: 0.7835820895522388


In [17]:

print('conf_matrix:', conf_matrix)
print('class_report:', class_report)

conf_matrix: [[128  29]
 [ 29  82]]
class_report:               precision    recall  f1-score   support

           0       0.82      0.82      0.82       157
           1       0.74      0.74      0.74       111

    accuracy                           0.78       268
   macro avg       0.78      0.78      0.78       268
weighted avg       0.78      0.78      0.78       268
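
The figures in the report can be re-derived from the confusion matrix alone. The sketch below hard-codes the matrix printed above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true, columns = predicted.
conf = np.array([[128, 29],
                 [29, 82]])
tn, fp, fn, tp = conf.ravel()

# Accuracy: correct predictions over all test instances.
acc = (tn + tp) / conf.sum()

# Precision and recall for class 1 (survived).
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Baseline: accuracy of always predicting class 0 on this test set.
majority_baseline = conf[0].sum() / conf.sum()

print(f"accuracy={acc:.4f} precision_1={precision_1:.2f} "
      f"recall_1={recall_1:.2f} f1_1={f1_1:.2f} baseline={majority_baseline:.3f}")
```

The accuracy of about 0.78 compares with roughly 0.59 for a predictor that always outputs class 0, so the model does clearly better than the majority-class baseline on this split.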

