# ***Titanic Survival Prediction***

---



# **Data Preprocessing**

Data preprocessing is crucial for preparing the dataset for modeling. This step involves handling missing values, dealing with outliers, converting categorical variables into numerical format, etc. Data cleaning ensures that the dataset is ready for analysis and modeling.



## Importing the libraries

We start by importing the libraries used throughout the notebook.

In [361]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Importing the dataset

Here, we load the dataset from a CSV file into a pandas DataFrame.



In [362]:
ds=pd.read_csv("Titanic-Dataset.csv")

A first look at the data, followed by summary statistics and column types:



In [363]:
ds.head(n=10)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Cabin,Survived
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C85,1
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,C123,1
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,,0
5,6,3,"Moran, Mr. James",male,,0,0,330877,8.4583,Q,,0
6,7,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S,E46,0
7,8,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,S,,0
8,9,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,S,,1
9,10,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,C,,1


In [364]:
ds.describe(include='all')

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Cabin,Survived
count,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,889,204,891.0
unique,,,891,2,,,,681.0,,3,147,
top,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,S,B96 B98,
freq,,,1,577,,,,7.0,,644,4,
mean,446.0,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,,0.383838
std,257.353842,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,,0.486592
min,1.0,1.0,,,0.42,0.0,0.0,,0.0,,,0.0
25%,223.5,2.0,,,20.125,0.0,0.0,,7.9104,,,0.0
50%,446.0,3.0,,,28.0,0.0,0.0,,14.4542,,,0.0
75%,668.5,3.0,,,38.0,1.0,0.0,,31.0,,,1.0


In [365]:
ds.dtypes

Unnamed: 0,0
PassengerId,int64
Pclass,int64
Name,object
Sex,object
Age,float64
SibSp,int64
Parch,int64
Ticket,object
Fare,float64
Embarked,object


## Taking care of missing data


Let's check for missing values in the dataset.

In [366]:
missing_values = ds.isnull().sum()
missing_percentage = (ds.isnull().sum() / len(ds)) * 100
missing = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
print(missing)

             Missing Values  Percentage
PassengerId               0    0.000000
Pclass                    0    0.000000
Name                      0    0.000000
Sex                       0    0.000000
Age                     177   19.865320
SibSp                     0    0.000000
Parch                     0    0.000000
Ticket                    0    0.000000
Fare                      0    0.000000
Embarked                  2    0.224467
Cabin                   687   77.104377
Survived                  0    0.000000


About 77% of the values in the 'Cabin' column are missing, so we will drop this column.

About 20% of the values in the 'Age' column are missing. We will fill them with the median age.

'Embarked' is a categorical variable with only two missing values. We will fill them with the mode, which is the most frequent value in the column.

In [367]:
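# Drop 'Cabin': about 77% of its values are missing.
ds = ds.drop(columns='Cabin')

# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which recent pandas versions deprecate.
ds['Age'] = ds['Age'].fillna(ds['Age'].median())

ds['Embarked'] = ds['Embarked'].fillna(ds['Embarked'].mode()[0])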
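
The same two imputation rules can be checked on a tiny example. This is a minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not taken from the Titanic file); it only shows that the median and mode fills behave as described.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in a numeric and a categorical column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# The median of the observed ages (22, 26, 38) replaces the missing ages.
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

# mode() returns a Series because ties are possible; [0] takes the first mode.
embarked_mode = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)

print(toy)
```

The median is used for 'Age' because it is less sensitive to extreme values than the mean.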

Now let us check if the missing values have been taken care of.

In [368]:
missing_values = ds.isnull().sum()
missing_percentage = (ds.isnull().sum() / len(ds)) * 100
missing = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
print(missing)

             Missing Values  Percentage
PassengerId               0         0.0
Pclass                    0         0.0
Name                      0         0.0
Sex                       0         0.0
Age                       0         0.0
SibSp                     0         0.0
Parch                     0         0.0
Ticket                    0         0.0
Fare                      0         0.0
Embarked                  0         0.0
Survived                  0         0.0


Since there are no more missing values, we can move on to the next step.

## Encoding categorical data

First, we split the dataset into the independent variable matrix (X) and the dependent variable vector (y).
y holds the target variable, 'Survived'.
X should hold the variables in the dataset that help the model predict the target variable.
X therefore consists of all the remaining columns except 'Name' and 'Ticket', which are unlikely to yield useful information as they stand. Note that this leaves 'PassengerId', a row identifier, in X.

In [369]:
ds=ds.drop(columns=['Name','Ticket'])
X=ds.iloc[:,0:-1].values
y=ds.iloc[:,-1].values

Independent Variable Matrix

In [370]:
Xprint=pd.DataFrame(X)
Xprint.columns=['PassengerId','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
Xprint.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,3,male,22.0,1,0,7.25,S
1,2,1,female,38.0,1,0,71.2833,C
2,3,3,female,26.0,0,0,7.925,S
3,4,1,female,35.0,1,0,53.1,S
4,5,3,male,35.0,0,0,8.05,S


Target Variable

In [371]:
yprint=pd.DataFrame(y)
yprint.columns=['Survived']
yprint.head()

Unnamed: 0,Survived
0,0
1,1
2,1
3,1
4,0


### Encoding the independent variable matrix

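Now we transform the categorical variables ('Sex' and 'Embarked') into numerical form with one-hot encoding, since the ML algorithms used below require numeric input. The encoded columns come first in the output, followed by the remaining columns passed through unchanged.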

In [372]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct=ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop="first"),[2,-1])],remainder='passthrough')
X=np.array(ct.fit_transform(X))

In [373]:
print(pd.DataFrame(X).head())

     0    1    2  3  4     5  6  7        8
0  1.0  0.0  1.0  1  3  22.0  1  0     7.25
1  0.0  0.0  0.0  2  1  38.0  1  0  71.2833
2  0.0  0.0  1.0  3  3  26.0  0  0    7.925
3  0.0  0.0  1.0  4  1  35.0  1  0     53.1
4  1.0  0.0  1.0  5  3  35.0  0  0     8.05


## Splitting the data into test set and training set

In [374]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20)
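
The split above is random, so the scores reported later will vary from run to run. As a hedged sketch (not what the notebook ran), passing `random_state` makes the split repeatable, and `stratify` keeps the class ratio similar in both sets. The toy arrays and the seed value 42 are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 38 of them positive.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 38 + [0] * 62)

# random_state fixes the shuffle; stratify preserves the class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

# Running the same call again yields the identical split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```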

## Feature Scaling

In [375]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train= sc.fit_transform(X_train)
X_test=sc.transform(X_test)
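
The scaler is fitted on the training set only and then reused on the test set, so no information from the test set leaks into the preprocessing. A minimal sketch on hypothetical one-column arrays illustrates what this means for the numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, three training rows, one test row.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1, while the test value is expressed relative to the training statistics.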

# Model Training

With the preprocessed dataset, we now train several classification models (logistic regression, decision tree, random forest, K-NN, naive Bayes, and SVM) and evaluate each one on the test set using accuracy, precision, recall, and F1-score.
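
The sections below repeat the same fit, predict, and score steps for each model. As an illustrative sketch only, the same workflow can be written as a loop over a dictionary of estimators. The synthetic data from `make_classification`, the variable names, and the seeds are assumptions for this example and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the Titanic dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Scale using statistics from the training set only.
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
}

# Fit each model and collect its test-set scores.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    scores[name] = {
        "accuracy": accuracy_score(y_te, y_pred),
        "f1": f1_score(y_te, y_pred),
    }

for name, result in scores.items():
    print(name, result)
```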

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
classifier_logistic=LogisticRegression()
classifier_logistic.fit(X_train, y_train)

Predicting Test Set Results

In [377]:
y_test_pred_lr=classifier_logistic.predict(X_test)

Confusion Matrix

In [378]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
cm=confusion_matrix(y_test,y_test_pred_lr)
print(cm)

[[91 15]
 [22 51]]


Accuracy Score

In [379]:
accuracy_score(y_test,y_test_pred_lr)

0.7932960893854749

Other Model Performance Metrics

In [380]:
print(classification_report(y_test, y_test_pred_lr))

              precision    recall  f1-score   support

           0       0.81      0.86      0.83       106
           1       0.77      0.70      0.73        73

    accuracy                           0.79       179
   macro avg       0.79      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179



## Decision Tree Classification

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier_decisiontree=DecisionTreeClassifier(criterion="entropy")
classifier_decisiontree.fit(X_train, y_train)

Predicting Test Set Results

In [382]:
y_test_pred_dt=classifier_decisiontree.predict(X_test)

Confusion Matrix

In [383]:
cm=confusion_matrix(y_test,y_test_pred_dt)
print(cm)

[[83 23]
 [25 48]]


Accuracy Score

In [384]:
accuracy_score(y_test,y_test_pred_dt)

0.7318435754189944

Other Model Performance Metrics

In [385]:
print(classification_report(y_test, y_test_pred_dt))

              precision    recall  f1-score   support

           0       0.77      0.78      0.78       106
           1       0.68      0.66      0.67        73

    accuracy                           0.73       179
   macro avg       0.72      0.72      0.72       179
weighted avg       0.73      0.73      0.73       179



## Random Forest Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier_rf = RandomForestClassifier(n_estimators=100, criterion='entropy')
classifier_rf.fit(X_train, y_train)

Predicting Test Set Results

In [387]:
y_test_pred_rf=classifier_rf.predict(X_test)

Confusion Matrix

In [388]:
cm=confusion_matrix(y_test,y_test_pred_rf)
print(cm)

[[97  9]
 [21 52]]


Accuracy Score

In [389]:
accuracy_score(y_test, y_test_pred_rf)

0.8324022346368715

Other Performance Metrics

In [390]:
print(classification_report(y_test,y_test_pred_rf))

              precision    recall  f1-score   support

           0       0.82      0.92      0.87       106
           1       0.85      0.71      0.78        73

    accuracy                           0.83       179
   macro avg       0.84      0.81      0.82       179
weighted avg       0.83      0.83      0.83       179



## K-Nearest Neighbors (K-NN) Algorithm

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier_knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier_knn.fit(X_train, y_train)

Predicting Test Set Results

In [392]:
y_test_pred_knn=classifier_knn.predict(X_test)

Confusion Matrix

In [393]:
cm=confusion_matrix(y_test,y_test_pred_knn)
print(cm)

[[95 11]
 [22 51]]


Accuracy Score

In [394]:
accuracy_score(y_test,y_test_pred_knn)

0.8156424581005587

Other Model Performance Metrics

In [395]:
print(classification_report(y_test, y_test_pred_knn))

              precision    recall  f1-score   support

           0       0.81      0.90      0.85       106
           1       0.82      0.70      0.76        73

    accuracy                           0.82       179
   macro avg       0.82      0.80      0.80       179
weighted avg       0.82      0.82      0.81       179



## Naive Bayes Classification

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier_nb= GaussianNB()
classifier_nb.fit(X_train,y_train)

Predicting Test Set Results

In [397]:
y_test_pred_nb= classifier_nb.predict(X_test)

Confusion Matrix

In [398]:
cm=confusion_matrix(y_test, y_test_pred_nb)
print(cm)

[[90 16]
 [19 54]]


Accuracy Score

In [399]:
accuracy_score(y_test,y_test_pred_nb)

0.8044692737430168

Other Performance Metrics

In [400]:
print(classification_report(y_test, y_test_pred_nb))

              precision    recall  f1-score   support

           0       0.83      0.85      0.84       106
           1       0.77      0.74      0.76        73

    accuracy                           0.80       179
   macro avg       0.80      0.79      0.80       179
weighted avg       0.80      0.80      0.80       179



## Support Vector Machine (SVM) Algorithm

In [None]:
from sklearn.svm import SVC
classifier_svm=SVC(kernel='rbf')
classifier_svm.fit(X_train,y_train)

Predicting Test Set Results

In [402]:
y_test_pred_svm=classifier_svm.predict(X_test)

Confusion Matrix

In [403]:
cm = confusion_matrix(y_test, y_test_pred_svm)
print(cm)

[[98  8]
 [21 52]]


Accuracy Score

In [404]:
accuracy_score(y_test,y_test_pred_svm)

0.8379888268156425

Other Performance Metrics

In [405]:
print(classification_report(y_test, y_test_pred_svm))

              precision    recall  f1-score   support

           0       0.82      0.92      0.87       106
           1       0.87      0.71      0.78        73

    accuracy                           0.84       179
   macro avg       0.85      0.82      0.83       179
weighted avg       0.84      0.84      0.83       179



# Conclusion

The best results on this test split were obtained with the **SVM algorithm**, with an accuracy of **84%**.
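
For reference, the accuracy scores reported in the sections above can be gathered into one ranked table. This sketch simply re-enters those values (rounded to four decimals) by hand; it does not recompute anything.

```python
import pandas as pd

# Test-set accuracies copied from the outputs above, rounded to four decimals.
accuracies = pd.Series(
    {
        "Logistic Regression": 0.7933,
        "Decision Tree": 0.7318,
        "Random Forest": 0.8324,
        "K-NN": 0.8156,
        "Naive Bayes": 0.8045,
        "SVM": 0.8380,
    },
    name="Accuracy",
).sort_values(ascending=False)

print(accuracies)
```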

For the 'survived' class, the precision of the model is 87%, which means that of all the passengers the model predicted would survive, 87% actually did.

The recall for this class is 71%, which means that of all the passengers who actually survived, the SVM model correctly identified 71%.

The F1 score for this class, which combines precision and recall, is 0.78. This suggests that the model does a decent job of predicting whether or not a passenger survived, although it still misses a noticeable share of actual survivors.
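
These figures can be derived by hand from the SVM confusion matrix printed earlier. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

# SVM confusion matrix from the output above: rows = actual, columns = predicted.
cm = np.array([[98, 8],
               [21, 52]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # predicted survivors who actually survived
recall = tp / (tp + fn)      # actual survivors the model identified
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

Rounded to two decimals, the results agree with the classification report for class 1 and with the overall accuracy.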