# Titanic Dataset Prediction

# Introduction

The Titanic dataset is a famous dataset that contains the demographic information of the passengers including age, gender, and class, as well as information about the passenger's tickets and cabins.
The main goal of this project is to predict whether the passenger survived or not considering various characteristics by taking the target variable as survived from the dataset. The dataset is split into a training set and a testing set, and machine learning algorithms are used to predict survival on the testing set based on the training set.
The relevance of the Titanic survival prediction problem lies in the insights it can provide into the factors that influenced survival rates during the disaster by analysing the dataset and building predictive models.

# Literature Review

* In 2014 authors J. Wijaya and J. T. Agee published a paper "Machine Learning Techniques for Predictive Maintenance of Shipboard Systems," for titanic survival prediction using decision trees, random forests, and support vector machines, and found that random forests outperformed the other algorithms with an accuracy of 80.36%.
* In 2015 authors M. Manikandan and K. Balamurugan published a paper "Predicting Survival on the Titanic: A Comparison of Machine Learning Techniques," using decision trees, k-nearest neighbors, and logistic regression, in predicting the survival of passengers and got accuarcy of 80.58%.
* In 2017 author Adolfo Alvarez published a paper "Exploring the Titanic Dataset with R," using various statistical techniques in R and found that being female and having a higher cabin class were associated with higher survival rates.
* In 2019 authors O. Nedelcu and A. Ionescu published a paper "Titanic: Machine Learning from Disaster,"using logistic regression, support vector machines, and neural networks and ound that neural networks achieved the highest accuracy, with an accuracy of 79.9%.
* In 2020 authors S. Mishra and S. Singh a paper "Survival Prediction of Titanic Passengers Using Data Mining Techniques," using decision trees, random forests, and gradient boosting, and found that gradient boosting outperformed the other algorithms with an accuracy of 81.39%.

While different studies have found varying levels of accuracy in predicting the survival of passengers, most studies agree that factors such as gender and cabin class are important predictors of survival.

# 1. Importing Main Libraries

First step is importing necessary packages and libraries for data analysis and machine learning

In [None]:
# importing Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

#importing classes and functions
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# importing Machine learning algorithms

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


The packages imported are:
* pandas: for data manipulation and analysis
* numpy: for numerical operations
* matplotlib: for data visualization
* seaborn: for data visualization
* sklearn.datasets: for loading datasets for machine learning

The classes and functions imported from sklearn are various machine learning algorithms, evaluation metrics, and techniques for data preprocessing:
* train_test_split: for splitting the data into training and testing sets
* accuracy_score: for calculating the accuracy of the model
* precision_score: for calculating the precision of the model
* recall_score: for calculating the recall of the model
* f1_score: for calculating the F1 score of the model
* confusion_matrix: for creating a confusion matrix of the model's performance

The machine learning algorithms imported are:
* GaussianNB: Gaussian Naive Bayes algorithm
* LogisticRegression: Logistic Regression algorithm
* SVC: Support Vector Machine algorithm
* LinearSVC: Linear Support Vector Machine algorithm
* Perceptron: Perceptron algorithm
* DecisionTreeClassifier: Decision Tree algorithm
* RandomForestClassifier: Random Forest algorithm
* KNeighborsClassifier: K-Nearest Neighbors algorithm
* SGDClassifier: Stochastic Gradient Descent algorithm
* GradientBoostingClassifier: Gradient Boosting algorithm.


# 2. Data Preprocessing
Import and read the train and test titanic dataset.

In [None]:
#Import and read the train and test titanic dataset.
train=pd.read_csv('/kaggle/input/titanic-train/Titanic_train.csv')
test=pd.read_csv('/kaggle/input/titanic-test/Titanic_test.csv')


In [None]:
train.head()

train.head() : Shows first five coloums by default and if we required particular count of coloumns can give train.head(coloumn_number).

In [None]:
train.tail(3)

train.tail(3) gives the last 3 coloums.

In [None]:
train.describe()

train.describe() is a method in Python that is used to generate a statistical summary of a DataFrame. It provides a quick overview of the numerical variables in the dataset, including the count, mean, standard deviation, minimum, and maximum values, as well as the quartile values (25%, 50%, and 75%).

In [None]:
train.info()

train.info() is a method in Python that is used to display a concise summary of a DataFrame, including the data types of each column, the number of non-null values, and the memory usage of the DataFrame.

In [None]:
train.columns

train.columns is an attribute in Python that is used to return the column labels of a DataFrame in a array like data structure.

* PassengerId: The unique identifier assigned to each passenger.
* Survived: Whether the passenger survived (1) or not (0)
* Pclass: The passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
* Name: The passenger's name
* Sex: The passenger's sex
* Age: The passenger's age in years
* SibSp: The number of siblings/spouses the passenger had onboard
* Parch: The number of parents/children the passenger had onboard
* Ticket: The passenger's ticket number
* Fare: The fare paid by the passenger
* Cabin: The cabin number assigned to the passenger
* Embarked: The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) 


In [None]:
train.shape

train.shape is an attribute in python used to return a tuple with first integer represents total number of rows and second integer represents the total number of coloumns.

In [None]:
test.shape

# 3. Data Cleaning

In [None]:
train.isnull().sum()

From the above summary it is clear that there is 177 age, 687 Cabin and 2 Embarked fields are missing in total.

In [None]:
test.isnull().sum()

From the above summary it is clear that there is 86 age, 327 Cabin and 1 Fare fields are missing in total.

# 3.1 Handling with missing coloumns.

Step 1: Dropping Cabin columns as this is not important for survival prediction.

In [None]:
train_df = train.drop(columns='Cabin', axis=1,inplace =True)

In [None]:
test_df = test.drop(columns='Cabin', axis=1,inplace =True)

Step 2: Replacing the missing values in the “Age” column with the mean value.

In [None]:
train['Age'].fillna(train['Age'].mean(), inplace=True)

In [None]:
test['Age'].fillna(test['Age'].mean(), inplace=True)

Step 3: Finding the mode value of the “Embarked” column as it will have occurred the maximum number of times and replacing the missing values in the “Embarked” column with mode value.

In [None]:
print(train['Embarked'].mode())
train['Embarked'].fillna(train['Embarked'].mode()[0], inplace=True)

Step 4 : Now replacing Fare missing value using Imputation technique Mean where we can use the mean fare value of the non-missing values to replace the missing fare values.can use the mean fare value of the non-missing values to replace the missing fare values.

In [None]:
mean_fare = test['Fare'].mean()
test['Fare'].fillna(mean_fare, inplace=True)

Step 5: checking the null function again to see all null values removed and cleaned succesfully.

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

# 3.2 Removing unnecessary columns
Now we can remove few more coloumns sucha as passengerId,Name and ticket that is not considered for survival prediction

In [None]:
train.drop(['Name', 'PassengerId', 'Ticket'], axis = 1, inplace = True)

In [None]:
test.drop(['Name', 'PassengerId', 'Ticket'], axis = 1, inplace = True)

In [None]:
train.columns

In [None]:
test.columns

# 4. Data Normalization

Converting sex and Embarked columns into numerical 

In [None]:
train.replace({'Sex':{'male':0,'female':1}, 'Embarked':{'S':0,'C':1,'Q':2}}, inplace=True)

In [None]:
test.replace({'Sex':{'male':0,'female':1}, 'Embarked':{'S':0,'C':1,'Q':2}}, inplace=True)

# 5. Data Visualization

# 5.1 Survival rate based on Gender

In [None]:
sns.barplot(x='Sex', y='Survived', data=train)
plt.xlabel('Gender')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Gender')
plt.show()

As predicted, females have a much higher chance of survival than males. 

# 5.2 Survival rate based on Pclass

In [None]:
# Calculate survival rate by Pclass
survival_rates = train.groupby('Pclass')['Survived'].mean().reset_index()
# Create a pie chart of survival rates
plt.pie(survival_rates['Survived'], labels=survival_rates['Pclass'], autopct='%1.1f%%')

As predicted, people with higher socioeconomic class had a higher rate of survival.

# 5.3 Survival Rate based on SibSp

In [None]:
sns.lineplot(x="SibSp", y="Survived", data=train)


People with no siblings or spouses were less to likely to survive than those with one or two. 

# 5.4 Survival Rate based on Embarked

In [None]:
sns.catplot(x="Embarked", y="Survived", data=train,  kind='point')
plt.show()

Passengers who embarked at Cherbourg(1) had a higher survival rate than those who embarked at Southampton(0) or Queenstown(2).

# 5.5 Histogram of Passenger Ages

In [None]:
sns.histplot(train['Age'], kde=False, bins=30)
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Passenger Ages')
plt.show()

The resulting histogram shows the distribution of passenger ages in the dataset, where the majority of passengers were between 20 and 40 years old. There were also a significant number of passengers under 20 years old and a smaller number of passengers over 60 years old.

# 5.6 Scatter plot for Age vs Fare.

In [None]:
plt.scatter(train['Age'], train['Fare'], c=train['Survived'], cmap='cool')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs Fare (Survived)')
plt.colorbar(label='Survived')
plt.show()


The resulting scatter plot shows the relationship between 'Age' and 'Fare', where passengers who paid higher fares tend to be older. The plot also shows that there is no clear relationship between 'Age' and 'Survived', and that the survival rate is higher for passengers who paid higher fares.

# 5.7 Heat Map

In [None]:
sns.heatmap(train.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

This code creates a heatmap showing the correlation between different features in the dataset. It can be useful for identifying which features are most strongly correlated with survival.

For example, if we observe a dark red square between the "Sex" feature and the "Survived" feature, it means that there is a strong positive correlation between being female and surviving the Titanic disaster.

Similarly, if we observe a dark blue square between the "Pclass" feature and the "Survived" feature, it means that there is a negative correlation between being in a lower passenger class and surviving the Titanic disaster.

The features that are most strongly correlated with survival are "Sex", "Pclass", "Age", "Fare" and "Embarked". These features can be used to build a predictive model that can be used to predict the survival of passengers in future Titanic-like scenarios.

# 5.8 Pair Plot

In [None]:
sns.pairplot(train, hue="Survived", palette="Set1")

The resulting pairplot shows the pairwise relationships between the remaining columns in the dataset, and how they relate to survival.

# 6. Data Analyses

Here we have analyzed the data by checking the total number of passengers, the median age of passengers, the survival rate, and the mean fare paid by passengers.

In [None]:
# Get the total number of passengers
num_passengers = len(train)

# Calculate the survival rate
survival_rate = train["Survived"].mean()

# Calculate the median age of passengers
median_age = train["Age"].median()

# Calculate the mean fare paid by passengers
mean_fare = train["Fare"].mean()

# Print the results
print("Total number of passengers:", num_passengers)
print("Survival rate:", survival_rate)
print("Median age of passengers:", median_age)
print("Mean fare paid by passengers:", mean_fare)

print(train['Sex'].value_counts())
print(train['Embarked'].value_counts())

# Analyze numerical features
print(train[['Age', 'Fare']].describe())


# 7. Feature Extraction 

# 7.1 AgeGroup
Sort the ages into logical categories and created a new field in train data set as AgeGroup for prediction which age group had survived the most.

In [None]:
bins =  [0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age"], bins, labels = labels)
test['AgeGroup'] = pd.cut(test["Age"], bins, labels = labels)

#draw a bar plot of Age vs. survival
sns.barplot(x="AgeGroup", y="Survived", data=train)
plt.show()

Babies are more likely to survive than any other age group.

In [None]:
age_title_mapping = {1: "Young Adult", 2: "Student", 3: "Adult", 4: "Baby", 5: "Adult", 6: "Adult"}

for x in range(len(train["AgeGroup"])):
    if train["AgeGroup"][x] == "Unknown":
        train["AgeGroup"][x] = age_title_mapping[train["Title"][x]]
        
for x in range(len(test["AgeGroup"])):
    if test["AgeGroup"][x] == "Unknown":
        test["AgeGroup"][x] = age_title_mapping[test["Title"][x]] 
        
age_mapping = {'Baby': 1, 'Child': 2, 'Teenager': 3, 'Student': 4, 'Young Adult': 5, 'Adult': 6, 'Senior': 7}
train['AgeGroup'] = train['AgeGroup'].map(age_mapping)
test['AgeGroup'] = test['AgeGroup'].map(age_mapping)

train.head()
test.head()


# 7.2 FamilySize

This line of code creates a new feature called 'FamilySize' in the Titanic train dataset, which represents the total number of family members (siblings, spouses, parents, and children) that a passenger has onboard the Titanic, including themselves. It is calculated by adding the 'SibSp' (number of siblings/spouses) feature and the 'Parch' (number of parents/children) feature, and adding 1 to include the passenger themselves.

For example, if a passenger has 1 sibling/spouse and 2 parents/children onboard, their 'FamilySize' would be 1+2+1=4.

This new feature can be useful in predicting survival on the Titanic, as passengers who are traveling with family members may have had different chances of survival compared to those who were traveling alone.

In [None]:
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
test['FamilySize'] = test['SibSp'] + test['Parch'] + 1

In [None]:
sns.barplot(x="FamilySize", y="Survived", data=train)
plt.show()

# 8. Choosing the best Models

# 8.1 Splitting the training data

Typically, we split the dataset into two subsets: a training set and a testing set. The training set is used to train the model, and the testing set is used to evaluate the performance of the model on new, unseen data.

To split the Titanic dataset, we can use Python's scikit-learn library

In [None]:
# Separate the target variable (Survived) from the input variables
x = train.drop(['Survived'], axis=1)
y = train["Survived"]

# Split the data into a training set and a testing set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.22, random_state=0)

In the above code, we first separate the target variable (Survived) from the input variables using the drop method. We then use the train_test_split function to split the data into a training set and a testing set. The test_size parameter specifies the percentage of the data that should be allocated to the testing set, and the random_state parameter ensures that the split is reproducible.

# 8.2 Testing with different Clustering Mdels.

* K-means
* Hierarchical clustering
* DBSCAN

# K-means

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate some data
x, y = make_blobs(n_samples=500, centers=4, random_state=42)

# Try different number of clusters
#for n_clusters in range(2, 8): --> best model found with cluster number 4

# Create a KMeans model with n_clusters
kmeans = KMeans(n_clusters=4, random_state=42)

# Fit the model to the data
kmeans.fit(x)

# Compute the silhouette score
silhouette_avg_kmeans = round(silhouette_score(x, kmeans.labels_)*100, 2)
print("The average silhouette_score is :", silhouette_avg_kmeans)

# Visualize the clusters
plt.scatter(x[:, 0], x[:, 1], c=kmeans.labels_)
plt.show()

In [None]:
# elbow method to determine the optimal k
distortions = []
for i in range(1,11):
    km = KMeans(n_clusters=i)
    km.fit(x_train)
    distortions.append(km.inertia_)
    
# import libraries
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# model training
SVCmodel = SVC(kernel='linear')
SVCmodel.fit(x_train, y_train)
# plot elbow curve
K = range(1, 11)
plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()

silhouette_score is found by first selection different cluster values and found cluster 4 has highest score and then using that the average score has been analysed.

# Hierarchical clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Performing hierarchical clustering
agg_clustering = AgglomerativeClustering(n_clusters=3).fit(x)

# Computing silhouette score
silhouette_avg_h = round(silhouette_score(x, agg_clustering.labels_)*100 ,2)
print("The average silhouette_score is :", silhouette_avg_h)

# Visualize the clusters
plt.scatter(x[:, 0], x[:, 1], c=agg_clustering.labels_)
plt.show()

# DBSCAN

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Performing DBSCAN clustering
dbscan = DBSCAN(eps=1, min_samples=3).fit(x)

# Computing silhouette score
silhouette_avg_dbscan = round(silhouette_score(x, dbscan.labels_)*100 ,2)
print("The average silhouette_score is :", silhouette_avg_dbscan)

# Visualize the clusters
plt.scatter(x[:, 0], x[:, 1], c=dbscan.labels_)
plt.show()

# 8.3 Testing with different Classification models

* Gaussian Naive Bayes
* Logistic Regression
* Support Vector Machines
* Perceptron
* Decision Tree Classifier
* Random Forest Classifier
* KNN or k-Nearest Neighbors
* Stochastic Gradient Descent
* Gradient Boosting Classifier

For each model, we set the model, fit it with 80% of our training data, predict for 20% of the training data and check the accuracy.

# Gaussian Naive Bayes
It's a probabilistic algorithm that makes predictions based on the probability of each input belonging to a certain class.

In [None]:
# Train the Gaussian Naive Bayes classifier
gaussian = GaussianNB()
gaussian.fit(x_train, y_train)

# Make predictions on the testing set
y_pred = gaussian.predict(x_test)

#KFOLD cross validation score
scores = cross_val_score(gaussian, x_train, y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
mean_gaussian =  round(scores.mean()* 100, 2)
print("Mean:", mean_gaussian)
print("Standard Deviation:", scores.std())

# Calculate the accuracy,confusion matrix,precesion,recall and f1 score of the model
acc_gaussian = round(accuracy_score(y_pred, y_test) * 100, 2)
cm_gaussian = confusion_matrix(y_test, y_pred) 
precision_gaussian = round(precision_score(y_test, y_pred) * 100, 2)
recall_gaussian = round(recall_score(y_test, y_pred) * 100, 2)
f1_gaussian = round(f1_score(y_test, y_pred) * 100, 2)

# print the results

print("Confusion matrix:\n", cm_gaussian)
print("Accuracy: {:.2f}".format(acc_gaussian))
print("Precision: {:.2f}".format(precision_gaussian))
print("Recall: {:.2f}".format(recall_gaussian))
print("F1 score: {:.2f}".format(f1_gaussian))

# Confusion matrix visualization

def confusionM_gaussian(y_true,y_predict,target_names):
    
#function for confusion matrix visualisation

    cMatrix = confusion_matrix(y_true,y_predict)
    df_cm = pd.DataFrame(cMatrix,index=target_names,columns=target_names)
    plt.figure(figsize = (6,4))
    cm = sns.heatmap(df_cm,annot=True,fmt="d")
    cm.yaxis.set_ticklabels(cm.yaxis.get_ticklabels(),rotation=90)
    cm.xaxis.set_ticklabels(cm.xaxis.get_ticklabels(),rotation=0)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# get the 3 class names
class_names = train.Survived.unique()

confusionM_gaussian(y_test,y_pred,class_names)

# Logistic Regression
Logistic Regression is a popular machine learning algorithm for classification tasks. It's a linear model that makes predictions based on the probability of each input belonging to a certain class.

In [None]:
# Train the Logistic Regression classifier
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(x_train, y_train)

# Make predictions on the testing set
y_pred = logreg.predict(x_test)

#KFOLD cross validation score
scores = cross_val_score(logreg, x_train, y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
mean_logreg =  round(scores.mean()* 100, 2)
print("Mean:", mean_logreg)
print("Standard Deviation:", scores.std())

# Calculate the accuracy,confusion matrix,precesion,recall and f1 score of the model
acc_logreg = round(accuracy_score(y_pred, y_test) * 100, 2)
cm_logreg = confusion_matrix(y_test, y_pred) 
precision_logreg = round(precision_score(y_test, y_pred) * 100, 2)
recall_logreg = round(recall_score(y_test, y_pred) * 100, 2)
f1_logreg = round(f1_score(y_test, y_pred) * 100, 2)

# print the results
print("Confusion matrix:\n", cm_logreg)
print("Accuracy: {:.2f}".format(acc_logreg))
print("Precision: {:.2f}".format(precision_logreg))
print("Recall: {:.2f}".format(recall_logreg))
print("F1 score: {:.2f}".format(f1_logreg))


# Confusion matrix visualization

def confusionM_logreg(y_true,y_predict,target_names):
    
#function for confusion matrix visualisation

    cMatrix = confusion_matrix(y_true,y_predict)
    df_cm = pd.DataFrame(cMatrix,index=target_names,columns=target_names)
    plt.figure(figsize = (6,4))
    cm = sns.heatmap(df_cm,annot=True,fmt="d")
    cm.yaxis.set_ticklabels(cm.yaxis.get_ticklabels(),rotation=90)
    cm.xaxis.set_ticklabels(cm.xaxis.get_ticklabels(),rotation=0)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# get the 3 class names
class_names = train.Survived.unique()

confusionM_logreg(y_test,y_pred,class_names)

# Support Vector Machines


SVM model comparison for the Titanic dataset involves comparing multiple SVM models with different hyperparameters, feature selections, or pre-processing techniques to find the best model for predicting whether a passenger survived or not.

In [None]:
# Train the support vector machines classifier
svc = SVC()
svc.fit(x_train, y_train)

# Make predictions on the testing set
y_pred = svc.predict(x_test)

#KFOLD cross validation score
scores = cross_val_score(svc, x_train, y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
mean_svc =  round(scores.mean()* 100, 2)
print("Mean:", mean_svc)
print("Standard Deviation:", scores.std())


# Calculate the accuracy,confusion matrix,precesion,recall and f1 score of the model
acc_svc = round(accuracy_score(y_pred, y_test) * 100, 2)
cm_svc = confusion_matrix(y_test, y_pred) 
precision_svc = round(precision_score(y_test, y_pred) * 100, 2)
recall_svc = round(recall_score(y_test, y_pred) * 100, 2)
f1_svc = round(f1_score(y_test, y_pred) * 100, 2)

# print the results
print("Confusion matrix:\n", cm_svc)
print("Accuracy: {:.2f}".format(acc_svc))
print("Precision: {:.2f}".format(precision_svc))
print("Recall: {:.2f}".format(recall_svc))
print("F1 score: {:.2f}".format(f1_svc))

# Confusion matrix visualization

def confusionM_svc(y_true,y_predict,target_names):
    
#function for confusion matrix visualisation

    cMatrix = confusion_matrix(y_true,y_predict)
    df_cm = pd.DataFrame(cMatrix,index=target_names,columns=target_names)
    plt.figure(figsize = (6,4))
    cm = sns.heatmap(df_cm,annot=True,fmt="d")
    cm.yaxis.set_ticklabels(cm.yaxis.get_ticklabels(),rotation=90)
    cm.xaxis.set_ticklabels(cm.xaxis.get_ticklabels(),rotation=0)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# get the 3 class names
class_names = train.Survived.unique()

confusionM_svc(y_test,y_pred,class_names)

# Linear SVC

LinearSVC is a popular choice for classification problems that involve linearly separable data. It is easy to use, computationally efficient, and can handle large datasets. 

In [None]:
# Train the linear SVC classifier
linear_svc = LinearSVC(max_iter=1000, random_state=0,dual=False)
linear_svc.fit(x_train, y_train)

# Make predictions on the testing set
y_pred = linear_svc.predict(x_test)

#KFOLD cross validation score
scores = cross_val_score(linear_svc, x_train, y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
mean_lsvc =  round(scores.mean()* 100, 2)
print("Mean:", mean_lsvc)
print("Standard Deviation:", scores.std())

# Calculate the accuracy,confusion matrix,precesion,recall and f1 score of the model
acc_linear_svc = round(accuracy_score(y_pred, y_test) * 100, 2)
cm_linear_svc = confusion_matrix(y_test, y_pred) 
precision_linear_svc = round(precision_score(y_test, y_pred) * 100, 2)
recall_linear_svc = round(recall_score(y_test, y_pred) * 100, 2)
f1_linear_svc = round(f1_score(y_test, y_pred) * 100, 2)

# print the results
print("Confusion matrix:\n", cm_linear_svc)
print(acc_linear_svc)
print("Precision: {:.2f}".format(precision_linear_svc))
print("Recall: {:.2f}".format(recall_linear_svc))
print("F1 score: {:.2f}".format(f1_linear_svc))

# Confusion matrix visualization

def confusionM_linear_svc(y_true,y_predict,target_names):
    
#function for confusion matrix visualisation

    cMatrix = confusion_matrix(y_true,y_predict)
    df_cm = pd.DataFrame(cMatrix,index=target_names,columns=target_names)
    plt.figure(figsize = (6,4))
    cm = sns.heatmap(df_cm,annot=True,fmt="d")
    cm.yaxis.set_ticklabels(cm.yaxis.get_ticklabels(),rotation=90)
    cm.xaxis.set_ticklabels(cm.xaxis.get_ticklabels(),rotation=0)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# get the 3 class names
class_names = train.Survived.unique()

confusionM_linear_svc(y_test,y_pred,class_names)


# Perceptron

A perceptron is a single-layer neural network that takes a set of input features and produces a binary output (0 or 1) based on a linear combination of the input features.

In [None]:
# Train the perception classifier
perceptron = Perceptron()
perceptron.fit(x_train, y_train)

# Make predictions on the testing set
y_pred = perceptron.predict(x_test)

#KFOLD cross validation score
scores = cross_val_score(perceptron, x_train, y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
mean_perceptron =  round(scores.mean()* 100, 2)
print("Mean:", mean_perceptron)
print("Standard Deviation:", scores.std())

# Calculate the accuracy,confusion matrix,precesion,recall and f1 score of the model
acc_perceptron = round(accuracy_score(y_pred, y_test) * 100, 2)
cm_perceptron = confusion_matrix(y_test, y_pred) 
precision_perceptron = round(precision_score(y_test, y_pred) * 100, 2)
recall_perceptron = round(recall_score(y_test, y_pred) * 100, 2)
f1_perceptron = round(f1_score(y_test, y_pred) * 100, 2)

# print the results
print("Confusion matrix:\n", cm_perceptron)
print("Accuracy: {:.2f}".format(acc_perceptron))
print("Precision: {:.2f}".format(precision_perceptron))
print("Recall: {:.2f}".format(recall_perceptron))
print("F1 score: {:.2f}".format(f1_perceptron))

# Confusion matrix visualization

def confusionM_perceptron(y_true,y_predict,target_names):
    
#function for confusion matrix visualisation

    cMatrix = confusion_matrix(y_true,y_predict)
    df_cm = pd.DataFrame(cMatrix,index=target_names,columns=target_names)
    plt.figure(figsize = (6,4))
    cm = sns.heatmap(df_cm,annot=True,fmt="d")
    cm.yaxis.set_ticklabels(cm.yaxis.get_ticklabels(),rotation=90)
    cm.xaxis.set_ticklabels(cm.xaxis.get_ticklabels(),rotation=0)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# get the 3 class names
class_names = train.Survived.unique()

confusionM_perceptron(y_test,y_pred,class_names)


# Decision Tree

A decision tree is a type of predictive model that is commonly used in machine learning and data mining. It is a flowchart-like structure in which each internal node represents a test on a feature, each branch represents the outcome of the test, and each leaf node represents a class label or a decision.

In [None]:
# Train the Decesion tree classifier
decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train, y_train)

# Make predictions on the testing set
y_pred = decisiontree.predict(x_test)


#KFOLD cross validation score
scores = cross_val_score(decisiontree, x_train, y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
mean_decisiontree =  round(scores.mean()* 100, 2)
print("Mean:", mean_decisiontree)
print("Standard Deviation:", scores.std())
# Calculate the accuracy,confusion matrix,precesion,recall and f1 score of the model
acc_decisiontree = round(accuracy_score(y_pred, y_test) * 100, 2)
cm_decisiontree = confusion_matrix(y_test, y_pred) 
precision_decisiontree = round(precision_score(y_test, y_pred) * 100, 2)
recall_decisiontree = round(recall_score(y_test, y_pred) * 100, 2)
f1_decisiontree = round(f1_score(y_test, y_pred) * 100, 2)

# print the results
print("Confusion matrix:\n", cm_decisiontree)
print("Accuracy: {:.2f}".format(acc_decisiontree))
print("Precision: {:.2f}".format(precision_decisiontree))
print("Recall: {:.2f}".format(recall_decisiontree))
print("F1 score: {:.2f}".format(f1_decisiontree))

# Confusion matrix visualization

def confusionM_decisiontree(y_true,y_predict,target_names):
    
#function for confusion matrix visualisation

    cMatrix = confusion_matrix(y_true,y_predict)
    df_cm = pd.DataFrame(cMatrix,index=target_names,columns=target_names)
    plt.figure(figsize = (6,4))
    cm = sns.heatmap(df_cm,annot=True,fmt="d")
    cm.yaxis.set_ticklabels(cm.yaxis.get_ticklabels(),rotation=90)
    cm.xaxis.set_ticklabels(cm.xaxis.get_ticklabels(),rotation=0)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# get the 3 class names
class_names = train.Survived.unique()

confusionM_decisiontree(y_test,y_pred,class_names)

# Random Forest

A random forest is a type of ensemble learning method in machine learning, and it is based on constructing multiple decision trees and combining their outputs to make a final prediction.

In [None]:
# Train the Random Forest classifier
randomforest = RandomForestClassifier()
randomforest.fit(x_train, y_train)

# Make predictions on the testing set
y_pred = randomforest.predict(x_test)

#KFOLD cross validation score

scores = cross_val_score(randomforest, x_train, y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
mean_randomforest =  round(scores.mean()* 100, 2)
print("Mean:", mean_randomforest)
print("Standard Deviation:", scores.std())

# Calculate the accuracy,confusion matrix,precesion,recall and f1 score of the model
acc_randomforest = round(accuracy_score(y_pred, y_test) * 100, 2)
cm_randomforest = confusion_matrix(y_test, y_pred) 
precision_randomforest = round(precision_score(y_test, y_pred) * 100, 2)
recall_randomforest = round(recall_score(y_test, y_pred) * 100, 2)
f1_randomforest = round(f1_score(y_test, y_pred) * 100, 2)

# print the results
print("Confusion matrix:\n", cm_randomforest)
print("Accuracy: {:.2f}".format(acc_randomforest))
print("Precision: {:.2f}".format(precision_randomforest))
print("Recall: {:.2f}".format(recall_randomforest))
print("F1 score: {:.2f}".format(f1_randomforest))

# Confusion matrix visualization

def confusionM_randomforest(y_true,y_predict,target_names):
    
#function for confusion matrix visualisation

    cMatrix = confusion_matrix(y_true,y_predict)
    df_cm = pd.DataFrame(cMatrix,index=target_names,columns=target_names)
    plt.figure(figsize = (6,4))
    cm = sns.heatmap(df_cm,annot=True,fmt="d")
    cm.yaxis.set_ticklabels(cm.yaxis.get_ticklabels(),rotation=90)
    cm.xaxis.set_ticklabels(cm.xaxis.get_ticklabels(),rotation=0)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# get the 3 class names
class_names = train.Survived.unique()

confusionM_randomforest(y_test,y_pred,class_names)

# KNN or k-Nearest Neighbors

KNN or k-Nearest Neighbors is a type of supervised machine learning algorithm used for classification and regression tasks. In the KNN algorithm, the class of a new instance is predicted based on the class of its nearest neighbors in the training data.

In [None]:
# Train the KNN or k-Nearest Neighbors classifier
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)

# Make predictions on the testing set
y_pred = knn.predict(x_test)

#KFOLD cross validation score

scores = cross_val_score(knn, x_train, y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
mean_knn =  round(scores.mean()* 100, 2)
print("Mean:",mean_knn)
print("Standard Deviation:", scores.std())

# Calculate the accuracy,confusion matrix,precesion,recall and f1 score of the model
acc_knn = round(accuracy_score(y_pred, y_test) * 100, 2)
cm_knn = confusion_matrix(y_test, y_pred) 
precision_knn = round(precision_score(y_test, y_pred) * 100, 2)
recall_knn = round(recall_score(y_test, y_pred) * 100, 2)
f1_knn = round(f1_score(y_test, y_pred) * 100, 2)

# print the results
print("Confusion matrix:\n", cm_knn)
print("Accuracy: {:.2f}".format(acc_knn))
print("Precision: {:.2f}".format(precision_knn))
print("Recall: {:.2f}".format(recall_knn))
print("F1 score: {:.2f}".format(f1_knn))

# Confusion matrix visualization

def confusionM_knn(y_true,y_predict,target_names):
    
#function for confusion matrix visualisation

    cMatrix = confusion_matrix(y_true,y_predict)
    df_cm = pd.DataFrame(cMatrix,index=target_names,columns=target_names)
    plt.figure(figsize = (6,4))
    cm = sns.heatmap(df_cm,annot=True,fmt="d")
    cm.yaxis.set_ticklabels(cm.yaxis.get_ticklabels(),rotation=90)
    cm.xaxis.set_ticklabels(cm.xaxis.get_ticklabels(),rotation=0)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# get the 3 class names
class_names = train.Survived.unique()

confusionM_knn(y_test,y_pred,class_names)

# Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is an iterative optimization algorithm used for training machine learning models, particularly for large datasets. It is a variation of the gradient descent algorithm that computes the gradient of the loss function with respect to the model parameters using a random subset of the training data (a.k.a mini-batch), instead of the entire dataset.

In [None]:
# Train the Stochastic Gradient Descent classifier
sgd = SGDClassifier(loss='log', max_iter=1000, random_state=0)
sgd.fit(x_train, y_train)

# Make predictions on the testing set
y_pred = sgd.predict(x_test)

#KFOLD cross validation score

scores = cross_val_score(sgd, x_train, y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
mean_sgd = round(scores.mean()* 100, 2)
print("Mean:", mean_sgd)
print("Standard Deviation:", scores.std())


# Calculate the accuracy,confusion matrix,precesion,recall and f1 score of the model
acc_sgd = round(sgd.score(x_train, y_train) * 100, 2)
cm_sgd = confusion_matrix(y_test, y_pred) 
precision_sgd = round(precision_score(y_test, y_pred) * 100, 2)
recall_sgd = round(recall_score(y_test, y_pred) * 100, 2)
f1_sgd = round(f1_score(y_test, y_pred) * 100, 2)

# print the results
print("Confusion matrix:\n", cm_sgd)
print("Accuracy: {:.2f}".format(acc_sgd))
print("Precision: {:.2f}".format(precision_sgd))
print("Recall: {:.2f}".format(recall_sgd))
print("F1 score: {:.2f}".format(f1_sgd))


# Confusion matrix visualization

def confusionM_sgd(y_true,y_predict,target_names):
    
#function for confusion matrix visualisation

    cMatrix = confusion_matrix(y_true,y_predict)
    df_cm = pd.DataFrame(cMatrix,index=target_names,columns=target_names)
    plt.figure(figsize = (6,4))
    cm = sns.heatmap(df_cm,annot=True,fmt="d")
    cm.yaxis.set_ticklabels(cm.yaxis.get_ticklabels(),rotation=90)
    cm.xaxis.set_ticklabels(cm.xaxis.get_ticklabels(),rotation=0)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# get the 3 class names
class_names = train.Survived.unique()

confusionM_sgd(y_test,y_pred,class_names)

# Gradient Boosting Classifier

Gradient Boosting Classifier (GBC) is a type of ensemble learning algorithm used for classification problems. It is a variant of the gradient boosting algorithm that combines multiple weak classifiers to form a strong classifier.

In [None]:
# Train the Gaussian Naive Bayes classifier
gbk = GradientBoostingClassifier()
gbk.fit(x_train, y_train)

# Make predictions on the testing set
y_pred = gbk.predict(x_test)

#KFOLD cross validation score

scores = cross_val_score(gbk, x_train, y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
mean_gbk =  round(scores.mean()* 100, 2)
print("Mean:", mean_gbk)
print("Standard Deviation:", scores.std())

# Calculate the accuracy,confusion matrix,precesion,recall and f1 score of the model
acc_gbk = round(accuracy_score(y_pred, y_test) * 100, 2)
cm_gbk = confusion_matrix(y_test, y_pred) 
precision_gbk = round(precision_score(y_test, y_pred) * 100, 2)
recall_gbk = round(recall_score(y_test, y_pred) * 100, 2)
f1_gbk = round(f1_score(y_test, y_pred) * 100, 2)

# print the results
print("Confusion matrix:\n", cm_gbk)
print("Accuracy: {:.2f}".format(acc_gbk))
print("Precision: {:.2f}".format(precision_gbk))
print("Recall: {:.2f}".format(recall_gbk))
print("F1 score: {:.2f}".format(f1_gbk))

# Confusion matrix visualization

def confusionM_sgd(y_true,y_predict,target_names):
    
#function for confusion matrix visualisation

    cMatrix = confusion_matrix(y_true,y_predict)
    df_cm = pd.DataFrame(cMatrix,index=target_names,columns=target_names)
    plt.figure(figsize = (6,4))
    cm = sns.heatmap(df_cm,annot=True,fmt="d")
    cm.yaxis.set_ticklabels(cm.yaxis.get_ticklabels(),rotation=90)
    cm.xaxis.set_ticklabels(cm.xaxis.get_ticklabels(),rotation=0)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# get the 3 class names
class_names = train.Survived.unique()

confusionM_sgd(y_test,y_pred,class_names)

# 9. Model Comparison

# 9.1 Clustering Models

In [None]:
Clustering_models = pd.DataFrame({
    'Clustering_Models':['Kmeans' ,'Heirarichal','DBSCAN'],
    'silhouette_score': [silhouette_avg_kmeans,silhouette_avg_h,silhouette_avg_dbscan]})
    
Clustering_models.sort_values(by='silhouette_score', ascending=False)   

# 9.2 Classification Models

In [None]:
models = pd.DataFrame({
    'Classification_Models': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 'Linear SVC', 
              'Decision Tree', 'Stochastic Gradient Descent', 'Gradient Boosting Classifier'],
    
    'Accuracy Score': [acc_svc, acc_knn, acc_logreg, 
              acc_randomforest, acc_gaussian, acc_perceptron,acc_linear_svc, acc_decisiontree,
              acc_sgd, acc_gbk],
     'Precision Score':[precision_svc, precision_knn, precision_logreg, 
              precision_randomforest, precision_gaussian, precision_perceptron,precision_linear_svc, 
                        precision_decisiontree,precision_sgd, precision_gbk],
      'Recall':[recall_svc, recall_knn, recall_logreg, 
              recall_randomforest, recall_gaussian, recall_perceptron,recall_linear_svc,
                recall_decisiontree, recall_sgd, recall_gbk],
       'F1 Score':[f1_svc, f1_knn, f1_logreg, 
              f1_randomforest, f1_gaussian, f1_perceptron,f1_linear_svc, f1_decisiontree,
              f1_sgd, f1_gbk],})
models.sort_values(by='Accuracy Score', ascending=False)


KFold cross Validation is also performed to check if gradient Boosting itself is best model to be chosen

In [None]:
Classification_Models = pd.DataFrame({
    'Models':['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 'Linear SVC',
              'Decision Tree','Stochastic Gradient Descent', 
              'Gradient Boosting Classifier'],
    
   'KFold Score':[mean_svc, mean_knn, mean_logreg,mean_randomforest,
                  mean_gaussian, mean_perceptron,mean_lsvc, mean_decisiontree,
                  mean_sgd, mean_gbk],})
    
Classification_Models.sort_values(by='KFold Score', ascending=False) 

# 10. Model Prediction

Clustering does not make predictions about the survival of new passengers, as k-means,heirarichal and DBSCAN clustering is an unsupervised learning technique.

Based on the metrics provided in the table, the Gradient Boosting Classifier appears to be the best model for titanic survival data prediction with an accuracy score of 85.79% and a high precision score of 89.47%. 

Also checked KFold Cross Score with 10 folds to make sure the model selection and there as well got Gradient Boosting Classifier with higherst score(80.97)

I decided to use the Gradient Boosting Classifier model for the testing data.

## 11. Discussion and Conclusion.

For the titanic survival prediction project, the key findings includes the identification of key features that affect the survival rate, such as gender, age, and class. Various machine learning model, such as a logistic regression or a decision tree etc, can be trained on these features to predict the survival rate with a reasonable accuracy.

The limitations of the analysis includes the dataset may not be representative of the entire population on board the Titanic, which could limit the generalizability of the findings.

Future directions for improvement may include incorporating additional features, such as the location of the passenger on the ship or their occupation, or exploring different other machine learning models not used in current project to improve the accuracy of the predictions. Additionally, collecting and incorporating more data, if available, could also help to improve the accuracy of the model.

# References

Wijaya, J. and Agee, J.T., 2014. Machine learning techniques for predictive maintenance of shipboard systems. In Proceedings of the ASNE Ship Maintenance and Modernization Symposium (pp. 1-12).

Manikandan, M. and Balamurugan, K., 2015. Predicting survival on the Titanic: A comparison of machine learning techniques. International Journal of Applied Engineering Research, 10(67), pp.49-56.

Alvarez, A., 2017. Exploring the Titanic dataset with R. arXiv preprint arXiv:1703.05921.

Nedelcu, O. and Ionescu, A., 2019. Titanic: Machine learning from disaster. In 2019 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR) (pp. 1-5). IEEE.

Mishra, S. and Singh, S., 2020. Survival prediction of Titanic passengers using data mining techniques. Journal of Big Data, 7(1), pp.1-15.