# <center>Titanic Survival Prediction 🚢</center>

<img src="https://images.unsplash.com/photo-1558431571-4a9f128e135f?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1032&q=80">

# About the Dataset

* **Survived** - Survival (0 = No, 1 = Yes) ---> *Output Variable*
* **Pclass** - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) ---> *Input Variable*
* **Sex** - Sex of the passenger ---> *Input Variable*
* **Age** - Age in years ---> *Input Variable*
* **SibSp** - Number of siblings/spouses aboard the Titanic ---> *Input Variable*
* **Parch** - Number of parents/children aboard the Titanic ---> *Input Variable*
* **Ticket** - Ticket number ---> *Input Variable*
* **Fare** - Passenger fare ---> *Input Variable*
* **Cabin** - Cabin number ---> *Input Variable*
* **Embarked** - Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) ---> *Input Variable*

# Importing the Essential Libraries, Metrics, Tools and Models

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Loading the Data

In [None]:
df = pd.read_csv("../input/titanic/train.csv")

# Exploratory Data Analysis

***Taking a look at the first 5 rows of the dataset.***

In [None]:
df.head()

***Checking the shape (number of rows and columns) of the data.***

In [None]:
df.shape

***Checking the dtypes of the columns and how many non-null values each column contains.***

In [None]:
df.info()

***Getting the statistical summary of the dataset.***

In [None]:
df.describe().T

In [None]:
df.drop(["PassengerId", "Name", "Ticket"], axis=1, inplace=True)

# Detecting Missing Values and Duplicates

In [None]:
df.isna().sum()

***There are missing values in the data. We drop the "Cabin" column because most of its values are missing and cannot be imputed reliably. For the remaining columns, we impute missing values in categorical columns with the mode of that column and missing values in numerical columns with the median of that column.***

In [None]:
df.drop("Cabin", axis=1, inplace=True)

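# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is deprecated in recent pandas versions and may not modify df.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])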
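
As a quick sanity check of the strategy described above, the sketch below applies the same median/mode imputation to a small made-up frame. The `toy` values and the `missing_share` name are invented for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the imputation logic.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

# Share of missing values per column (0.25 means 25% of the rows are missing).
missing_share = toy.isna().mean()

# Numerical column -> median, categorical column -> mode (most frequent value).
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(missing_share)
print(toy)
```

Looking at the share of missing values rather than the raw count can make the decision to drop a column such as "Cabin" easier to justify.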

In [None]:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True)

# Data Visualization

***Visualizing the pairwise relationships between the numerical variables with a pairplot.***

In [None]:
sns.pairplot(df, hue="Survived")

<h3>Plotting the Values of Each Variable</h3>

In [None]:
num_cols_viz = ["Age", "Fare"]
cat_cols_viz = ["Pclass", "Sex", "SibSp", "Parch", "Embarked"]

sns.set()

for num_col in num_cols_viz:
    plt.figure(figsize=(12,8))
    sns.histplot(df[num_col], kde=True)  # distplot is deprecated in recent seaborn releases
    plt.title(f"{num_col}", size=15)
    plt.show()

for cat_col in cat_cols_viz:
    plt.figure(figsize=(10,8))
    sns.countplot(x=df[cat_col])  # pass the column by keyword; recent seaborn treats the first positional argument as data
    plt.title(f"{cat_col}", size=15)
    plt.show()

<h3>Relationship Between Each Variable and Target Variable (Survived)</h3>

In [None]:
for num_col in num_cols_viz:
    plt.figure(figsize=(12,8))
    sns.violinplot(x=df["Survived"], y=df[num_col])
    plt.title(f"{num_col} vs Survived", size=15)
    plt.show()

for cat_col in cat_cols_viz:
    plt.figure(figsize=(12,8))
    sns.barplot(x=df[cat_col], y=df["Survived"])
    plt.title(f"Survived vs {cat_col}", size=15)
    plt.show()

***Visualizing the linear correlations between variables with a heatmap. The measure of linear correlation used here is the Pearson correlation coefficient.***

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df[["Survived", "Age", "Fare"]].corr(), annot=True, cmap="Blues")
plt.title("Correlations Between Variables", size=16)
plt.show()
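
To make the Pearson coefficient behind the heatmap more concrete, here is a minimal sketch that computes it by hand and compares it with `DataFrame.corr()`. The `toy` numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from the Titanic data.
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0, 50.0],
                    "Fare": [10.0, 15.0, 35.0, 40.0]})

x = toy["Age"].to_numpy()
y = toy["Fare"].to_numpy()

# Pearson r = sum of centered products / sqrt(product of centered sums of squares).
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# DataFrame.corr() uses the Pearson coefficient by default.
r_pandas = toy.corr().loc["Age", "Fare"]

print(r_manual, r_pandas)
```

The coefficient always lies between -1 and 1 and only captures linear relationships, so a value near 0 does not rule out a non-linear dependence.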

# Data Preprocessing

<h3>X, y Split</h3>

In [None]:
X = df.drop("Survived", axis=1)
y = df["Survived"]

<h3>One-Hot Encoding</h3>

In [None]:
X = pd.get_dummies(X, columns=["Embarked", "Sex"])
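
As an illustration of what the encoding step produces, the sketch below runs `pd.get_dummies` on a three-row frame. The `toy` frame is made up for this example.

```python
import pandas as pd

# Hypothetical rows, used only to show the shape of the encoded output.
toy = pd.DataFrame({"Pclass": [1, 3, 2],
                    "Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", "Q"]})

# Each listed categorical column is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Embarked", "Sex"])

print(encoded.columns.tolist())
print(encoded)
```

Each original category becomes its own indicator column (for example `Sex_female` and `Sex_male`), while untouched columns such as `Pclass` are kept as they are. For linear models, passing `drop_first=True` is one option for removing the redundant indicator in each group.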

<h3>Data Standardization</h3>

In [None]:
scaler = StandardScaler()
X[["Age", "Fare"]] = scaler.fit_transform(X[["Age", "Fare"]])

<h3>Train-Test Split</h3>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
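
The standardization cell above fits the scaler on the full dataset before splitting, which lets test-set statistics influence the scaling. A common alternative is to split first and fit the scaler on the training part only. The sketch below shows that variant on synthetic data; all `toy_*` names and the random values are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the "Age" and "Fare" columns.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"Age": rng.uniform(1, 80, size=100),
                      "Fare": rng.uniform(5, 500, size=100)})
toy_y = pd.Series(rng.integers(0, 2, size=100))

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42)

# Learn mean and standard deviation from the training rows only,
# then reuse those statistics to transform the test rows.
toy_scaler = StandardScaler()
toy_X_train_scaled = toy_scaler.fit_transform(toy_X_train)
toy_X_test_scaled = toy_scaler.transform(toy_X_test)

print(toy_X_train_scaled.mean(axis=0))  # close to zero by construction
print(toy_X_test_scaled.mean(axis=0))   # generally not exactly zero
```

On a dataset of this size the effect on the reported accuracy is likely small, but fitting on the training split keeps the test set fully unseen.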

# Machine Learning Models

In [None]:
models = pd.DataFrame(columns=["Model", "Accuracy Score"])

<h3>Logistic Regression</h3>

In [None]:
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
predictions = log_reg.predict(X_test)
score = accuracy_score(y_test, predictions)
print("Accuracy Score:", score)

new_row = {"Model": "LogisticRegression", "Accuracy Score": score}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

<h3>Random Forest Classifier</h3>

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_test)
score = accuracy_score(y_test, predictions)
print("Accuracy Score:", score)

new_row = {"Model": "RandomForestClassifier", "Accuracy Score": score}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

<h3>Gradient Boosting Classifier</h3>

In [None]:
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
predictions = gbc.predict(X_test)
score = accuracy_score(y_test, predictions)
print("Accuracy Score:", score)

new_row = {"Model": "GradientBoostingClassifier", "Accuracy Score": score}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

<h3>Support Vector Machines</h3>

In [None]:
svc = SVC()
svc.fit(X_train, y_train)
predictions = svc.predict(X_test)
score = accuracy_score(y_test, predictions)
print("Accuracy Score:", score)

new_row = {"Model": "SVC", "Accuracy Score": score}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

<h3>K-Nearest Neighbors</h3>

In [None]:
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
score = accuracy_score(y_test, predictions)
print("Accuracy Score:", score)

new_row = {"Model": "KNeighborsClassifier", "Accuracy Score": score}
models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

<h3>Model Comparison Before Hyperparameter Tuning</h3>

In [None]:
models.sort_values(by="Accuracy Score", ascending=False)

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(x=models["Model"], y=models["Accuracy Score"])
plt.title("Models' Accuracy Scores", size=15)
plt.xticks(rotation=30)
plt.show()

# Hyperparameter Tuning

***Defining two helper functions to make the evaluation plots easier to produce.***

In [None]:
def visualize_roc_auc_curve(model, model_name):
    pred_prob = model.predict_proba(X_test)
    fpr, tpr, thresh = roc_curve(y_test, pred_prob[:,1], pos_label=1)

    # Constant scores yield the chance-level diagonal (TPR = FPR), used as a baseline.
    random_probs = [0] * len(y_test)
    p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)

    score = roc_auc_score(y_test, pred_prob[:,1])
    print("ROC AUC Score:", score)

    plt.figure(figsize=(10,8))
    plt.plot(fpr, tpr, linestyle='--',color='orange')
    plt.plot(p_fpr, p_tpr, linestyle='--', color='blue')

    plt.title(f'{model_name} ROC curve', size=15)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.show()
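
To illustrate what `roc_curve` and `roc_auc_score` return inside the helper above, here is a minimal sketch on four hand-written labels and scores. The `demo_*` names and values are assumptions for this example only.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted scores for the positive class.
demo_true = [0, 0, 1, 1]
demo_score = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per decision threshold.
demo_fpr, demo_tpr, demo_thresholds = roc_curve(demo_true, demo_score, pos_label=1)

# AUC equals the share of (positive, negative) pairs ranked correctly:
# 3 of the 4 pairs here, because 0.35 is scored below the negative at 0.4.
demo_auc = roc_auc_score(demo_true, demo_score)

print("FPR:", demo_fpr)
print("TPR:", demo_tpr)
print("ROC AUC:", demo_auc)  # 0.75
```

An AUC of 0.5 corresponds to the diagonal baseline drawn by the helper, and 1.0 to a perfect ranking.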

In [None]:
def f_importances(model, model_name):
    f_imp = pd.DataFrame({"Feature Importances": model.feature_importances_}, index=X.columns)

    plt.figure(figsize=(12,8))
    sns.barplot(x=f_imp["Feature Importances"], y=f_imp.index)
    plt.title(f"{model_name} Feature Importances", size=15)
    plt.show()

In [None]:
tuned_models = pd.DataFrame(columns=["Model", "Accuracy Score"])

<h3>Tuning the Logistic Regression</h3>

In [None]:
param_grid_log_reg = {"C": [0.0001, 0.001, 0.01, 0.1, 1, 10]}

grid_log_reg = GridSearchCV(LogisticRegression(), param_grid_log_reg, cv=5, scoring="accuracy", verbose=0, n_jobs=-1)

grid_log_reg.fit(X_train, y_train)

In [None]:
log_reg_params = grid_log_reg.best_params_

log_reg = LogisticRegression(**log_reg_params)
log_reg.fit(X_train, y_train)
predictions = log_reg.predict(X_test)
score = accuracy_score(y_test, predictions)
print("Accuracy Score:", score)

new_row = {"Model": "LogisticRegression", "Accuracy Score": score}
tuned_models = pd.concat([tuned_models, pd.DataFrame([new_row])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
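
With its default `refit=True`, `GridSearchCV` already retrains the best configuration on the whole training set, so the manual refit above could alternatively be replaced by `best_estimator_`. The sketch below shows this on synthetic data from `make_classification`; the `Xs`, `ys`, `grid`, and `best_model` names are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the Titanic features.
Xs, ys = make_classification(n_samples=200, n_features=5, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.3, random_state=42)

grid = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1, 10]},
                    cv=5, scoring="accuracy")
grid.fit(Xs_train, ys_train)

# best_estimator_ is the model refit on all of Xs_train with the best parameters.
best_model = grid.best_estimator_
test_acc = accuracy_score(ys_test, best_model.predict(Xs_test))

print("Best params:", grid.best_params_)
print("Mean CV accuracy:", grid.best_score_)
print("Test accuracy:", test_acc)
```

Comparing `best_score_` (cross-validated on the training data) with the test accuracy gives a rough indication of how well the tuned model generalizes.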

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix(y_test, predictions), annot=True, cmap="Blues", fmt='d')
plt.title("Confusion Matrix of Logistic Regression", size=15)
plt.show()
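
For reading the confusion matrix heatmaps, it may help to recall the layout scikit-learn uses: rows are true classes and columns are predicted classes. The sketch below uses five hand-written labels (the `cm_*` names are hypothetical) and recovers the accuracy from the matrix.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels.
cm_true = [0, 0, 1, 1, 1]
cm_pred = [0, 1, 1, 1, 0]

# Layout for binary labels: [[TN, FP],
#                            [FN, TP]]
cm = confusion_matrix(cm_true, cm_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the diagonal (correct predictions) divided by all predictions.
acc_from_cm = (tn + tp) / cm.sum()

print(cm)
print(acc_from_cm, accuracy_score(cm_true, cm_pred))
```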

In [None]:
visualize_roc_auc_curve(log_reg, "Logistic Regression")

<h3>Tuning the Random Forest</h3>

In [None]:
param_grid_rfc = {"max_depth": [None],
                  "max_features": [1, 3, 10],
                  "min_samples_split": [2, 3, 10],
                  "min_samples_leaf": [1, 3, 10],
                  "n_estimators" :[100, 200, 500]}

grid_rfc = GridSearchCV(RandomForestClassifier(), param_grid_rfc, cv=5, scoring="accuracy", verbose=0, n_jobs=-1)

grid_rfc.fit(X_train, y_train)

In [None]:
rfc_params = grid_rfc.best_params_

rfc = RandomForestClassifier(**rfc_params)
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_test)
score = accuracy_score(y_test, predictions)
print("Accuracy Score:", score)

new_row = {"Model": "RandomForestClassifier", "Accuracy Score": score}
tuned_models = pd.concat([tuned_models, pd.DataFrame([new_row])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix(y_test, predictions), annot=True, cmap="Blues", fmt='d')
plt.title("Confusion Matrix of Random Forest", size=15)
plt.show()

In [None]:
visualize_roc_auc_curve(rfc, "Random Forest")

In [None]:
f_importances(rfc, "Random Forest")

<h3>Tuning the Gradient Boosting Classifier</h3>

In [None]:
param_grid_gbc = {'n_estimators' : [100, 200, 500],
                  'learning_rate': [0.1, 0.05, 0.01],
                  'max_depth': [2, 3, 6],
                  'min_samples_leaf': [1, 2, 5]}

grid_gbc = GridSearchCV(GradientBoostingClassifier(), param_grid_gbc, cv=5, scoring="accuracy", verbose=0, n_jobs=-1)

grid_gbc.fit(X_train, y_train)

In [None]:
gbc_params = grid_gbc.best_params_

gbc = GradientBoostingClassifier(**gbc_params)
gbc.fit(X_train, y_train)
predictions = gbc.predict(X_test)
score = accuracy_score(y_test, predictions)
print("Accuracy Score:", score)

new_row = {"Model": "GradientBoostingClassifier", "Accuracy Score": score}
tuned_models = pd.concat([tuned_models, pd.DataFrame([new_row])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix(y_test, predictions), annot=True, cmap="Blues", fmt='d')
plt.title("Confusion Matrix of Gradient Boosting", size=15)
plt.show()

In [None]:
visualize_roc_auc_curve(gbc, "Gradient Boosting")

In [None]:
f_importances(gbc, "Gradient Boosting")

<h3>Tuning the Support Vector Machines</h3>

In [None]:
param_grid_svc = {'gamma': [ 0.001, 0.01, 0.1, 1, 10],
                  'C': [1, 10, 50, 100, 200, 300, 500, 1000]}

grid_svc = GridSearchCV(SVC(), param_grid_svc, cv=5, scoring="accuracy", verbose=0, n_jobs=-1)

grid_svc.fit(X_train, y_train)

In [None]:
svc_params = grid_svc.best_params_

svc = SVC(**svc_params)
svc.fit(X_train, y_train)
predictions = svc.predict(X_test)
score = accuracy_score(y_test, predictions)
print("Accuracy Score:", score)

new_row = {"Model": "SVC", "Accuracy Score": score}
tuned_models = pd.concat([tuned_models, pd.DataFrame([new_row])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix(y_test, predictions), annot=True, cmap="Blues", fmt='d')
plt.title("Confusion Matrix of Support Vector Machines", size=15)
plt.show()

***According to the scikit-learn documentation for SVC, predict_proba() may be inconsistent with predict(), and its probability estimates can be unreliable on small datasets. For that reason, we do not plot the ROC curve for the SVC model.***
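
Although this notebook skips the ROC curve for SVC, one possible workaround is to rank samples by `decision_function`, the signed distance to the separating boundary, since `roc_curve` only needs scores and not calibrated probabilities. The sketch below uses synthetic data; `Xd`, `yd`, `svc_demo`, and the other names are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the Titanic features.
Xd, yd = make_classification(n_samples=200, n_features=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    Xd, yd, test_size=0.3, random_state=42)

svc_demo = SVC()
svc_demo.fit(Xd_train, yd_train)

# Larger decision_function values mean the sample lies further on the
# positive-class side of the boundary, so they can serve as ranking scores.
svc_scores = svc_demo.decision_function(Xd_test)
svc_fpr, svc_tpr, _ = roc_curve(yd_test, svc_scores)
svc_auc = roc_auc_score(yd_test, svc_scores)

print("ROC AUC from decision_function:", svc_auc)
```

This keeps the ROC analysis consistent with `predict()`, because both are derived from the same decision values.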

<h3>Tuning the K-Nearest Neighbors</h3>

In [None]:
param_grid_knn = {"n_neighbors": range(1,20),
                  "leaf_size": range(1,50, 5),
                  "p": [1, 2]}

grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5, scoring="accuracy", verbose=0, n_jobs=-1)

grid_knn.fit(X_train, y_train)

In [None]:
knn_params = grid_knn.best_params_

knn = KNeighborsClassifier(**knn_params)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
score = accuracy_score(y_test, predictions)
print("Accuracy Score:", score)

new_row = {"Model": "KNeighborsClassifier", "Accuracy Score": score}
tuned_models = pd.concat([tuned_models, pd.DataFrame([new_row])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix(y_test, predictions), annot=True, cmap="Blues", fmt='d')
plt.title("Confusion Matrix of K-Nearest Neighbors", size=15)
plt.show()

In [None]:
visualize_roc_auc_curve(knn, "K-Nearest Neighbors")

# Model Comparison After Hyperparameter Tuning

In [None]:
tuned_models.sort_values(by="Accuracy Score", ascending=False)

In [None]:
plt.figure(figsize=(12, 8))
sns.barplot(x=tuned_models["Model"], y=tuned_models["Accuracy Score"])
plt.title("Models' Accuracy Scores After Hyperparameter Tuning", size=15)
plt.xticks(rotation=30)
plt.show()

# Conclusion

<h3>After hyperparameter tuning, the model with the best accuracy score is the Support Vector Machine, with an accuracy score of 0.832618.</h3>

<h1 style="font-family: Times New Roman;">Thank you so much for reading this notebook. Preparing notebooks takes a great deal of time. If you liked it, please do not forget to give an upvote. Peace Out ✌️ ...</h1>