# 1. Introduction 👋
<img src="https://nypost.com/wp-content/uploads/sites/2/2017/01/170103-titanic-ship-feature.jpg?quality=90&strip=all&w=1488" alt="Titanic Ship Picture" width="800" height="800"><br>

## Data Set Problems 🤔
👉 This dataset contains general information about each passenger (gender, age) along with trip details (ticket class, fare, cabin number, etc.). A machine learning model is needed **to predict whether a Titanic passenger survived.**

---

## Objectives of Notebook 📌
👉 **This notebook aims to:**
*   Explore the dataset using various types of data visualization.
*   Build various ML models that can predict whether a Titanic passenger survived.
*   Generate the prediction output in CSV format.

👨‍💻 **The machine learning models used in this project are:** 
1. Logistic Regression
2. SVC
3. K Neighbors Classifier
4. Decision Tree
5. Random Forest
6. Gradient Boosting

---

## Data Set Description 🧾

👉 There are **12 variables** in this data set:
*   **4 categorical** variables,
*   **4 numerical** variables (Age and Fare are continuous; SibSp and Parch are counts),
*   **1** variable that contains ID of passenger,
*   **1** variable to accommodate the name of passenger,
*   **1** variable that stores ticket number, and
*   **1** variable with various cabin number.

<br>

👉 The following is the **structure of the data set**.


<table style="width:100%">
<thead>
<tr>
<th style="text-align:center; font-weight: bold; font-size:14px">Variable Name</th>
<th style="text-align:center; font-weight: bold; font-size:14px">Description</th>
<th style="text-align:center; font-weight: bold; font-size:14px">Sample Data</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>PassengerId</b></td>
<td>ID of passenger <br> (unique ID)</td>
<td>1; 2; ...</td>
</tr>
<tr>
<td><b>Survived</b></td>
<td>Survival status <br> (0 = No, 1 = Yes)</td>
<td>0; 1; ...</td>
</tr>
<tr>
<td><b>Pclass</b></td>
<td>Ticket class <br> (1 = 1st/Upper, 2 = 2nd/Middle, 3 = 3rd/Lower)</td>
<td>1; 3; ...</td>
</tr>
<tr>
<td><b>Name</b></td>
<td>Passenger name <br> (unique)</td>
<td>Braund, Mr. Owen Harris; Heikkinen, Miss. Laina; ...</td>
</tr>
<tr>
<td><b>Sex</b></td>
<td>Passenger gender <br> (male or female)</td>
<td>male; female; ...</td>
</tr>
<tr>
<td><b>Age</b></td>
<td>Passenger age <br> (in years)</td>
<td>22; 38; ...</td>
</tr>
<tr>
<td><b>SibSp</b></td>
<td>Number of siblings / spouses aboard the Titanic</td>
<td>0; 3; ...</td>
</tr>
<tr>
<td><b>Parch</b></td>
<td>Number of parents / children aboard the Titanic</td>
<td>1; 2; ...</td>
</tr>
<tr>
<td><b>Ticket</b></td>
<td>Ticket number</td>
<td>A/5 21171; PC 17599; ...</td>
</tr>
<tr>
<td><b>Fare</b></td>
<td>Passenger fare</td>
<td>7.25; 71.2833; ...</td>
</tr>
<tr>
<td><b>Cabin</b></td>
<td>Cabin number</td>
<td>C85; C123; ...</td>
</tr>
<tr>
<td><b>Embarked</b></td>
<td>Embarkation port<br> (C = Cherbourg, Q = Queenstown, S = Southampton)</td>
<td>C; S; ...</td>
</tr>
</tbody>
</table>

---

**Like this notebook? You can support me by giving it an upvote** 😆👍🔼 <br>
👉 *More about myself: [linktr.ee/caesarmario_](http://linktr.ee/caesarmario_)*

# 2. Importing Libraries 📚
👉 **Importing libraries** that will be used in this notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import missingno as mso
import seaborn as sns
import warnings
import os

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

warnings.simplefilter(action='ignore', category=FutureWarning)

# 3. Reading Data Set 👓
👉 After importing the libraries, we **import the data sets** that will be used: the **train and test** data sets.

In [None]:
train_ds = pd.read_csv("../input/titanic/train.csv")
test_ds = pd.read_csv("../input/titanic/test.csv")

In [None]:
train_ds.head()

In [None]:
print(train_ds.info())

👉 For the train data set, **the output shows the data type of each column**. It also shows that **some columns contain null values**.

In [None]:
test_ds.head()

In [None]:
print(test_ds.info())

👉 As in the train data set, **some columns in the test data set contain null values**.

# 4. Data Exploration 🔍
👉 This section explores the imported training data set.

## 4.1 Survived Distribution 😇

In [None]:
train_ds.Survived.value_counts()

In [None]:
sns.countplot(x="Survived", data=train_ds, palette="mako_r")
plt.xlabel('Survived (0=No, 1=Yes)')
plt.show()

In [None]:
countNotSurvive = len(train_ds[train_ds.Survived == 0])
countSurvive = len(train_ds[train_ds.Survived == 1])
print("Not Survive Percentage: {:.2f}%".format((countNotSurvive / (len(train_ds.Survived))*100)))
print("Survive Percentage: {:.2f}%".format((countSurvive / (len(train_ds.Survived))*100)))

👉 It can be seen that **most passengers did not survive** 😢.
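
👉 The same percentages can also be read off in one step with `value_counts(normalize=True)`. The following is a minimal sketch on a small made-up column; `survived` and its values are illustrative and not taken from the data set.

```python
import pandas as pd

# Hypothetical toy column standing in for train_ds.Survived
survived = pd.Series([0, 0, 0, 1, 1], name="Survived")

# normalize=True returns class shares instead of raw counts
share = survived.value_counts(normalize=True) * 100
print(share.round(2))
```

In the notebook itself, the equivalent call would be applied to `train_ds.Survived`.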

## 4.2 Gender Distribution 👫

In [None]:
sns.countplot(x='Sex', data=train_ds, palette="bwr")
plt.xlabel("Sex")
plt.show()

In [None]:
countFemale = len(train_ds[train_ds.Sex == "female"])
countMale = len(train_ds[train_ds.Sex == "male"])
print("Female Percentage: {:.2f}%".format((countFemale / (len(train_ds.Sex))*100)))
print("Male Percentage: {:.2f}%".format((countMale / (len(train_ds.Sex))*100)))

👉 **The percentage of male passengers is higher** than female passengers.

## 4.3 Survived Distribution based on Gender 😇👫

In [None]:
pd.crosstab(train_ds.Sex,train_ds.Survived).plot(kind="bar",figsize=(10,6))
plt.title('Survived status based on Gender')
plt.xlabel('Gender')
plt.xticks(rotation=0)
plt.legend(["Not Survived", "Survived"])
plt.ylabel('Frequency')
plt.show()

👉 Among **male passengers, more did not survive** than survived 😢. <br>
👉 Among **female passengers, more survived** than did not survive.
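
👉 Raw counts are harder to compare because the two groups differ in size. Survival rates per gender can be obtained by normalizing each row of the crosstab. This is a sketch on made-up rows; `toy` and `rates` are illustrative names.

```python
import pandas as pd

# Hypothetical toy rows; the notebook would pass train_ds.Sex and train_ds.Survived
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1],
})

# normalize="index" turns each row into proportions that sum to 1
rates = pd.crosstab(toy.Sex, toy.Survived, normalize="index")
print(rates)
```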

## 4.4 Survived Distribution based on Age 😇👴

In [None]:
pd.crosstab(train_ds.Age,train_ds.Survived).plot(kind="bar",figsize=(25,8), color=['#C30281','#13BADF'])
plt.title('Survived Status based on Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

👉 It can be seen that **most passengers who did not survive were between 16 and 30 years old**; survivors are also concentrated in the same age range.

## 4.5 Passenger Class based on Gender 🌟👫

In [None]:
pd.crosstab(train_ds.Pclass,train_ds.Sex).plot(kind="bar",figsize=(10,6), color=['#86BD49','#F1DDDF'])
plt.title('Passenger class based on Gender')
plt.xlabel('Passenger class')  # the x-axis holds Pclass; gender is shown by bar color
plt.ylabel('Frequency')
plt.xticks(rotation=0)
plt.show()

👉 For **both genders, 3rd class has the most passengers** compared with the other classes.

# 5. Dataset Preprocessing 🧹
👉 This section will preprocess/clean both data sets.

## 5.1 Detecting Missing Values 🚫

In [None]:
train_ds.isnull().sum()

In [None]:
plt.figure(figsize = (16, 5))
ax_train = plt.subplot(1,2,2)
mso.bar(train_ds, ax = ax_train, fontsize = 12)

👉 Missing values are detected in the **"Age", "Cabin", and "Embarked"** columns of the **train dataset**.

In [None]:
test_ds.isnull().sum()

In [None]:
plt.figure(figsize = (16, 5))
ax_test = plt.subplot(1,2,2)
mso.bar(test_ds, ax = ax_test, fontsize = 12)

👉 Missing values are detected in the **"Age", "Cabin", and "Fare"** columns of the **test dataset**.

## 5.2 Replacing Missing Values 📝
👉 Imputation is a technique for substituting an estimated value for missing values in a dataset. In this section, imputation is performed for **Age** and **Fare**. The missing **Embarked** values are left as they are, and **Cabin** is dropped later.

In [None]:
# Age (with mean)
train_ds['Age'] = train_ds['Age'].fillna(train_ds['Age'].mean())
test_ds['Age'] = test_ds['Age'].fillna(test_ds['Age'].mean())

In [None]:
# Fare (with mean)
test_ds['Fare'] = test_ds['Fare'].fillna(test_ds['Fare'].mean())
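
👉 The cells above fill each data set with its own mean. A common alternative, shown here only as a sketch on made-up values, computes the statistic on the training data and reuses it for the test data, so the test set does not influence preprocessing. The names `train_toy`, `test_toy`, and `age_mean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train_ds and test_ds
train_toy = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test_toy = pd.DataFrame({"Age": [np.nan, 50.0]})

# Learn the fill value from the training data only
age_mean = train_toy["Age"].mean()  # mean() skips NaN by default

# Reuse the same value for both frames
train_toy["Age"] = train_toy["Age"].fillna(age_mean)
test_toy["Age"] = test_toy["Age"].fillna(age_mean)
print(test_toy)
```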

## 5.3 Distribution of Numerical Value 📈
👉 This section shows the distribution of the numerical variables and the skewness of each one.

In [None]:
train_ds.hist(grid=False, figsize=(18, 12), bins=5)

In [None]:
train_ds.skew(axis = 0, skipna = True, numeric_only = True)  # skip text columns such as Name

In [None]:
test_ds.hist(grid=False, figsize=(18, 12), bins=5)

In [None]:
test_ds.skew(axis = 0, skipna = True, numeric_only = True)  # skip text columns such as Name

👉 **The skewness of Fare, SibSp, and Parch** is **high**, so a **square root transformation will be applied** to both the training and test data sets.

## 5.4 Square root transformation 🔨

In [None]:
train_ds.Fare = np.sqrt(train_ds.Fare)
test_ds.Fare = np.sqrt(test_ds.Fare)

train_ds.SibSp = np.sqrt(train_ds.SibSp)
test_ds.SibSp = np.sqrt(test_ds.SibSp)

train_ds.Parch = np.sqrt(train_ds.Parch)
test_ds.Parch = np.sqrt(test_ds.Parch)
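
👉 The square root compresses large values more than small ones, which pulls in a long right tail. A small sketch of the effect, using illustrative right-skewed values (not the real column) and the hypothetical names `fare`, `skew_before`, and `skew_after`:

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed values with a few large outliers
fare = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 71.28, 263.0, 512.33])

skew_before = fare.skew()
skew_after = np.sqrt(fare).skew()  # same statistic after the transformation

print(f"before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The skewness stays positive here but becomes smaller, which is the behavior the transformation above relies on.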

## 5.5 Feature Engineering 🔧
👉 The feature engineering method used is **one-hot encoding**, which **transforms categorical variables into binary indicator columns that ML algorithms can use directly**.

In [None]:
train_ds = pd.get_dummies(train_ds, columns=['Sex', 'Embarked', 'Pclass'])
test_ds = pd.get_dummies(test_ds, columns=['Sex', 'Embarked', 'Pclass'])

In [None]:
train_ds.head()

In [None]:
test_ds.head()
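
👉 `pd.get_dummies` creates one column per category it sees, so encoding the two data sets separately only yields identical columns when both contain the same categories. A defensive sketch, assuming a made-up case where the test frame lacks one category (`train_toy`, `test_toy`, `train_enc`, and `test_enc` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy frames; the test frame happens to lack the category "Q"
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Embarked": ["S", "S", "C"]})

train_enc = pd.get_dummies(train_toy, columns=["Embarked"])
test_enc = pd.get_dummies(test_toy, columns=["Embarked"])

# Add any missing dummy column (filled with 0) and match the column order
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```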

## 5.6 Dropping Columns 🔻
👉 **Cabin, Name, Ticket, and PassengerId** contain identifier-like values that are mostly distinct per passenger, so these columns will be **removed**.

In [None]:
train_ds = train_ds.drop(['Cabin','Name','Ticket'], axis = 1)
test_ds = test_ds.drop(['Cabin','Name','Ticket'], axis = 1)

In [None]:
train_ds1 = train_ds.drop(['PassengerId'], axis = 1)
test_ds1 = test_ds.drop(['PassengerId'], axis = 1)

# 6. Dataset Preparation ⚙
👉 This section prepares the dataset before building the machine learning models.

## 6.1 Splitting the dataset into 80% training, 20% test 🪓

In [None]:
predictors = train_ds1.drop(["Survived"], axis=1)
target = train_ds1["Survived"]
x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size = 0.2, random_state = 0)
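
👉 `test_size = 0.2` holds out 20% of the rows for evaluation. As an optional variation (not what the cell above does), `stratify` keeps the class ratio the same in both parts. A sketch on made-up data, with `X`, `y`, and the `*_tr` / `*_te` names being illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 40% positive class
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 40)

# stratify=y keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(len(X_tr), len(X_te), y_te.mean())
```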

## 6.2 SMOTE Technique ⚒
👉 Since there are more non-survivors than survivors, **oversampling is carried out** on the training split to balance the classes, so the models are not biased toward the majority class.

In [None]:
from imblearn.over_sampling import SMOTE
x_train, y_train = SMOTE().fit_resample(x_train, y_train)

In [None]:
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, palette="mako_r")  # plot the resampled training labels
plt.ylabel('Survived Status')
plt.xlabel('Total')
plt.show()
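
👉 The core idea behind SMOTE is to create new minority-class rows by interpolating between an existing minority point and one of its minority-class neighbours. The following is a simplified NumPy sketch of that idea only; it is not the `imblearn` implementation (which, among other things, picks among several nearest neighbours), and `synthesize`, `minority`, and `synthetic` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])

def synthesize(points, n_new, rng):
    """Create n_new points by interpolating toward the nearest neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Distance from the chosen point to every minority point
        dist = np.linalg.norm(points - points[i], axis=1)
        dist[i] = np.inf  # never pick the point itself
        j = np.argmin(dist)
        lam = rng.random()  # position along the segment, in [0, 1)
        new_points.append(points[i] + lam * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=5, rng=rng)
print(synthetic)
```

Because every new point lies on a segment between two existing minority points, the synthetic rows stay inside the region the minority class already occupies.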

# 7. Model Building 🛠

## 7.1 Logistic Regression

In [None]:
LRclassifier = LogisticRegression(solver='liblinear', max_iter=5000)
LRclassifier.fit(x_train, y_train)

y_pred = LRclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
LRAcc = accuracy_score(y_pred,y_test)

print('Logistic regression accuracy: {:.2f}%'.format(LRAcc*100))
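
👉 In the matrix printed by `confusion_matrix`, rows are the true classes and columns are the predicted classes. A tiny worked example on made-up labels (`y_true` and `y_pred` here are illustrative, not model output):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, chosen only to make the layout easy to follow
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # rows are true classes, columns are predicted classes

acc = accuracy_score(y_true, y_pred)  # (tn + tp) / total
print(cm)
print(tn, fp, fn, tp, acc)
```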

## 7.2 Decision Tree

In [None]:
DTclassifier = DecisionTreeClassifier(max_leaf_nodes=10)
DTclassifier.fit(x_train, y_train)

y_pred = DTclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
DTAcc = accuracy_score(y_pred,y_test)

print('Decision tree accuracy: {:.2f}%'.format(DTAcc*100))

In [None]:
scoreListDT = []
for i in range(2,50):
    DTclassifier = DecisionTreeClassifier(max_leaf_nodes=i)
    DTclassifier.fit(x_train, y_train)
    scoreListDT.append(DTclassifier.score(x_test, y_test))
    
plt.plot(range(2,50), scoreListDT)
plt.xticks(np.arange(2,50,2))
plt.xlabel("Max leaf nodes")
plt.ylabel("Score")
plt.show()
DTAccMax = max(scoreListDT)
print("DT Acc Max: {:.2f}%".format(DTAccMax*100))
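
👉 Note that the loop above picks the best `max_leaf_nodes` by scoring on `x_test`, so `DTAccMax` is an optimistic estimate: the same rows both select and evaluate the model. One way to avoid this, sketched here on synthetic stand-in data, is to choose the value with cross-validation on the training data only. `X`, `y`, and `search` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook this would be x_train / y_train
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 5-fold cross-validation scores each max_leaf_nodes value on held-out folds
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_leaf_nodes": list(range(2, 21))},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The held-out split can then be used once, at the end, to report the accuracy of the selected model. The same reasoning applies to the K Neighbors and Random Forest loops.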

## 7.3 SVC

In [None]:
SVCclassifier = SVC(kernel='linear')
SVCclassifier.fit(x_train, y_train)

y_pred = SVCclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
SVCAcc = accuracy_score(y_pred,y_test)

print('SVC accuracy: {:.2f}%'.format(SVCAcc*100))

## 7.4 K Neighbors Classifier

In [None]:
KNclassifier = KNeighborsClassifier(n_neighbors=50)
KNclassifier.fit(x_train, y_train)

y_pred = KNclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
KNAcc = accuracy_score(y_pred,y_test)

print('K Neighbors Classifier accuracy: {:.2f}%'.format(KNAcc*100))

In [None]:
scoreListknn = []
for i in range(1,30):
    KNclassifier = KNeighborsClassifier(n_neighbors = i)
    KNclassifier.fit(x_train, y_train)
    scoreListknn.append(KNclassifier.score(x_test, y_test))
    
plt.plot(range(1,30), scoreListknn)
plt.xticks(np.arange(1,30,1))
plt.xlabel("K value")
plt.ylabel("Score")
plt.show()
KNAccMax = max(scoreListknn)
print("KNN Acc Max: {:.2f}%".format(KNAccMax*100))

## 7.5 Random Forest

In [None]:
RFclassifier = RandomForestClassifier(max_leaf_nodes=10)
RFclassifier.fit(x_train, y_train)

y_pred = RFclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

RFAcc = accuracy_score(y_pred,y_test)
print('Random Forest accuracy is: {:.2f}%'.format(RFAcc*100))

In [None]:
scoreListRF = []
for i in range(2,55):
    RFclassifier = RandomForestClassifier(n_estimators = 1000, random_state = 1, max_leaf_nodes=i)
    RFclassifier.fit(x_train, y_train)
    scoreListRF.append(RFclassifier.score(x_test, y_test))
    
plt.plot(range(2,55), scoreListRF)
plt.xticks(np.arange(2,55,5))
plt.xlabel("Max leaf nodes")
plt.ylabel("Score")
plt.show()
RFAccMax = max(scoreListRF)
print("RF Acc Max: {:.2f}%".format(RFAccMax*100))

## 7.6 Gradient Boosting

In [None]:
paramsGB={'n_estimators':[100,200,300,400,500],
      'max_depth':[1,2,3,4,5],
      'max_leaf_nodes':[2,5,10,20,30,40,50]}

In [None]:
GB = RandomizedSearchCV(GradientBoostingClassifier(), paramsGB, cv=10)
GB.fit(x_train,y_train)

In [None]:
print(GB.best_estimator_)
print(GB.best_score_)
print(GB.best_params_)
print(GB.best_index_)
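
👉 `RandomizedSearchCV` samples a limited number of parameter combinations (`n_iter`, 10 by default), so without a `random_state` the reported best parameters can differ between runs. With the default `refit=True`, the best model is also refitted on the full training data and can be reused directly. A sketch on synthetic data with a deliberately small grid; `search`, `params`, and `best_model` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data and a deliberately small grid to keep the run short
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
params = {"n_estimators": [10, 20], "max_depth": [1, 2, 3]}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    params,
    n_iter=3,        # number of sampled combinations
    cv=3,
    random_state=0,  # makes the sampled combinations repeatable
)
search.fit(X, y)

# With refit=True (the default), best_estimator_ is already fitted on all of X, y
best_model = search.best_estimator_
print(search.best_params_)
print(best_model.predict(X[:5]))
```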

In [None]:
GBclassifier = GradientBoostingClassifier(n_estimators=400, max_depth=3, max_leaf_nodes=10)
GBclassifier.fit(x_train, y_train)

y_pred = GBclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

GBAcc = accuracy_score(y_pred,y_test)
print('Gradient Boosting accuracy is: {:.2f}%'.format(GBAcc*100))

# 8. Model Comparison 👀

In [None]:
compare = pd.DataFrame({'Model': ['Logistic Regression', 'Decision Tree', 'SVC', 'K Neighbors Classifier', 'Random Forest Classifier'
                                 ,'Random Forest Max', 'K Neighbors Max', 'Decision Tree Max', 'Gradient Boosting'], 
                        'Accuracy': [LRAcc*100, DTAcc*100, SVCAcc*100 , KNAcc*100, RFAcc*100, RFAccMax*100, 
                                     KNAccMax*100, DTAccMax*100, GBAcc*100]})
compare.sort_values(by='Accuracy', ascending=False)

👉 From the results, some models can achieve **up to 80% accuracy**.

# 9. Output 📤
👉 The next step writes the predictions to a CSV file.

## 9.1 Making output file 📄

In [None]:
# Refit the random forest with a fixed max_leaf_nodes value before predicting on the test set
RFclassifier = RandomForestClassifier(n_estimators = 1000, random_state = 1, max_leaf_nodes=48)
RFclassifier.fit(x_train, y_train)

prediction = RFclassifier.predict(test_ds1)

In [None]:
output = pd.DataFrame({'PassengerId': test_ds['PassengerId'] , 'Survived': prediction})
output.to_csv('submission.csv', index=False)

## 9.2 Output File (CSV) 📄

In [None]:
predcsv = pd.read_csv('./submission.csv')
predcsv.head()

---


<br><br>
<center>
    <img src="https://i.imgur.com/qLGcpSt.png" alt="WM">
</center>