## 🔗 **[Read Full Blog - Complete ML Project Titanic Survival Prediction](https://copyassignment.com/titanic-survival-prediction-machine-learning-project-part-1/)**

# **Will You Survive on Titanic?** 🚢

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **1. Importing Necessary Libraries 📚**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree,svm
from sklearn.metrics import accuracy_score

# **2. Loading Dataset 📊**

In [None]:
train_data = pd.read_csv('/kaggle/input/titanic/train.csv')

# Printing first 10 rows of the dataset
train_data.head(10)

**Types of Variables:**
1. Continuous: Age & Fare
2. Categorical: Sex & Embarked
3. Discrete: SibSp & Parch
4. Alphanumeric: Cabin

In [None]:
print('The shape of our training set: %s passengers and %s features'%(train_data.shape[0],train_data.shape[1]))

In [None]:
train_data.info()

As `info()` shows, the training set has 891 entries in total, but the Age, Cabin and Embarked columns have fewer than 891 non-null values, so these columns contain missing values. We have to preprocess our data before training our ML model.

**Checking NULL Values**

In [None]:
train_data.isnull().sum()

* 177 missing values in the Age column.
* 687 missing values in the Cabin column.
* 2 missing values in the Embarked column.
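
Raw counts are easier to judge as a share of the 891 rows: 177, 687 and 2 missing values correspond to roughly 19.9%, 77.1% and 0.2% of the column. The sketch below shows one way to compute such percentages. It runs on a small toy frame with made-up values (the name `toy` and its contents are assumptions for illustration, not the real dataset); on the notebook's data you would call the same expression on `train_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() gives True/False per cell; the column mean is the missing fraction
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))
```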

## **Plotting Heat Map 🗺️** 
* To see the correlation between target variable and other parameters.

In [None]:
# Set the figure size before plotting so that it applies to this heatmap
sns.set(rc={'figure.figsize':(12,10)})
heatmap = sns.heatmap(train_data[["Survived", "SibSp", "Parch", "Age", "Fare"]].corr(), annot = True)

**Conclusion:** 
Among these numeric features, only Fare seems to have a noticeable correlation with the survival probability.
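
The heatmap values can also be read off as numbers by taking the `Survived` column of the correlation matrix and ranking it by absolute value. This is a minimal sketch on a toy frame with invented values (`toy`, `corr_with_target` and `ranked` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Toy stand-in for train_data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Fare": [7.25, 71.28, 53.10, 8.05, 30.00, 8.46],
    "SibSp": [1, 1, 0, 0, 0, 2],
})

# Pearson correlation of every feature with the target column
corr_with_target = toy.corr()["Survived"].drop("Survived")

# Order features by strength of correlation, keeping the sign
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```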

# **3. Exploratory Data Analysis 📉**

Now we're going to visualise how each variable relates to the target variable, i.e., Survived.

## **(A) SibSp - Number of Siblings / Spouses aboard the Titanic**

In [None]:
# Finding unique values
train_data['SibSp'].unique()

In [None]:
bargraph_sibsp = sns.catplot(x = "SibSp", y = "Survived", data = train_data, kind="bar", height = 8)

**Conclusion:** 
* Passengers with 1 or 2 siblings/spouses aboard have good chances of survival
* More siblings/spouses aboard -> Lower chances of survival

## **(B) AGE**

In [None]:
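# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot
ageplot = sns.FacetGrid(train_data, col="Survived", height = 7)
ageplot = ageplot.map(sns.histplot, "Age", kde=True)
# The y-axis shows the number of passengers per age bin, not a survival probability
ageplot = ageplot.set_ylabels("Count")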

**Conclusion:** 
Higher age -> lower chances of survival!

## **(C) Sex**

In [None]:
sexplot = sns.barplot(x="Sex", y="Survived", data=train_data)

**Conclusion:** From the above graph it's clear that females had a much higher chance of survival than males.
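
The height of each bar in these plots is the mean of `Survived` within the group, so the same numbers can be obtained with a `groupby`. A minimal sketch on a toy frame with made-up rows (the name `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for train_data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column per group = share of survivors in that group
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

The same pattern works for any of the categorical columns plotted in this section, for example `Pclass` or `Embarked`.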

## **(D) Pclass**

In [None]:
pclassplot = sns.catplot(x = "Pclass", y="Survived", data = train_data, kind="bar", height = 6)

**Higher class -> More chances of survival**

## **(E) Pclass vs Survived By Sex**

In [None]:
a = sns.catplot(x = "Pclass", y="Survived", hue="Sex", data=train_data, height = 7, kind="bar")

**Conclusion:**
* In each class females have much higher chances of survival in comparison to male passengers.

## **(F) Embarked**

In [None]:
train_data["Embarked"].isnull().sum()

In [None]:
train_data["Embarked"].value_counts()

In [None]:
# Filling 2 missing values with most frequent value
train_data["Embarked"] = train_data["Embarked"].fillna('S')
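
The value `'S'` above is hard-coded from the `value_counts()` output. As a hedged alternative, the most frequent value can be looked up with `mode()`, so the cell keeps working if the data changes. The sketch uses a toy frame with invented values, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Embarked column (hypothetical values)
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() ignores missing values and returns the most frequent entries
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(most_common, toy["Embarked"].isnull().sum())
```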

In [None]:
sns.catplot(x="Embarked", y="Survived", data=train_data, height = 5, kind="bar")

## **(G) Pclass vs Survived by Embarked**

In [None]:
sns.catplot(x="Pclass", col="Embarked", data = train_data, kind="count", height=7)

The majority of passengers who embarked at station C travelled in 1st class. That's likely why the survival probability of passengers who embarked at C is higher.

# **4. Data Preprocessing (Cleaning) 🧹**

In [None]:
train_data.isnull().sum()

* 177 missing values in the Age column.
* 687 missing values in the Cabin column.

We have to deal with these missing values in order to build a good ML model.

## **(A) Handling Missing Values of Age Column**

In [None]:
mean = train_data["Age"].mean()
std = train_data["Age"].std()
print(mean)
print(std)

In [None]:
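# Count the missing ages instead of hard-coding 177, and pass integer bounds to randint
rand_age = np.random.randint(int(mean - std), int(mean + std), size = train_data["Age"].isnull().sum())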
age_slice = train_data["Age"].copy()

age_slice[np.isnan(age_slice)] = rand_age
train_data["Age"] = age_slice
train_data.isnull().sum()
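
Because the ages above are drawn at random, every run of the notebook fills in different values. The sketch below wraps the same idea (random integers within one standard deviation of the mean) in a helper with a fixed seed so that the result is reproducible. The function name `fill_age_randomly`, the `seed` parameter and the toy ages are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def fill_age_randomly(ages: pd.Series, seed: int = 0) -> pd.Series:
    """Fill missing ages with random integers in [mean - std, mean + std)."""
    rng = np.random.default_rng(seed)  # seeded generator for repeatable draws
    low = int(ages.mean() - ages.std())
    high = int(ages.mean() + ages.std())
    filled = ages.copy()
    missing = filled.isnull()
    filled[missing] = rng.integers(low, high, size=missing.sum())
    return filled

# Toy ages (hypothetical values, not the real dataset)
toy_ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 54.0])
filled_ages = fill_age_randomly(toy_ages)
print(filled_ages)
```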

## **(B) Dropping 🗑️ Columns**

In [None]:
col_to_drop = ["PassengerId", "Ticket", "Cabin", "Name"]
train_data.drop(col_to_drop, axis=1, inplace=True)
train_data.head(10)

## **(C) Converting Categorical Variables to Numeric**

In [None]:
genders = {"male":0, "female":1}
train_data["Sex"] = train_data["Sex"].map(genders)

ports = {"S":0, "C":1, "Q":2}
train_data["Embarked"] = train_data["Embarked"].map(ports)

train_data.head()
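
Mapping the ports to 0, 1, 2 gives them an order (S < C < Q) that the ports do not really have. One common alternative, shown here only as a sketch and not as the approach used in this notebook, is one-hot encoding with `pd.get_dummies`. The toy frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the Embarked column (hypothetical values)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One 0/1 indicator column per port instead of a single ordered code
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)
```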

# **5. Building Machine Learning Model 🤖**

In [None]:
df_train_x = train_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]

# Target variable column
# Use a 1-D Series for the target; a one-column DataFrame triggers a scikit-learn shape warning
df_train_y = train_data['Survived']

x_train, x_test, y_train, y_test = train_test_split(df_train_x, df_train_y, test_size=0.20, random_state=42)
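
The split above is purely random, so the share of survivors in the two parts can differ slightly. As an optional variation, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. A minimal sketch on synthetic labels (the arrays below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 40% positive class (hypothetical, not the Titanic data)
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 40 + [0] * 60)

# stratify=y preserves the 40/60 class ratio in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())
```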

## **(A) Random Forest Classifier 🌳🌳🌳🌳**

In [None]:
clf1 = RandomForestClassifier()
clf1.fit(x_train, y_train)
rfc_y_pred = clf1.predict(x_test)
rfc_accuracy = accuracy_score(y_test,rfc_y_pred) * 100
print("accuracy=",rfc_accuracy)

## **(B) Logistic Regression**

In [None]:
# A higher max_iter gives the solver room to converge on these unscaled features
clf2 = LogisticRegression(max_iter=1000)
clf2.fit(x_train, y_train)
lr_y_pred = clf2.predict(x_test)
lr_accuracy = accuracy_score(y_test,lr_y_pred)*100

print("accuracy=",lr_accuracy)

## **(C) K-Neighbor Classifier**

In [None]:
clf3 = KNeighborsClassifier(n_neighbors=5)
clf3.fit(x_train, y_train)
knc_y_pred = clf3.predict(x_test)
knc_accuracy = accuracy_score(y_test,knc_y_pred)*100

print("accuracy=",knc_accuracy)
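
K-nearest neighbours compares passengers by distance, so a feature with a large range such as Fare can outweigh a feature with a small range such as Pclass. A common remedy is to standardise the features first. The sketch below shows this with a scikit-learn pipeline on synthetic data generated by `make_classification`; the data and the variable names are assumptions, and the resulting score says nothing about the Titanic dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 7 features (hypothetical, not the Titanic features)
X, y = make_classification(n_samples=200, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=42)

# Standardise each feature, then classify with the 5 nearest neighbours
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)
scaled_knn_accuracy = scaled_knn.score(X_te, y_te) * 100
print("accuracy=", scaled_knn_accuracy)
```

The same wrapper can be used with `svm.SVC()`, which is also sensitive to feature scale.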

## **(D) Decision Tree Classifier**

In [None]:
clf4 = tree.DecisionTreeClassifier()
clf4 = clf4.fit(x_train, y_train)
dtc_y_pred = clf4.predict(x_test)
dtc_accuracy = accuracy_score(y_test,dtc_y_pred)*100

print("accuracy=",dtc_accuracy)

## **(E) Support Vector Machine**

In [None]:
clf5 = svm.SVC()
clf5.fit(x_train, y_train)
svm_y_pred = clf5.predict(x_test)
svm_accuracy = accuracy_score(y_test,svm_y_pred)*100
print("accuracy=",svm_accuracy)

# **Accuracy of all 5 Classifiers**

In [None]:
print("Accuracy of Random Forest Classifier =",rfc_accuracy)
print("Accuracy of Logistic Regressor =",lr_accuracy)
print("Accuracy of K-Neighbor Classifier =",knc_accuracy)
print("Accuracy of Decision Tree Classifier = ",dtc_accuracy)
print("Accuracy of Support Vector Machine Classifier = ",svm_accuracy)

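Since we got the maximum accuracy score with the Random Forest Classifier, we choose it for making predictions on test.csv. Note that the exact scores can change from run to run, because neither the random age imputation nor the random forest is seeded.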
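
A single 80/20 split gives only one accuracy estimate per model. As a hedged sketch of a more stable comparison, k-fold cross-validation averages the accuracy over several splits. The example uses synthetic data from `make_classification` and only two of the five models; the data, the `models` dictionary and the fixed `random_state` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical, not the Titanic features)
X, y = make_classification(n_samples=200, n_features=7, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Mean accuracy over 5 folds, in percent, for each model
cv_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_accuracy[name] = scores.mean() * 100
    print("Mean CV accuracy of", name, "=", cv_accuracy[name])
```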

# **Making Prediction for Test.csv 📝**

In [None]:
# Importing test.csv
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')
test_data.head(10)

In [None]:
test_data.info()

In [None]:
test_data.isnull().sum()

## **Data Preprocessing for testing data**

In [None]:
# Replacing missing values of age column
mean = test_data["Age"].mean()
std = test_data["Age"].std()
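# Count the missing ages instead of hard-coding 86, and pass integer bounds to randint
rand_age = np.random.randint(int(mean - std), int(mean + std), size = test_data["Age"].isnull().sum())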
age_slice = test_data["Age"].copy()
age_slice[np.isnan(age_slice)] = rand_age
test_data["Age"] = age_slice

# Replacing missing value of Fare column
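# Assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame in newer pandas
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())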

test_data.isnull().sum()

In [None]:
col_to_drop = ["PassengerId", "Ticket", "Cabin", "Name"]
test_data.drop(col_to_drop, axis=1, inplace=True)
test_data.head(10)

In [None]:
genders = {"male":0, "female":1}
test_data["Sex"] = test_data["Sex"].map(genders)

ports = {"S":0, "C":1, "Q":2}
test_data["Embarked"] = test_data["Embarked"].map(ports)

test_data.head()

# **Final Submission ✔️**

In [None]:
# Select the feature columns in the same order that was used for training
x_test = test_data[df_train_x.columns]
y_pred = clf1.predict(x_test)
originaltest_data = pd.read_csv('/kaggle/input/titanic/test.csv')
submission = pd.DataFrame({
        "PassengerId": originaltest_data["PassengerId"],
        "Survived": y_pred
    })
submission.head(20)
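
To hand in the predictions, the `submission` frame still has to be written to a CSV file with the two columns `PassengerId` and `Survived` and without the pandas index. The sketch below shows the `to_csv` call on a toy frame with invented values and writes to an in-memory buffer; in the notebook you would pass a file name such as `'submission.csv'` (an assumed name) instead of the buffer.

```python
import io
import pandas as pd

# Toy stand-in for the submission frame (hypothetical values)
toy_submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# index=False keeps the row index out of the file
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```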