# The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).



![](http://cdn.britannica.com/72/153172-050-EB2F2D95/Titanic.jpg)

# Importing libraries and data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
df_train = pd.read_csv("../input/titanic/train.csv")
df_train.head()

In [None]:
df_train.drop(["PassengerId", "Ticket"], axis=1, inplace=True)
df_train.describe()

From this, we can draw several quick observations:

1. Roughly 38% of passengers survived


2. The average passenger is around 30 years old, but there is high variance as standard deviation is >14.5. There are babies and elderly onboard as the minimum age is <1 and the oldest person onboard is 80 years old. However, most passengers (up to the 75th percentile) are <39 years old.


3. Most people travelled with at most 1 sibling or spouse (the 75th percentile is 1). However, there are a few who brought a large family (up to 8 siblings/spouses).


4. Most people did not bring their parents or children onboard (up to 75th percentile). But again, there are a few who brought a large family (up to 6 parents + children). 


5. Most people paid a relatively low fare (lower than the mean) for their tickets (75% paid 31 dollars or less while the mean is 32.2 dollars). However, there are a few passengers who paid an exorbitant price for their tickets (up to 512 dollars), possibly indicating the presence of VIPs.

In [None]:
df_train.isnull().sum()

# Dealing with missing values

First, we are going to deal with the missing "Age" data. There are a few ways to impute the data, most commonly using the mean or the median. However, the mean is easily affected by extreme values and while the median is generally a good representation of our age distribution, we want a value that can represent the different demographics of our passengers. Hence, we will use salutations (Mr, Ms, Mdm etc) as a separator for the different demographics.

Let's first extract the salutations from our dataset.

In [None]:
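# Capture a capitalised word followed by a dot (e.g. "Mr.", "Miss.").
# expand=False returns a Series instead of a one-column DataFrame.
df_train["Salutations"] = df_train["Name"].str.extract(r'([A-Z][a-z]+\.)', expand=False)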

In [None]:
df_train["Salutations"].unique()
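As a quick sanity check of the pattern, here is a minimal sketch on a handful of names written in the same "Surname, Title. Given names" layout as the dataset. The `names` Series is a hand-made illustration, not data loaded from the competition files.

```python
import pandas as pd

# Hand-made examples in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# One capital letter, one or more lowercase letters, then a literal dot.
# The surname is skipped because it is followed by a comma, not a dot.
salutations = names.str.extract(r"([A-Z][a-z]+\.)", expand=False)
print(salutations.tolist())
```

The pattern keeps the trailing dot, which is why the comparisons that follow match against strings such as "Master." rather than "Master".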

From the salutations alone, we can glean even more interesting information. It seems that there were nobility (Don, Countess and Jonkheer) and military personnel (Major, Col, Capt) onboard. There are also French equivalents of English salutations (Mme = Mrs, Mlle = Miss) and clergymen (Rev, possibly Don). Let's further explore the salutations that are age-significant (Master, Miss, Mr, Mrs).

In [None]:
df_train[df_train["Salutations"] == "Master."]["Age"].median()

In [None]:
df_train[(df_train["Salutations"] == "Miss.") | (df_train["Salutations"] == "Ms.") | (df_train["Salutations"] == "Mlle.")]["Age"].median()

In [None]:
df_train[df_train["Salutations"] == "Mr."]["Age"].median()

In [None]:
df_train[(df_train["Salutations"] == "Mrs.") | (df_train["Salutations"] == "Mme.")]["Age"].median()

As we can see from their median values, each salutation represents a different age group. It is also reasonable to assume that high-ranking military personnel and nobility are usually older, so we shall group them along with other uncommon titles such as Rev and Dr. Let's now create a new feature column with these categories.

In [None]:
master = (df_train["Salutations"] == "Master.")
miss = (df_train["Salutations"] == "Miss.") | (df_train["Salutations"] == "Ms.") | (df_train["Salutations"] == "Mlle.")
mister = (df_train["Salutations"] == "Mr.")
missus = (df_train["Salutations"] == "Mrs.") | (df_train["Salutations"] == "Mme.")

In [None]:
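df_train["Title"] = "Others"
# Use .loc instead of chained indexing so the assignment always hits the original frame.
df_train.loc[master, "Title"] = "Master"
df_train.loc[miss, "Title"] = "Miss"
df_train.loc[mister, "Title"] = "Mister"
df_train.loc[missus, "Title"] = "Missus"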

We will now fill in the missing values for "Age" according to their titles. We will use the median for each demographic as a replacement.

In [None]:
# transform("median") returns one value per row, aligned with the original index.
df_train["Age"] = df_train["Age"].fillna(df_train.groupby("Title")["Age"].transform("median"))
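To see why a per-title median can behave differently from a single global median, here is a small sketch on made-up numbers. The `toy` frame and its ages are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Invented data: two young "Master" passengers, two adult "Mister" passengers,
# and one missing age in each group.
toy = pd.DataFrame({
    "Title": ["Master", "Master", "Master", "Mister", "Mister", "Mister"],
    "Age": [2.0, 6.0, np.nan, 30.0, 40.0, np.nan],
})

# One median for everybody: both gaps receive the same value.
global_fill = toy["Age"].fillna(toy["Age"].median())

# One median per title: each gap receives the median of its own group.
group_median = toy.groupby("Title")["Age"].transform("median")
group_fill = toy["Age"].fillna(group_median)

print(global_fill.tolist())
print(group_fill.tolist())
```

With the global median, the missing "Master" is assigned the age of a teenager, while the group-wise fill keeps him a child.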

Now we have to deal with the missing data for "Cabin". A large proportion of it is missing, but the little amount of data that we have is important as the first letter of each cabin represents the deck. 'A' is at the top, 'B' is below 'A' and so on. We will replace all missing values with 'Z' and come back to analyze this later on.

In [None]:
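# Keep only the deck letter; expand=False returns a Series rather than a one-column DataFrame.
df_train["Cabin"] = df_train["Cabin"].str.extract(r'([A-Z])', expand=False).fillna('Z')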

Finally, we have to deal with 2 missing values in "Embarked". Let's just replace them with the most common value.

In [None]:
df_train["Embarked"].mode()

In [None]:
df_train["Embarked"] = df_train["Embarked"].fillna('S')
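The same fill can be written without hard-coding the letter. The sketch below uses a made-up `ports` Series and assumes that taking the first entry of `mode()` is acceptable when several values tie.

```python
import numpy as np
import pandas as pd

# Made-up port codes with two gaps.
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores missing values and returns a Series (there may be ties),
# so take its first entry.
most_common = ports.mode()[0]
filled = ports.fillna(most_common)
print(most_common, filled.tolist())
```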

Now, let's check our dataset before we start analysing the data.

In [None]:
df_train.info()

Great. We managed to replace all missing values. Let's drop the irrelevant features in our dataset and take a final look at our clean data.

In [None]:
df_train.drop(["Name", "Salutations"], axis=1, inplace=True)
df_train.head()

# Data Visualisation & Feature Engineering

We will take a look at each feature and see how they relate to survival.

In [None]:
plt.figure(figsize=(8,8))
sns.countplot(x="Pclass", data=df_train, hue="Survived")

Next, let's take a look at sex and see how it affects survival.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(10, 6))
ax = sns.countplot(x="Sex", data=df_train, hue="Survived", ax=axes[0])
ax1=df_train["Sex"].value_counts().plot.pie(autopct='%1.1f%%', ax = axes[1])

It is evident that females had a higher survival rate than males (65% of passengers were male and most of them did not survive). Let's convert this categorical data to numerical: 1 for female and 0 for male.

In [None]:
sex = {"male": 0, "female": 1}
df_train["Sex"] = df_train["Sex"].map(sex)

Next, we want to see how age affects chances of survival. This is especially important as we previously identified that there is high variance in age, that 75% of passengers are <39 years old, and there are babies and elderly onboard. We want to see how those factors are related to survival. Let's first plot the distribution of age against survival.

In [None]:
plt.figure(figsize=(8,8))
survived = df_train[df_train["Survived"] == 1]
not_survived =  df_train[df_train["Survived"] == 0]

# histplot replaces the deprecated distplot.
sns.histplot(survived["Age"], kde=False, label='Survived')
sns.histplot(not_survived["Age"], kde=False, label='Did not survive')
plt.legend()

It is difficult to relate survival to age alone, as both groups seem to follow the same distribution. Let's try including sex as a factor, since we saw that the majority of survivors were female.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12, 8))
men = df_train[df_train["Sex"] == 0]
women = df_train[df_train["Sex"] == 1]

# histplot replaces the deprecated distplot; both calls draw on the same axes.
ax = sns.histplot(women[women['Survived']==1]["Age"], label="Survived", ax=axes[0], kde=False, bins=20)
sns.histplot(women[women['Survived']==0]["Age"], label="Did not survive", ax=axes[0], kde=False, bins=20)
ax.legend()
ax.set_title('Female')

ax = sns.histplot(men[men['Survived']==1]["Age"], label="Survived", ax=axes[1], kde=False, bins=20)
sns.histplot(men[men['Survived']==0]["Age"], label="Did not survive", ax=axes[1], kde=False, bins=20)
ax.legend()
ax.set_title('Male')

Awesome. Now we can see that most children of both sexes aged under about 8 survived, women have a high survival rate between ages 14 and 40, and men slightly under 30 have the highest risk of dying. Let's translate age into a categorical feature with roughly equal-sized groups, with the help of the qcut function from pandas.

In [None]:
pd.qcut(df_train[df_train["Age"]>8]["Age"], 5)
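For readers new to `qcut`, the sketch below contrasts it with `pd.cut` on a made-up list of ages. `qcut` picks edges so that each bin holds about the same number of rows, whereas `cut` with an integer makes bins of equal width.

```python
import pandas as pd

# Made-up ages, skewed towards younger passengers.
ages = pd.Series([9, 12, 18, 21, 24, 27, 30, 35, 40, 60])

# Equal-frequency bins: edges are quantiles of the data.
equal_freq = pd.qcut(ages, 5)
# Equal-width bins: edges are evenly spaced between min and max.
equal_width = pd.cut(ages, 5)

freq_counts = equal_freq.value_counts().sort_index().tolist()
width_counts = equal_width.value_counts().sort_index().tolist()
print(freq_counts)   # rows per quantile bin
print(width_counts)  # rows per equal-width bin
```

The interval edges printed by `qcut` on the real column are what the hand-written thresholds in the next cell are based on.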

In [None]:
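# Children (the group excluded from qcut above), then the five quantile bins.
# The ranges are contiguous and non-overlapping so every passenger gets exactly one category.
age = (df_train["Age"] <= 8)
age1 = (df_train["Age"] > 8) & (df_train["Age"] < 21)
age2 = (df_train["Age"] >= 21) & (df_train["Age"] < 28)
age3 = (df_train["Age"] >= 28) & (df_train["Age"] < 30)
age4 = (df_train["Age"] >= 30) & (df_train["Age"] < 39)
age5 = (df_train["Age"] >= 39)

In [None]:
# The masks were computed from the raw ages, so the order of assignment does not matter.
df_train.loc[age, "Age"] = 0
df_train.loc[age1, "Age"] = 1
df_train.loc[age2, "Age"] = 2
df_train.loc[age3, "Age"] = 3
df_train.loc[age4, "Age"] = 4
df_train.loc[age5, "Age"] = 5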

Next, we will combine the number of siblings/spouses and parents/children to get the total number of family members onboard. Then we will take a look at how that relates to survival.

In [None]:
plt.figure(figsize=(10, 10))
df_train["Family"] = df_train["SibSp"] + df_train["Parch"]
survived = df_train[df_train["Survived"] == 1]
not_survived =  df_train[df_train["Survived"] == 0]

# histplot replaces the deprecated distplot.
sns.histplot(survived["Family"], kde=False, label="Survived", bins=50)
sns.histplot(not_survived["Family"], kde=False, label="Did not survive", bins=50)
plt.legend()

From here, we can see that chances of survival are lowest for passengers with no family onboard and for those with 4 or more family members. Let's reflect that as a feature column.

In [None]:
none = (df_train["Family"] == 0)
four = (df_train["Family"] >= 4)

# 0 = travelling alone, 1 = small family (1 to 3), 2 = large family (4 or more)
df_train["Fam_Cat"] = 1
df_train.loc[none, "Fam_Cat"] = 0
df_train.loc[four, "Fam_Cat"] = 2
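Reading survival rates off overlapping histograms can be tricky, so a numeric summary is a useful cross-check. The sketch below shows the idea on an invented `toy` frame; the numbers are not taken from the Titanic data.

```python
import pandas as pd

# Invented example: family size and a 0/1 survival flag.
toy = pd.DataFrame({
    "Family": [0, 0, 0, 0, 1, 1, 2, 2, 5, 5],
    "Survived": [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column within each group is that group's survival rate.
rate_by_family = toy.groupby("Family")["Survived"].mean()
print(rate_by_family)
```

Applied to the real frame, the same `groupby(...).mean()` call gives one survival rate per family size.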

Now we will look at how Fare paid affects your chances of survival.

In [None]:
plt.figure(figsize=(10, 10))
survived = df_train[df_train["Survived"] == 1]
not_survived =  df_train[df_train["Survived"] == 0]

# histplot replaces the deprecated distplot.
sns.histplot(survived["Fare"], kde=False, label='Survived', bins=100, color='green')
sns.histplot(not_survived["Fare"], kde=False, label='Did not survive', bins=100, color='red')
plt.legend()

From here, we can see that passengers who paid less had relatively lower chances of survival. This is especially so for fares below 10 dollars and between 10 and 20 dollars. Chances of survival increased for fares above 50 dollars. Let's bin the fares into equal-sized categories (as we did for age).

In [None]:
pd.qcut(df_train["Fare"], 6)

In [None]:
fare = (df_train["Fare"] < 7.775)
fare1 = (df_train["Fare"] >= 7.775) & (df_train["Fare"] < 8.662)
fare2 = (df_train["Fare"] >= 8.662) & (df_train["Fare"] < 14.454)
fare3 = (df_train["Fare"] >= 14.454) & (df_train["Fare"] < 26)
fare4 = (df_train["Fare"] >= 26) & (df_train["Fare"] < 52.369)
fare5 = (df_train["Fare"] >= 52.369)

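# Use .loc instead of chained indexing so the assignment always hits the original frame.
df_train.loc[fare, "Fare"] = 0
df_train.loc[fare1, "Fare"] = 1
df_train.loc[fare2, "Fare"] = 2
df_train.loc[fare3, "Fare"] = 3
df_train.loc[fare4, "Fare"] = 4
df_train.loc[fare5, "Fare"] = 5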
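As an aside, the six masks can be collapsed into a single `pd.cut` call. This is a sketch of an alternative, not what the notebook itself runs; the `fares` values are made up, while the edges are the thresholds used above.

```python
import numpy as np
import pandas as pd

# Made-up fares that sit on and around the thresholds.
fares = pd.Series([0.0, 7.25, 7.775, 8.05, 13.0, 14.454, 26.0, 51.86, 52.369, 512.33])

# Same thresholds as the masks, with open-ended first and last bins.
edges = [-np.inf, 7.775, 8.662, 14.454, 26, 52.369, np.inf]

# right=False makes each bin [lower, upper), matching the ">= lower and < upper" masks.
# labels=False returns the integer bin index instead of an Interval.
fare_cat = pd.cut(fares, bins=edges, right=False, labels=False)
print(fare_cat.tolist())
```

Keeping the edges in one list also makes it easier to apply exactly the same binning to the test set later.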

Now, let's analyze the port of embarkation.

In [None]:
sns.catplot(x='Sex', y='Survived', kind='bar', data=df_train, hue='Embarked', palette='rocket', aspect=1.3)

Survival rate seems to be highest if embarked from C, and uncertain if embarked from S and Q. Since there are only 3 ports, let's convert this to numeric values.

In [None]:
port = {'S': 1, 'C': 2, 'Q': 3}
df_train["Embarked"] = df_train["Embarked"].map(port)

Finally, let's convert Cabin categories to numbers. According to the diagram below, A is nearest to the top, hence we can assume that passengers in Cabin A have the highest chance of survival, followed by B and then C and so on. We won't analyse this as there is too much missing data to draw conclusions about the overall population of passengers.

![](https://upload.wikimedia.org/wikipedia/commons/8/84/Titanic_cutaway_diagram.png)

In [None]:
df_train["Cabin"].unique()

In [None]:
cabin = {"Z": 8, "T": 7, "G": 6, "F": 5, "E": 4, "D": 3, "C": 2, "B": 1, "A": 0}
df_train["Cabin"] = df_train["Cabin"].map(cabin)

Let's drop all irrelevant columns and prep our data for model building.

In [None]:
df_train.head()

In [None]:
df = df_train.drop(["SibSp", "Parch", "Family"], axis=1)
df.head()

Let's map our "Title" category and create a few final features for Pclass, Sex, Age and Fare as these seem to be the most important features.

In [None]:
title = {"Master": 0, "Miss": 1, "Mister": 3, "Missus": 4, "Others": 5}
df["Title"] = df["Title"].map(title)

df["Age"] = df["Age"].astype(int)
df["Fare"] = df["Fare"].astype(int)

df["Pclass_Sex"] = df["Pclass"]*df["Sex"]
df["Pclass_Age"] = df["Pclass"]*df["Age"]
df["Pclass_Fare"] = df["Pclass"]*df["Fare"]
df["Sex_Age"] = df["Sex"]*df["Age"]
df["Sex_Fare"] = df["Sex"]*df["Fare"]
df["Age_Fare"] = df["Age"]*df["Fare"]
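One property of these product features is worth keeping in mind: multiplying by a 0/1 column zeroes the result wherever that column is 0. The sketch below illustrates this on an invented four-row frame, using the same 0 = male, 1 = female encoding as above.

```python
import pandas as pd

# Invented rows: two males (Sex = 0) and two females (Sex = 1) in classes 1 and 3.
toy = pd.DataFrame({"Pclass": [1, 3, 1, 3], "Sex": [0, 0, 1, 1]})

# The product keeps the class for females and collapses every male to 0.
toy["Pclass_Sex"] = toy["Pclass"] * toy["Sex"]
print(toy)
```

So `Pclass_Sex` effectively means "class, but only for women". The original `Pclass` column stays in the frame, so the class information for men is not lost.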

In [None]:
df.head()

# Model Building and Training

In this section, we will be using an algorithm known as eXtreme Gradient Boosting (XGBoost). It is somewhat similar to a Random Forest Classifier, but it trains decision trees sequentially (one at a time), and each tree is designed to rectify the errors made by the previous trees through gradient descent. It is one of the best-known classification algorithms for its performance and speed.
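To make the "each tree corrects the previous ones" idea concrete, here is a minimal sketch of plain gradient boosting for squared error, using one-split trees (stumps) written in NumPy. It is a simplified illustration under stated assumptions, not XGBoost's actual algorithm, which adds regularisation, second-order gradient information and many engineering optimisations. The helper names `fit_stump` and `predict_stump` and the data are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residuals with two constants."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the largest value so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Made-up one-dimensional regression data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, 2.0, 3.0, 5.0, 5.0])

learning_rate = 0.5
prediction = np.full_like(y, y.mean())        # start from a constant guess
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residual = y - prediction                 # negative gradient of the squared error
    stump = fit_stump(x, residual)            # the next "tree" is fitted to the current errors
    prediction = prediction + learning_rate * predict_stump(stump, x)
    errors.append(np.mean((y - prediction) ** 2))

print(errors[0], errors[-1])
```

The training error shrinks as stumps are added, and the `learning_rate` controls how much of each correction is applied. That is the role the `learning_rate` and `n_estimators` parameters play in the classifier used below.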

We will split our train data into two different sets of data. 80% of it is for training our model, 20% of it is set aside for validation (to see how our model generalizes to new data).

In [None]:
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2)

Now, we will manually select parameters based on commonly used values.

In [None]:
from xgboost import XGBClassifier
classifier = XGBClassifier(nthread=1, colsample_bytree=0.8, learning_rate=0.03, max_depth=4, min_child_weight=2, n_estimators=1000, subsample=0.8)
classifier.fit(X_train, y_train)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_val)
cm = confusion_matrix(y_val, y_pred)
print(cm)
accuracy_score(y_val, y_pred)
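For reference, accuracy is simply the share of the confusion matrix that lies on its diagonal. The sketch below rebuilds both by hand on invented labels, using the same layout as scikit-learn (rows are true classes, columns are predicted classes). The names `toy_true`, `toy_pred` and `toy_cm` are illustrative.

```python
import numpy as np

# Invented true labels and predictions for ten passengers.
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
toy_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows: true class, columns: predicted class.
toy_cm = np.zeros((2, 2), dtype=int)
for t, p in zip(toy_true, toy_pred):
    toy_cm[t, p] += 1

# Correct predictions sit on the diagonal.
toy_accuracy = np.trace(toy_cm) / toy_cm.sum()
print(toy_cm)
print(toy_accuracy)
```

The off-diagonal cells separate the two kinds of mistakes (predicted survival for someone who died, and the reverse), which a single accuracy number hides.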

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 5)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Now, we want to see how each feature contributed to the prediction of our values. From the feature importance graph below, it seems that our engineered features were quite useful.

In [None]:
from xgboost import plot_importance
fig, ax = plt.subplots(figsize=(20, 10))
plot_importance(classifier, ax=ax)

Results seem decent, but definitely can be tweaked. Let's try a Grid Search with different parameter values (the grid search algorithm will test all combinations of the parameters we specify and return the best model based on our scoring metric, which is accuracy). 

In [None]:
params = {
        'min_child_weight': [1, 2],
        'learning_rate': [0.2, 0.3, 0.02, 0.03],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [4, 5, 6],
        'n_estimators': [100, 500, 750, 1000]
        }

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
grid = GridSearchCV(estimator=classifier, param_grid=params, scoring='accuracy', n_jobs=4, cv=5, verbose=3)
grid.fit(X, y)
print('\n Best estimator:')
print(grid.best_estimator_)
print('\n Best score:')
print(grid.best_score_)
print('\n Best parameters:')
print(grid.best_params_)
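Because grid search tries every combination, its cost grows multiplicatively with each parameter list. The sketch below counts the work implied by the grid above; `grid_values` is a copy of that dictionary and `n_folds` matches `cv=5`.

```python
from itertools import product

# Copy of the parameter grid used above.
grid_values = {
    "min_child_weight": [1, 2],
    "learning_rate": [0.2, 0.3, 0.02, 0.03],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [4, 5, 6],
    "n_estimators": [100, 500, 750, 1000],
}
n_folds = 5

# Every combination of one value per parameter.
n_combinations = len(list(product(*grid_values.values())))
n_fits = n_combinations * n_folds
print(n_combinations, n_fits)  # 864 combinations, 4320 model fits
```

Adding one more value to any list multiplies the total again, so it pays to keep each list short or to switch to a randomised search when the grid grows.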

# Model Testing and Predictions

Finally, let's work on the actual test data set and return our predictions.

In [None]:
df_test = pd.read_csv("../input/titanic/test.csv")
df_test.head()

In [None]:
df_test.isnull().sum()

In [None]:
df_test["Salutations"] = df_test["Name"].str.extract(r'([A-Z][a-z]+\.)', expand=False)

master = (df_test["Salutations"] == "Master.")
miss = (df_test["Salutations"] == "Miss.") | (df_test["Salutations"] == "Ms.") | (df_test["Salutations"] == "Mlle.")
mister = (df_test["Salutations"] == "Mr.")
missus = (df_test["Salutations"] == "Mrs.") | (df_test["Salutations"] == "Mme.")

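df_test["Title"] = "Others"
# Use .loc instead of chained indexing so the assignment always hits the original frame.
df_test.loc[master, "Title"] = "Master"
df_test.loc[miss, "Title"] = "Miss"
df_test.loc[mister, "Title"] = "Mister"
df_test.loc[missus, "Title"] = "Missus"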

In [None]:
# transform("median") returns one value per row, aligned with the original index.
df_test["Age"] = df_test["Age"].fillna(df_test.groupby("Title")["Age"].transform("median"))
df_test["Cabin"] = df_test["Cabin"].str.extract(r'([A-Z])', expand=False).fillna('Z')
df_test["Fare"].median()

In [None]:
df_test["Fare"] = df_test["Fare"].fillna(14.4542)
df_test.head()

In [None]:
df_test.drop(["Name", "Salutations", "Ticket"], axis=1, inplace=True)
df_test.head()

In [None]:
sex = {"male": 0, "female": 1}
df_test["Sex"] = df_test["Sex"].map(sex)


# Contiguous, non-overlapping age ranges so every passenger gets exactly one category.
age = (df_test["Age"] <= 8)
age1 = (df_test["Age"] > 8) & (df_test["Age"] < 21)
age2 = (df_test["Age"] >= 21) & (df_test["Age"] < 28)
age3 = (df_test["Age"] >= 28) & (df_test["Age"] < 30)
age4 = (df_test["Age"] >= 30) & (df_test["Age"] < 39)
age5 = (df_test["Age"] >= 39)

df_test.loc[age, "Age"] = 0
df_test.loc[age1, "Age"] = 1
df_test.loc[age2, "Age"] = 2
df_test.loc[age3, "Age"] = 3
df_test.loc[age4, "Age"] = 4
df_test.loc[age5, "Age"] = 5

df_test["Family"] = df_test["SibSp"] + df_test["Parch"]
none = (df_test["Family"] == 0)
four = (df_test["Family"] >= 4)

df_test["Fam_Cat"] = 1
df_test.loc[none, "Fam_Cat"] = 0
df_test.loc[four, "Fam_Cat"] = 2

fare = (df_test["Fare"] < 7.775)
fare1 = (df_test["Fare"] >= 7.775) & (df_test["Fare"] < 8.662)
fare2 = (df_test["Fare"] >= 8.662) & (df_test["Fare"] < 14.454)
fare3 = (df_test["Fare"] >= 14.454) & (df_test["Fare"] < 26)
fare4 = (df_test["Fare"] >= 26) & (df_test["Fare"] < 52.369)
fare5 = (df_test["Fare"] >= 52.369)

df_test.loc[fare, "Fare"] = 0
df_test.loc[fare1, "Fare"] = 1
df_test.loc[fare2, "Fare"] = 2
df_test.loc[fare3, "Fare"] = 3
df_test.loc[fare4, "Fare"] = 4
df_test.loc[fare5, "Fare"] = 5

port = {'S': 1, 'C': 2, 'Q': 3}
df_test["Embarked"] = df_test["Embarked"].map(port)

cabin = {"Z": 8, "T": 7, "G": 6, "F": 5, "E": 4, "D": 3, "C": 2, "B": 1, "A": 0}
df_test["Cabin"] = df_test["Cabin"].map(cabin)

title = {"Master": 0, "Miss": 1, "Mister": 3, "Missus": 4, "Others": 5}
df_test["Title"] = df_test["Title"].map(title)


df_test["Age"] = df_test["Age"].astype(int)
df_test["Fare"] = df_test["Fare"].astype(int)

df_test["Pclass_Sex"] = df_test["Pclass"]*df_test["Sex"]
df_test["Pclass_Age"] = df_test["Pclass"]*df_test["Age"]
df_test["Pclass_Fare"] = df_test["Pclass"]*df_test["Fare"]
df_test["Sex_Age"] = df_test["Sex"]*df_test["Age"]
df_test["Sex_Fare"] = df_test["Sex"]*df_test["Fare"]
df_test["Age_Fare"] = df_test["Age"]*df_test["Fare"]

df_test.head()

In [None]:
df_val = df_test.drop(["PassengerId", "SibSp", "Parch", "Family"], axis=1)
df_val.head()

In [None]:
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

xgb = XGBClassifier(nthread=1, colsample_bytree=1, learning_rate=0.2, max_depth=4, min_child_weight=2, n_estimators=100, subsample=0.8)
xgb.fit(X, y)

In [None]:
X_test = df_val
actual_pred = xgb.predict(X_test)

In [None]:
df_actual = pd.DataFrame(df_test["PassengerId"])
df_actual["Survived"] = actual_pred

In [None]:
df_actual.head()

In [None]:
df_actual.to_csv('Final Predictions.csv', index=False)

Done! Hope you enjoyed this walkthrough with me!