# Titanic: Machine Learning from Disaster
---

### Overview
This notebook contains work for the [Kaggle Titanic Competition](https://www.kaggle.com/c/titanic). From the website:

> The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

> One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

> In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

We will attempt to build a model that predicts which passengers aboard the Titanic survived. The data comes from https://www.kaggle.com/c/titanic/data.

### Load data

In [1]:
# Import Libraries
import pandas as pd
import numpy as np

# Set options
from IPython.display import display
pd.set_option('display.max_columns', 15)

# Load the train and test datasets into two DataFrames
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

### Data Exploration

Let's start by getting a general sense for the data that we will be dealing with.

In [5]:
# Get dimensions
print("The training set has {} observations and {} variables.".format(*train.shape))
print("The training set has the following NaN values:")
print(train.isnull().sum())
print("The testing set has {} observations and {} variables.".format(*test.shape))
print("The testing set has the following NaN values:")
print(test.isnull().sum())

# Explore the data
display(train.head(5))
display(train.describe())

The training set has 891 observations and 12 variables.
The training set has the following NaN values:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
The testing set has 418 observations and 11 variables.
The testing set has the following NaN values:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


Let's start with the numeric fields. The first thing we notice is that `Age` has plenty of `NaN` values (177 in the training set). Since we want to drop neither the `Age` variable nor the observations with missing values, we will need to fill them in.

We can use the `fillna` method on the `Age` series to replace any `NaN` with a value of our choice. There are many strategies for this; for our purposes we will replace the missing values with the median age. The test set also has a `NaN` for `Fare`, so let's fix that in the same way.

### Preprocess: NaN

In [6]:
# Replace NaN values in Age (and in the test set's Fare) with the column median
train["Age"] = train["Age"].fillna(train["Age"].median())
test["Age"] = test["Age"].fillna(test["Age"].median())
test["Fare"] = test["Fare"].fillna(test["Fare"].median())
display(train.describe())

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.361582 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 13.019697 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 22.0 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 35.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
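
One design choice worth noting: the cell above fills the test set using the test set's own medians. A common alternative is to compute the statistic on the training set only and reuse it for both sets. The following is a minimal sketch of that idea on made-up toy frames; the `toy_train` and `toy_test` names and their values are illustrative and not taken from the Kaggle data.

```python
import numpy as np
import pandas as pd

# Toy frames with made-up ages (not the Kaggle data)
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 40.0]})

# Compute the median on the training set only; NaN values are skipped
train_median = toy_train["Age"].median()

# Reuse the same training-set median to fill both frames
toy_train["Age"] = toy_train["Age"].fillna(train_median)
toy_test["Age"] = toy_test["Age"].fillna(train_median)

print(train_median)
print(toy_test["Age"].tolist())
```

Whether this changes the results on the real data is not something this sketch demonstrates; it only shows the mechanics.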


That should do it for the numeric variables, but we still need to consider the non-numeric variables:
* `Name`
* `Sex`
* `Cabin`
* `Embarked`
* `Ticket`

We can't easily feed non-numeric columns into a machine learning algorithm, so we will have to convert the ones we want to use into dummy or categorical variables. `Name` is probably not useful for making predictions as it stands, so we can set it aside (although it might be possible to extract titles like "lord" from the names, which could be useful). `Cabin` has many `NaN` values (687 of 891 in the training set), so let's throw that out. `Ticket` doesn't appear to be useful without a better understanding of what the ticket numbers mean, so let's also get rid of that.

`Sex` is probably important given the "women and children first" convention, so let's convert it to a dummy variable: all values of `male` become `0` and all values of `female` become `1`.

### Preprocess: Categorical Variables

In [7]:
# Explore Sex
print(test["Sex"].unique())

# Convert male to 0
train.loc[train["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "male", "Sex"] = 0

# Convert female to 1
train.loc[train["Sex"] == "female", "Sex"] = 1
test.loc[test["Sex"] == "female", "Sex"] = 1

display(train['Sex'].head())

['male' 'female']


0    0
1    1
2    1
3    1
4    0
Name: Sex, dtype: object
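
Note that the output above reports `dtype: object`, because the integers were written into a column that originally held strings. As a sketch of an alternative, `Series.map` with a dictionary does the conversion in one step and yields an integer column. The toy Series below is made up for illustration.

```python
import pandas as pd

# Toy column standing in for the Sex variable
toy_sex = pd.Series(["male", "female", "female", "male"])

# Map each label to its code in a single pass
encoded_sex = toy_sex.map({"male": 0, "female": 1})

print(encoded_sex.tolist())
print(encoded_sex.dtype)
```

Any value missing from the dictionary becomes `NaN` under `map`, which makes unexpected categories easy to spot.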

`Embarked` is another interesting variable. It tells us which port a passenger embarked from. There are four unique values for `Embarked`:
* `S`: Departed from Southampton.
* `C`: Departed from Cherbourg.
* `Q`: Departed from Queenstown.
* `NaN`: Missing value.

Since `S` is the most common port, let's replace the `NaN` values with it and then convert the three port codes to numbers.

In [8]:
# Explore Embarked
print(train["Embarked"].unique())
print(train["Embarked"].value_counts(normalize = True))

# Fill NaN values with S
train["Embarked"] = train["Embarked"].fillna("S")
test["Embarked"] = test["Embarked"].fillna("S")

# Convert S = 0, C = 1, Q = 2
train.loc[train["Embarked"] == "S", "Embarked"] = 0
test.loc[test["Embarked"] == "S", "Embarked"] = 0

train.loc[train["Embarked"] == "C", "Embarked"] = 1
test.loc[test["Embarked"] == "C", "Embarked"] = 1

train.loc[train["Embarked"] == "Q", "Embarked"] = 2
test.loc[test["Embarked"] == "Q", "Embarked"] = 2

display(train['Embarked'].head())

['S' 'C' 'Q' nan]
S    0.724409
C    0.188976
Q    0.086614
Name: Embarked, dtype: float64


0    0
1    1
2    0
3    0
4    0
Name: Embarked, dtype: object
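
Encoding the ports as 0, 1, 2 implies an ordering that the ports do not really have. Tree-based models usually cope with this, but one-hot encoding is a common alternative. Here is a minimal sketch using `pd.get_dummies` on a made-up toy column; it is not what the notebook does.

```python
import pandas as pd

# Toy column with one missing port
toy_ports = pd.DataFrame({"Embarked": ["S", "C", None, "Q", "S"]})

# Fill the gap with the most common port, as in the notebook
toy_ports["Embarked"] = toy_ports["Embarked"].fillna("S")

# One indicator column per port, stored as 0/1 integers
port_dummies = pd.get_dummies(toy_ports["Embarked"], prefix="Embarked", dtype=int)

print(port_dummies)
```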

### Strategize

Now that our variables are cleaned up, let's look a bit closer at the data. For example, we might assume that women and children had a higher probability of survival.

In [9]:
# Explore survival in training set
print("Survival rate in training set: {:.0%}".format(train["Survived"].value_counts(normalize = True)[1]))

# Explore survival by sex in training set
print("Male survival rate in training set: {:.0%}".format(train["Survived"] \
                                                          [train["Sex"] == 0].value_counts(normalize=True)[1]))
print("Female survival rate in training set: {:.0%}".format(train["Survived"] \
                                                          [train["Sex"] == 1].value_counts(normalize=True)[1]))

# Explore survival by age in training set
print("Survival rate of those under 10: {:.0%}".format(train["Survived"] \
                                                       [train["Age"] < 10].value_counts(normalize=True)[1]))

Survival rate in training set: 38%
Male survival rate in training set: 19%
Female survival rate in training set: 74%
Survival rate of those under 10: 61%
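
Because `Survived` is coded 0/1, the mean of the column within a group is that group's survival rate, so the same kind of figures can be read off a `groupby`. The sketch below uses a made-up toy frame, so its rates do not match the real ones above.

```python
import pandas as pd

# Toy data: Sex uses the notebook's coding (0 = male, 1 = female)
toy = pd.DataFrame({
    "Sex": [0, 0, 0, 1, 1, 1, 1, 0],
    "Age": [30.0, 8.0, 45.0, 28.0, 5.0, 60.0, 33.0, 22.0],
    "Survived": [0, 1, 0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones in each group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# The same trick works with a boolean filter
rate_under_10 = toy.loc[toy["Age"] < 10, "Survived"].mean()

print(rate_by_sex)
print(rate_under_10)
```

Unlike indexing `value_counts(normalize=True)[1]`, the mean does not raise a `KeyError` when a group happens to contain no survivors.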


We can also create some of our own features. For example we can create a variable that tells us whether or not a passenger is a child (`Age < 10`) or how many people are in a passenger's family (`SibSp` + `Parch`).

In [10]:
# Create the variable Child and assign to 'NaN'
train["Child"] = float('NaN')
test["Child"] = float('NaN')

# Assign 1 to passengers under 10, 0 to those 10 or older.
train.loc[train["Age"] < 10, "Child"] = 1
train.loc[train["Age"] >= 10, "Child"] = 0
test.loc[test["Age"] < 10, "Child"] = 1
test.loc[test["Age"] >= 10, "Child"] = 0

# Create the variable FamilySize
train["FamilySize"] = train["SibSp"] + train["Parch"]
test["FamilySize"] = test["SibSp"] + test["Parch"]
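
The same two features can also be written as single vectorized expressions. This is a sketch on a made-up toy frame, not a replacement for the cell above.

```python
import pandas as pd

# Toy passengers with made-up values
toy = pd.DataFrame({
    "Age": [4.0, 10.0, 35.0],
    "SibSp": [1, 0, 1],
    "Parch": [2, 0, 0],
})

# A boolean comparison cast to int gives the 1/0 flag directly
toy["Child"] = (toy["Age"] < 10).astype(int)

# Siblings/spouses plus parents/children aboard
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

print(toy)
```

One difference from the cell above: a comparison against a missing age evaluates to `False`, so this version would label such rows 0 instead of leaving them `NaN`. That does not matter here because `Age` has already been filled.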

Recall that we didn't think `Name` would be very useful on its own, but we might be able to extract a passenger's title (Mr, Mrs, Master, etc.) from it. This could be a useful proxy for wealth and marital status.

In [11]:
# Import regular expression library
import re

# Build function to get titles
def get_title(name):
    # Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
titles = train["Name"].apply(get_title)
test_titles = test["Name"].apply(get_title)
display(titles.value_counts())

# Map each title to an integer.  Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7,
                 "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10,
                 "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 10}
for k,v in title_mapping.items():
    titles[titles == k] = v
    test_titles[test_titles == k] = v

# Verify that we converted everything.
display(titles.value_counts())

# Add in the title column.
train["Title"] = titles
test["Title"] = test_titles

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Major         2
Mlle          2
Col           2
Countess      1
Lady          1
Capt          1
Mme           1
Don           1
Sir           1
Ms            1
Jonkheer      1
Name: Name, dtype: int64

1     517
2     183
3     125
4      40
5       7
6       6
7       5
10      3
8       3
9       2
Name: Name, dtype: int64
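
As a sketch of a more compact route to the same feature, pandas can apply the regular expression to the whole column with `Series.str.extract` and then translate titles with `map`. The three names, the partial `toy_title_mapping`, and the choice of `0` as an "unmapped" code are assumptions made for illustration.

```python
import pandas as pd

# A few names in the dataset's "Last, Title. First" format
toy_names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Same pattern as get_title: a space, some letters, then a period
toy_titles = toy_names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Deliberately partial mapping; titles not listed fall back to 0
toy_title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3}
toy_codes = toy_titles.map(toy_title_mapping).fillna(0).astype(int)

print(toy_titles.tolist())
print(toy_codes.tolist())
```

With the loop used in the cell above, a title that is missing from the mapping stays in the column as a string. Here it shows up as an explicit code instead.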

Going even further, we might be able to combine `FamilySize` and a person's last name to determine which family a person belongs to. Giving each family a unique id could be a useful feature.

In [12]:
# Import operator library
import operator

# Initiate dictionary mapping family name to id
family_id_mapping = {}

# Build function to get the id
def get_family_id(row):
    # Find the last name by splitting on a comma
    last_name = row["Name"].split(",")[0]
    # Create the family id
    family_id = "{0}{1}".format(last_name, row["FamilySize"])
    # Look up the id in the mapping
    if family_id not in family_id_mapping:
        if len(family_id_mapping) == 0:
            current_id = 1
        else:
            # Get the maximum id from the mapping and add one to it if we don't have an id
            current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)
        family_id_mapping[family_id] = current_id
    return family_id_mapping[family_id]

# Get the family ids with the apply method
family_ids = train.apply(get_family_id, axis=1)
test_family_ids = test.apply(get_family_id, axis=1)

# There are a lot of family ids, so we'll compress all of the families under 3 members into one code.
family_ids[train["FamilySize"] < 3] = -1
test_family_ids[test["FamilySize"] < 3] = -1

# Print the count of each unique id.
display(family_ids.value_counts())

train["FamilyId"] = family_ids
test["FamilyId"] = test_family_ids

-1      800
 14       8
 149      7
 63       6
 50       6
 59       6
 17       5
 384      4
 27       4
 25       4
 162      4
 8        4
 84       4
 340      4
 43       3
 269      3
 58       3
 633      2
 167      2
 280      2
 510      2
 90       2
 83       1
 625      1
 376      1
 449      1
 498      1
 588      1
dtype: int64
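
The id assignment can also be sketched without a hand-maintained dictionary: `pd.factorize` numbers distinct keys in order of first appearance. The toy rows below are illustrative, and the variable names are not from the notebook.

```python
import pandas as pd

# Toy rows; FamilySize is SibSp + Parch as defined earlier
toy = pd.DataFrame({
    "Name": [
        "Andersson, Mr. Anders Johan",
        "Andersson, Miss. Erna",
        "Braund, Mr. Owen Harris",
        "Andersson, Miss. Ellis Anna Maria",
    ],
    "FamilySize": [6, 6, 1, 6],
})

# Key = last name + family size, e.g. "Andersson6"
last_names = toy["Name"].str.split(",").str[0]
family_keys = last_names + toy["FamilySize"].astype(str)

# factorize returns 0-based codes; add 1 so ids start at 1
toy_family_ids = pd.Series(pd.factorize(family_keys)[0] + 1, index=toy.index)

# Collapse small families into a single code, as in the notebook
toy_family_ids[toy["FamilySize"] < 3] = -1

print(toy_family_ids.tolist())
```

To keep ids consistent between the training and test sets, as the shared dictionary in the cell above does, the keys from both frames would have to be factorized together.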

Finally, let's see if we can create a variable that tells us whether the passenger is a mother or not. It could be argued that a mother with her child would have a higher probability of making it to a lifeboat. To filter for mothers, I will select:
* `Sex = 1` (Female)
* `Age > 18`
* `Parch > 0`
* `Title != 2` (Miss)

Mothers will be labeled 1 and not mothers will be labeled 0.

In [13]:
# Create the variable Mother and initialize it to 0 (not a mother)
train["Mother"] = 0
test["Mother"] = 0

# Assign 1 to mothers; everyone else keeps 0.
train.loc[(train["Sex"] == 1) & (train["Age"] > 18) & (train["Parch"] > 0) & (train["Title"] != 2), "Mother"] = 1
test.loc[(test["Sex"] == 1) & (test["Age"] > 18) & (test["Parch"] > 0) & (test["Title"] != 2), "Mother"] = 1
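
To make the four conditions listed above easy to check by eye, here is a small sketch on a made-up toy frame where each row fails a different condition. The codes follow the notebook's conventions (`Sex` 1 = female, `Title` 2 = Miss, 3 = Mrs, 1 = Mr).

```python
import pandas as pd

# One row per case: mother, too young, no children aboard, male
toy = pd.DataFrame({
    "Sex": [1, 1, 1, 0],
    "Age": [30.0, 16.0, 40.0, 45.0],
    "Parch": [1, 2, 0, 3],
    "Title": [3, 2, 3, 1],
})

# All four conditions must hold; every mask comes from the same frame
is_mother = (
    (toy["Sex"] == 1)
    & (toy["Age"] > 18)
    & (toy["Parch"] > 0)
    & (toy["Title"] != 2)
)
toy["Mother"] = is_mother.astype(int)

print(toy["Mother"].tolist())
```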

### Evaluation Metric
Before we begin building our model, we need to specify our evaluation metric. According to the Kaggle competition description, submissions are scored on accuracy: the percentage of correct predictions.

```
accuracy = (predictions == train["Survived"]).sum() / len(predictions)
```

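Since we'll be using `GridSearchCV` to optimize our classifier, let's score candidates during training with the F1 score, which balances precision and recall and so is more informative than accuracy when the classes are imbalanced (only 38% of the training passengers survived).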

In [14]:
# Import evaluation libraries
from sklearn.metrics import f1_score, make_scorer
from time import time

# Build scorer for GridSearch
scorer = make_scorer(f1_score)
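
To see how the two metrics can disagree, the sketch below computes accuracy and F1 by hand in plain Python on made-up labels. The helper names `accuracy` and `f1` are illustrative and not part of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 4 of 10 passengers survived
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# A model that predicts "nobody survived"
all_zero = [0] * 10
# A model that makes one false positive and one false negative
mixed = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

print(accuracy(y_true, all_zero), f1(y_true, all_zero))
print(accuracy(y_true, mixed), f1(y_true, mixed))
```

The "nobody survived" model still reaches 60% accuracy on these labels, while its F1 score is 0 because it never identifies a survivor.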

### Build Model: Random Forest
We will be using a random forest classifier to model survival. A random forest builds a diverse set of classifiers by introducing randomness into their construction: it fits a number of decision trees on various sub-samples of the dataset and averages their predictions to improve accuracy and control over-fitting.

Decision Trees use simple decision rules inferred from the data to predict the value of the target variable. This makes sense for our case since we can imagine some of these simple rules ourselves. For example, if age is less than 10 or if the passenger is female, we could predict survived.
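
The hand-written rule mentioned above can itself serve as a baseline to compare a model against. Here is a minimal sketch on a made-up toy frame; the resulting accuracy says nothing about the real data.

```python
import pandas as pd

# Toy passengers; Sex uses the notebook's coding (1 = female)
toy = pd.DataFrame({
    "Sex": [0, 1, 0, 1, 0],
    "Age": [40.0, 30.0, 6.0, 8.0, 25.0],
    "Survived": [0, 1, 1, 1, 1],
})

# Rule: predict survival for females and for anyone under 10
rule_pred = ((toy["Sex"] == 1) | (toy["Age"] < 10)).astype(int)

# Share of toy passengers the rule classifies correctly
rule_accuracy = (rule_pred == toy["Survived"]).mean()

print(rule_pred.tolist())
print(rule_accuracy)
```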

In [17]:
# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV

# Initialize predictors, features, and labels
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked",
              "FamilySize", "Child", "Title", "FamilyId", "Mother"]
features = train[predictors]
labels = train["Survived"]

# Initialize classifier and parameters
clf = RandomForestClassifier()
params = {"n_estimators": [50, 100, 150],
          "criterion": ['gini', 'entropy'],
          "max_features": [None, 'sqrt'],  # 'auto' was removed in newer scikit-learn; 'sqrt' is its equivalent for classifiers
          "max_depth": [None, 24, 25, 26],
          "min_samples_split": [10, 15, 20],
          "min_samples_leaf": [1, 2, 5]}

# Initialize cross validation
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.20)

# Build GridSearch and fit
clf = GridSearchCV(clf, params, scoring=scorer, cv=cv)
clf.fit(features, labels)

print("The best parameters are {} with a score of {:.0%}".format(clf.best_params_, clf.best_score_))

The best parameters are {'criterion': 'entropy', 'max_depth': 24, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 15, 'n_estimators': 100} with a score of 77%
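
With the default `refit=True`, `GridSearchCV` retrains the best configuration on all the data passed to `fit` and exposes it as `best_estimator_`, so the winning parameters do not have to be copied by hand. The sketch below shows this on synthetic data from `make_classification`, with a deliberately small grid to keep it fast. The data, grid values, and `random_state` settings are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic binary classification data (not the Titanic data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Fixing random_state makes the splits and forests reproducible
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20], "min_samples_split": [2, 10]},
    scoring="f1",
    cv=cv,
)
search.fit(X, y)

# The refitted best model is ready to predict
best_model = search.best_estimator_
sample_preds = best_model.predict(X[:5])

print(search.best_params_)
print(sample_preds)
```

Setting `random_state` on both the splitter and the classifier is what makes repeated runs return the same best parameters.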


Now that we have our optimized classifier, we can work on our submission to Kaggle. Kaggle requires a csv submission with two columns (PassengerId, Survived).

In [18]:
# Manually create the classifier with the best parameters reported above
clf = RandomForestClassifier(n_estimators=100, criterion='entropy', max_features=None,
                             max_depth=24, min_samples_split=15, min_samples_leaf=1)

# Initialize our inputs using the whole training set this time
X_train = train[predictors]
y_train = train["Survived"]
X_test = test[predictors]

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions using the test set
predictions = clf.predict(X_test)

# Create a new dataframe with only the columns Kaggle wants from the dataset
submission = pd.DataFrame({"PassengerId": test["PassengerId"], 
                           "Survived": predictions})

# Export to csv
submission.to_csv("data/titanic_submission_rf.csv", index=False)
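
Before uploading, it can help to check that the file has the shape Kaggle expects. The sketch below round-trips a made-up three-row submission through an in-memory buffer; the `toy_submission` values are illustrative. On the real file the row count should equal the 418 rows of the test set.

```python
import io

import pandas as pd

# Toy submission standing in for the real predictions
toy_submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Exactly the two required columns, in order
assert list(check.columns) == ["PassengerId", "Survived"]
# Predictions are binary and each passenger appears once
assert check["Survived"].isin([0, 1]).all()
assert check["PassengerId"].is_unique

print(check.shape)
```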