# Titanic: Machine Learning From Disaster
This notebook loads and summarizes the dataset used in the Titanic Kaggle competition. It then uses scikit-learn's random forest model to classify whether or not an individual survived. We also run a grid search to tune the hyperparameters of the model.

In [1]:
import matplotlib
import sklearn
import matplotlib.pyplot as plt
print(f'matplotlib: {matplotlib.__version__}')
print(f'sklearn   : {sklearn.__version__}')

matplotlib: 3.0.3
sklearn   : 0.21.3


## Load the data
The first step is to load the data, get a feeling for the different data types, clean up the null values, and ultimately get it into a state that our model can use.

In [2]:
import pandas as pd
print(f'pandas version: {pd.__version__}')
test = pd.read_csv("../input/titanic/test.csv")
train = pd.read_csv("../input/titanic/train.csv")

pandas version: 0.25.3


In [3]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [4]:
train.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

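Looking at these types, we need to manipulate the columns in a few ways. Specifically we will:
* **One-hot encode:**
    * 'Sex', 'Embarked'
* **Drop:**
    * 'Name', 'Ticket', 'Cabin'
* **Fill null values with the column mean:**
    * 'Age' (only 714 of 891 training rows have a value), plus any other numeric column with nulls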

In [6]:
# For now, drop columns that would take much more work to get into a usable format
def format_data(data):
    # One-hot encode sex & port of embarkation
    data = pd.get_dummies(data, columns=['Sex','Embarked'])
    # Drop columns too complicated for this very simple trial
    data = data.drop(['Name','Ticket','Cabin'], axis=1)
    # Fill null values with the mean of each column
    data = data.fillna(data.mean())
    if 'Survived' in data.columns:
        data_y = data['Survived']
        data_x = data.drop(['Survived'], axis=1)
        return data_x, data_y
    else:
        return data

train_x, train_y = format_data(train)
test_x = format_data(test)

# Summarize the formatted training features
train_x.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,2.308642,29.699118,0.523008,0.381594,32.204208,0.352413,0.647587,0.188552,0.08642,0.722783
std,257.353842,0.836071,13.002015,1.102743,0.806057,49.693429,0.47799,0.47799,0.391372,0.281141,0.447876
min,1.0,1.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,223.5,2.0,22.0,0.0,0.0,7.9104,0.0,0.0,0.0,0.0,0.0
50%,446.0,3.0,29.699118,0.0,0.0,14.4542,0.0,1.0,0.0,0.0,1.0
75%,668.5,3.0,35.0,1.0,0.0,31.0,1.0,1.0,0.0,0.0,1.0
max,891.0,3.0,80.0,8.0,6.0,512.3292,1.0,1.0,1.0,1.0,1.0
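
Note that `format_data` encodes and imputes each split on its own, so the test frame is filled with test-set means and would be missing a dummy column for any category that never occurs in it. The sketch below shows one possible alternative: reuse the training columns and training means for both splits. The toy frames and their values are made up for illustration and are not the Titanic data.

```python
import pandas as pd

# Toy stand-ins for the train/test frames (hypothetical values)
toy_train = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Embarked": ["S", "C", "Q"],
})
toy_test = pd.DataFrame({
    "Age": [None, 40.0],
    "Embarked": ["S", "S"],
})

train_enc = pd.get_dummies(toy_train, columns=["Embarked"], dtype=int)
test_enc = pd.get_dummies(toy_test, columns=["Embarked"], dtype=int)

# Give the test frame exactly the training columns; categories absent from
# the test split become all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

# Impute both splits with statistics computed on the training split only
train_means = train_enc.mean()
train_enc = train_enc.fillna(train_means)
test_enc = test_enc.fillna(train_means)

print(test_enc)
```

In this toy case the missing test age is filled with 26.0, the mean of the two known training ages.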


## Build the model
In this section, we'll build a random forest model and see how well it classifies survival.

In [7]:
# Import the models
from sklearn.ensemble import RandomForestClassifier

# This is the simplest random forest model we can build: default hyperparameters
model = RandomForestClassifier(random_state=1)
model.fit(train_x, train_y);



However, this model achieves the following accuracy (as a fraction of correct predictions):
* Train: 0.98092
* Test : 0.77033 <- Not great...

The large gap between the two suggests overfitting. What we need to do is split our training data into training and testing subsets so that we can measure how well the model does on data it has not seen.

In [8]:
# Let's try splitting the data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_x, train_y, 
                                                    test_size=0.2, 
                                                    random_state=42)
# Now retrain our model on the training subset only
model.fit(X_train, y_train)

# Print some statistics
from sklearn.metrics import f1_score

def summary_stats(x, y):
    # Score the current global `model` on the given features and labels
    pred = model.predict(x)
    f1   = f1_score(y, pred)   # signature is f1_score(y_true, y_pred)
    acc  = model.score(x, y)
    print(f"   F1 score: {f1}")
    print(f"   Accuracy: {acc}")
print("Training:")
summary_stats(X_train, y_train)
print("Testing:")
summary_stats(X_test, y_test)

Training:
   F1 score: 0.9770992366412213
   Accuracy: 0.9831460674157303
Testing:
   F1 score: 0.7352941176470588
   Accuracy: 0.7988826815642458


Classic overfitting: near-perfect scores on the training subset, much lower scores on the held-out subset. So, let's try to tune some hyperparameters:
* n_estimators
* max_features
* criterion

In [9]:
# Create the values that we will be testing
search_pars = {
    'n_estimators': [10, 30, 100, 300, 1000],
    'max_features': [0.25, 0.5, 0.75, 1.0],
    'criterion'   : ['gini', 'entropy'] 
}
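
To get a feel for the cost of this search before running it, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. This is a small sketch; the fold count of 3 is an assumption taken from the three `split*_test_score` columns in the results table further down.

```python
from sklearn.model_selection import ParameterGrid

search_pars = {
    'n_estimators': [10, 30, 100, 300, 1000],
    'max_features': [0.25, 0.5, 0.75, 1.0],
    'criterion'   : ['gini', 'entropy'],
}

# One candidate per combination of values: 5 * 4 * 2
n_candidates = len(ParameterGrid(search_pars))

# Assumption: 3 cross-validation folds, as in the results table
n_folds = 3

print(f"candidates : {n_candidates}")            # 40
print(f"model fits : {n_candidates * n_folds}")  # 120, plus one final refit
```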

In [10]:
from sklearn.model_selection import GridSearchCV

# Construct the model
rf_model = RandomForestClassifier(random_state=1)
clf      = GridSearchCV(rf_model, search_pars)
clf.fit(X_train, y_train);



In [11]:
# Get the results
print(clf.best_score_)
print(clf.best_params_)
tune_results = pd.DataFrame(clf.cv_results_)
tune_results.sort_values('rank_test_score')

0.8089887640449438
{'criterion': 'entropy', 'max_features': 0.5, 'n_estimators': 100}


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_max_features,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
27,0.292389,0.004802,0.018408,0.00057,entropy,0.5,100,"{'criterion': 'entropy', 'max_features': 0.5, ...",0.815126,0.793249,0.818565,0.808989,0.011206,1
31,0.101547,0.002016,0.007472,0.00025,entropy,0.75,30,"{'criterion': 'entropy', 'max_features': 0.75,...",0.815126,0.78903,0.818565,0.807584,0.013181,2
29,2.87969,0.036136,0.166193,0.002295,entropy,0.5,1000,"{'criterion': 'entropy', 'max_features': 0.5, ...",0.815126,0.801688,0.805907,0.807584,0.005614,2
11,0.084573,0.001236,0.006869,7.8e-05,gini,0.75,30,"{'criterion': 'gini', 'max_features': 0.75, 'n...",0.810924,0.797468,0.814346,0.807584,0.007281,2
28,0.865682,0.004715,0.053123,0.000921,entropy,0.5,300,"{'criterion': 'entropy', 'max_features': 0.5, ...",0.810924,0.805907,0.805907,0.807584,0.002367,2
0,0.029291,0.000643,0.00451,0.000552,gini,0.25,10,"{'criterion': 'gini', 'max_features': 0.25, 'n...",0.802521,0.805907,0.805907,0.804775,0.001597,6
13,0.83238,0.003281,0.053001,0.002937,gini,0.75,300,"{'criterion': 'gini', 'max_features': 0.75, 'n...",0.806723,0.793249,0.814346,0.804775,0.008717,6
24,2.587147,0.019926,0.169088,0.002546,entropy,0.25,1000,"{'criterion': 'entropy', 'max_features': 0.25,...",0.798319,0.78903,0.822785,0.803371,0.014228,8
26,0.091571,0.001217,0.007492,0.000234,entropy,0.5,30,"{'criterion': 'entropy', 'max_features': 0.5, ...",0.806723,0.78903,0.814346,0.803371,0.010598,8
33,0.967501,0.011561,0.052093,0.001618,entropy,0.75,300,"{'criterion': 'entropy', 'max_features': 0.75,...",0.815126,0.78903,0.805907,0.803371,0.010807,8
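
The mean scores in this table are very close together compared with their fold-to-fold spread. One way to read such a table, sketched below, is to keep every candidate whose mean score lies within one standard deviation of the best and then prefer the cheapest one to fit. The numbers are copied from the ten rows shown above; the one-standard-deviation rule is an illustrative heuristic, not something the grid search applies itself.

```python
import pandas as pd

# The ten rows shown above, copied from the cv_results_ table
rows = pd.DataFrame(
    [
        (27, "entropy", 0.50, 100, 0.292389, 0.808989, 0.011206),
        (31, "entropy", 0.75, 30, 0.101547, 0.807584, 0.013181),
        (29, "entropy", 0.50, 1000, 2.879690, 0.807584, 0.005614),
        (11, "gini", 0.75, 30, 0.084573, 0.807584, 0.007281),
        (28, "entropy", 0.50, 300, 0.865682, 0.807584, 0.002367),
        (0, "gini", 0.25, 10, 0.029291, 0.804775, 0.001597),
        (13, "gini", 0.75, 300, 0.832380, 0.804775, 0.008717),
        (24, "entropy", 0.25, 1000, 2.587147, 0.803371, 0.014228),
        (26, "entropy", 0.50, 30, 0.091571, 0.803371, 0.010598),
        (33, "entropy", 0.75, 300, 0.967501, 0.803371, 0.010807),
    ],
    columns=["index", "criterion", "max_features", "n_estimators",
             "mean_fit_time", "mean_test_score", "std_test_score"],
).set_index("index")

# Best row and the score threshold one standard deviation below it
best = rows.loc[rows["mean_test_score"].idxmax()]
threshold = best["mean_test_score"] - best["std_test_score"]

# Candidates within one standard deviation of the best mean score
close = rows[rows["mean_test_score"] >= threshold]

# Among those, the candidate that is fastest to fit
cheapest = close.sort_values("mean_fit_time").iloc[0]

print(f"{len(close)} of {len(rows)} rows are within one std of the best")
print(cheapest[["criterion", "max_features", "n_estimators"]].to_dict())
```

For the rows shown, all ten pass the threshold, and the cheapest is the `gini` forest with `max_features=0.25` and only 10 trees, so the ranking alone is weak evidence that the top setting is really better.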


In [12]:
# Get the best model and re-fit it on all our training data
model = clf.best_estimator_
model.fit(train_x, train_y);

## Validation
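Now, let's look at how our best model scores on the full training set. Because the model was just re-fit on this same data, these are training scores rather than an independent validation, so a perfect result here says little about how well the model generalizes.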

In [13]:
# Now generate some statistics
print("Final model training results:")
summary_stats(train_x, train_y)

Final model training results:
   F1 score: 1.0
   Accuracy: 1.0
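
A more informative number comes from scoring the tuned model on the held-out split before it is re-fit on everything. The sketch below illustrates the idea on synthetic data from `make_classification`, which stands in for the Titanic features (an assumption made so the example runs on its own); the deliberately small grid and the names `X_tr` and `X_ho` are also illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Titanic features (assumption, not the real data)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.2, random_state=42)

# Deliberately small grid so the sketch runs quickly
grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    {"n_estimators": [10, 30], "max_features": [0.5, 1.0]},
    cv=3,
)
grid.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is refit on X_tr only,
# so X_ho is still unseen data at this point
train_acc = grid.best_estimator_.score(X_tr, y_tr)
holdout_acc = grid.best_estimator_.score(X_ho, y_ho)

print(f"train accuracy   : {train_acc:.3f}")
print(f"hold-out accuracy: {holdout_acc:.3f}")
```

In this notebook the equivalent check would be calling `summary_stats(X_test, y_test)` on `clf.best_estimator_` before the re-fit in cell 12.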


## Submit
We now have a very basic model whose hyperparameters we have tuned with a grid search. The last step is to make predictions on the test data and submit them!

In [14]:
# Predictions on test data
pred_test = model.predict(test_x)

In [15]:
submission = pd.DataFrame({"PassengerId": test_x['PassengerId'], 
                           "Survived":pred_test})
submission.describe()

Unnamed: 0,PassengerId,Survived
count,418.0,418.0
mean,1100.5,0.299043
std,120.810458,0.458387
min,892.0,0.0
25%,996.25,0.0
50%,1100.5,0.0
75%,1204.75,1.0
max,1309.0,1.0


In [16]:
submission.to_csv('submission.csv', index=False)
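
Before uploading, it can be worth checking that the frame has the shape the competition expects: one row per test passenger, a `PassengerId` column, and a binary `Survived` column. The helper below is a hypothetical sketch, not part of pandas or Kaggle's tooling, and is shown on a toy frame.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Hypothetical helper: basic sanity checks on a submission frame."""
    assert list(df.columns) == ["PassengerId", "Survived"], "unexpected columns"
    assert len(df) == expected_rows, "unexpected row count"
    assert df["PassengerId"].is_unique, "duplicate passenger ids"
    assert set(df["Survived"].unique()) <= {0, 1}, "predictions must be 0 or 1"
    return True

# Toy frame standing in for `submission`
toy = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
print(check_submission(toy, expected_rows=3))
```

For the real frame the call would be `check_submission(submission, expected_rows=len(test))`, which is 418 rows here.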