# Titanic: Machine Learning From Disaster
### by Sung Ahn and Abdul Saleh

<hr>
## Introduction
In this project, we use random forest and gradient boosting machine learning algorithms to predict who survived the sinking of the RMS Titanic. Along the way, we go through the whole data science process, from understanding the problem and getting the data to fine-tuning our models and visualizing our results.

The Titanic dataset is perhaps the most widely analyzed dataset of all time, and there is a wealth of incredible tutorials online exploring different approaches to it. So in our own analysis, we draw on the experiences of the huge community of amazing people who have already attempted this problem and shared their conclusions online. We would especially like to thank [Manav Sehgal](https://www.kaggle.com/startupsci/titanic-data-science-solutions), [Jeff Delaney](https://www.kaggle.com/jeffd23/scikit-learn-ml-from-start-to-finish), and [Ahmed Besbes](https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html), without whose insight this project would not have become a reality.

<hr>
## Outline 
1. Understanding the problem
2. Getting the data
3. Exploring the data 
4. Picking a machine learning algorithm 
5. Preparing the data for machine learning algorithms
6. Training algorithm and fine-tuning model
7. Visualizing results and presenting solution 


<hr>
## Understanding the problem
Before we dive into the data analysis and algorithms, we first ask ourselves: is this even a problem that can be solved with machine learning? <br>
Luckily for us, lots of books have been written about the sinking of the Titanic. So before we look at the data, we do some background reading and discover that some patterns might exist, most notably: 

1. Women and children generally got first priority on the lifeboats
    - This tells us that we should look for age and sex information in the data
2. 
    -
3. There was a lot of confusion during the sinking of the ship and people chose to stay on the Titanic for arbitrary reasons. 
    - There are definitely anomalies in this dataset, because it is clear that some people made choices that no pattern can explain

Aha, so it seems like there are some patterns that can help us figure out who survived and who didn't. This looks like a great machine learning problem!

<hr>
## Getting the data
Kaggle, a platform for data science competitions, has kindly compiled a dataset that is perfect for our needs and put it on their [website](https://www.kaggle.com/c/titanic/data) for budding machine learning enthusiasts to use. The training dataset tells us who survived and who didn't so we can use it to train our model. The test set doesn't tell us the fate of the passengers - that's what we're supposed to predict!

After downloading the datasets, we import the data and a few libraries that we will use later on.  

In [1]:
# for data analysis and wrangling
import pandas as pd
import numpy as np
from fancyimpute import KNN
from math import ceil

# for visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# for machine learning
#from sklearn.ensemble import RandomForestClassifier

# import data 
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")



<hr>
## Exploring the data

Now let's take a look at the data:

In [3]:
display(train_data.head())
print('_'*125)
display(test_data.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


_____________________________________________________________________________________________________________________________


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


#### What do **Pclass**, **SibSp**, and **Parch** mean? <br>
According to Kaggle, **Pclass** = passenger class, **SibSp** = number of siblings/spouses aboard the Titanic, **Parch** = number of parents/children aboard the Titanic.
<br>

#### What are the important feature types? 
- Categorical features: **Survived, Sex, Embarked, Pclass**
- Numerical features:
  - Discrete: **SibSp, Parch**
  - Continuous : **Fare, Age**
- Alphanumeric features: **Cabin, Ticket**


In [None]:
# to find out size of data
print("training data dimensions:", train_data.shape)

In [None]:
# What are the data types? Are there missing values?
train_data.info()
print('_'*125)
test_data.info()

#### What are the missing values? 
- From training set:
    - 687 missing **Cabin** values
    - 177 missing **Age** values
    - 2 missing **Embarked** values
- From test set:
    - 327 missing **Cabin** values
    - 86 missing **Age** values
    - 1 missing **Fare** value
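
The counts above can also be read off directly instead of from the `info()` printout. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the Titanic data); the same two lines should work on `train_data` and `test_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column, then keep only the columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Sorting the result puts the columns that need the most attention (here **Cabin**) at the top.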

In [None]:
# Summarize integer and float type features
train_data.describe()
# Show more details about a specific feature, for example:
#train_data["Pclass"].value_counts(normalize=True)

#### What do numerical features tell us?
- The sample's survival rate is ~38%, which is a bit higher than the actual 32%
- Most passengers on board were in 3rd class, while fewer than 25% were in 1st class
- About 75% of passengers were 38 or younger, and the mean age was about 30
- More than 75% of passengers did not travel with their kids or their parents
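
Each of these observations corresponds to a one-line calculation. The sketch below uses a small invented frame (not the real data) to show the idea: the mean of a 0/1 column is a rate, and `value_counts(normalize=True)` gives the share of each class.

```python
import pandas as pd

# Invented rows standing in for train_data
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 3, 2],
    "Parch":    [0, 0, 1, 0, 0],
})

# Mean of a 0/1 column is the fraction of ones, i.e. the survival rate
survival_rate = toy["Survived"].mean()

# Share of passengers in each class
class_share = toy["Pclass"].value_counts(normalize=True)

# Fraction of passengers travelling without parents or children
no_parch_share = (toy["Parch"] == 0).mean()

print(survival_rate, class_share[3], no_parch_share)
```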


In [None]:
# Summarize object type features
train_data.describe(include=["O"])

### What this tells us

Hmm, it looks like we have a couple of missing embarkation ports, some missing ages, and lots of missing cabin numbers. We will have to figure out ways to deal with those later on.

<hr>
## Picking an algorithm

<hr>
## Preparing the data for machine learning algorithms
First we start by combining the training dataset and the test dataset so that we can edit them both together and ensure they end up in the same format. 


In [4]:
# Store the survival data
targets = train_data.Survived

# Drop survival data so training set and test set have the same columns and can be combined
train_data_dropped = train_data.drop(columns=["Survived"])

# Combine train and test data
combined_data = pd.concat([train_data_dropped, test_data])

combined_data.shape

(1309, 11)

The training data had 891 entries and the test data had 418 entries. $418 + 891 = 1309$ entries, so this looks good!

### Extracting passenger titles: 

If you look back at the data you will notice that passenger titles are always preceded by a comma and followed by a period. So we can create a function that splits the Name value at the comma and at the period to get the title.

In [5]:
# Extract titles from names and place them in a new column
combined_data["Title"] = combined_data["Name"].map(lambda name: name.split(",")[1].split(".")[0].strip())

# Drop names column because we no longer need it
#combined_data.drop(["Name"], 1, inplace=True)

# Show all different values in the Title column
combined_data.Title.value_counts()

Mr              757
Miss            260
Mrs             197
Master           61
Dr                8
Rev               8
Col               4
Ms                2
Mlle              2
Major             2
Dona              1
Lady              1
the Countess      1
Capt              1
Jonkheer          1
Mme               1
Sir               1
Don               1
Name: Title, dtype: int64
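
The split-based lambda assumes every name contains a comma and a period. As a hedged alternative, here is a small sketch using a regular expression that returns `None` instead of raising an error when a name does not follow the pattern. The helper name `extract_title` is our own and is not part of the notebook.

```python
import re

# Matches the text between the first comma and the next period
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

def extract_title(name):
    """Return the title in a 'Surname, Title. Given names' string, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
]
titles = [extract_title(name) for name in names]
print(titles)
```

In pandas, the same pattern could be passed to `Series.str.extract`, which keeps the operation vectorized.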

In [None]:
#back up copy
back_up2 = combined_data.copy()

Looks like it worked!
<br>
We can also group similar titles together to simplify our model and reduce the risk of "overfitting." Overfitting happens when a model is so complex that it becomes very good at predicting outcomes for the training data but not so good at predicting outcomes for unseen data. For example, it would be nice if our model knew how to predict outcomes for countesses, but chances are there weren't that many countesses on the Titanic, so it is better to group royalty and nobility together so that our model generalizes better when looking at new data.

In [6]:
Title_dict = {
    "Mr":"Mr",
    "Miss":"Ms",
    "Mrs":"Mrs",
    "Master":"Master",
    "Dr":"Other",
    "Rev":"Other",
    "Col":"Military",
    "Ms":"Ms",
    "Mlle":"Ms",
    "Major":"Military",
    "Don":"Royal or Noble",
    "the Countess":"Royal or Noble",
    "Lady":"Royal or Noble",
    "Dona":"Royal or Noble",
    "Sir":"Royal or Noble",
    "Mme":"Mrs",
    "Jonkheer":"Royal or Noble",
    "Capt":"Other"
}

# Group titles using mappings in dictionary above
combined_data["Title"] = combined_data["Title"].map(Title_dict) 

combined_data["Title"].value_counts()

Mr                757
Ms                264
Mrs               198
Master             61
Other              17
Military            6
Royal or Noble      6
Name: Title, dtype: int64
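
One thing to keep in mind: `Series.map` with a dictionary turns any title that is missing from the dictionary into `NaN`. That does not happen here because every title in the combined data is listed, but a new dataset could contain others. A minimal sketch of a fallback, using a shortened copy of the dictionary and a hypothetical helper `group_title`:

```python
# Shortened copy of Title_dict, for illustration only
title_groups = {
    "Mr": "Mr",
    "Miss": "Ms",
    "Mlle": "Ms",
    "Mme": "Mrs",
    "the Countess": "Royal or Noble",
}

def group_title(title, default="Other"):
    """Map a raw title to its group, falling back to a default for unseen titles."""
    return title_groups.get(title, default)

print(group_title("Mlle"))   # a title that is in the dictionary
print(group_title("Baron"))  # a title that is not in the dictionary
```

With pandas, the equivalent would be to chain `.fillna("Other")` after the `.map(...)` call.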

### Processing passenger ages:

If you recall from the data exploration step, there were 177 and 86 missing age values in the training set and the test set, respectively. We know that age is an important factor in determining survival, so we need to come up with a way to fill in the missing ages.

Here we are going to group people by their gender, class, and title and then use these groupings to determine the missing values.

In [7]:
# Select train data and group by Sex, class, and title in that order
grouped_train_data = combined_data.head(891).groupby(["Sex","Pclass", "Title"])

# Select test data and group by Sex, class, and title in that order
grouped_test_data = combined_data.iloc[891:].groupby(["Sex", "Pclass", "Title"])

# Find and display medians 
display(grouped_train_data.median())
print('_'*125)
display(grouped_test_data.median())

Sex,Pclass,Title,PassengerId,Age,SibSp,Parch,Fare
female,1,Mrs,499.0,40.0,1.0,0.0,79.2
female,1,Ms,369.0,30.0,0.0,0.0,88.25
female,1,Other,797.0,49.0,0.0,0.0,25.9292
female,1,Royal or Noble,658.5,40.5,0.5,0.0,63.05
female,2,Mrs,438.0,32.0,1.0,0.0,26.0
female,2,Ms,444.0,24.0,0.0,0.0,13.0
female,3,Mrs,405.5,31.0,1.0,1.0,15.975
female,3,Ms,372.0,18.0,0.0,0.0,8.75625
male,1,Master,446.0,4.0,1.0,2.0,120.0
male,1,Military,592.5,54.0,0.0,0.0,28.525


_____________________________________________________________________________________________________________________________


Sex,Pclass,Title,PassengerId,Age,SibSp,Parch,Fare
female,1,Mrs,1076.0,48.0,1.0,0.0,63.3583
female,1,Ms,1074.0,32.0,0.0,0.0,158.20835
female,1,Royal or Noble,1306.0,39.0,0.0,0.0,108.9
female,2,Mrs,1123.5,29.0,0.0,0.0,26.0
female,2,Ms,1121.0,19.5,1.0,1.0,24.5
female,3,Mrs,1051.0,28.0,1.0,1.0,14.4542
female,3,Ms,1089.0,22.0,0.0,0.0,7.8792
male,1,Master,1022.0,9.5,1.0,2.0,198.4375
male,1,Military,1058.5,50.0,0.5,0.0,128.0125
male,1,Mr,1102.0,42.0,0.0,0.0,50.2479


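So the function we want to create to fill in the missing ages first checks the passenger's gender, then their class, then their title, and uses that information to determine what age to give them. Within each such group, it estimates the missing ages from passengers with similar fares using k-nearest neighbours (KNN). If we haven't seen that title before, we should just plug in the median age.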

 <div class="alert alert-block alert-warning">Note that we have to be super careful not to introduce any information from the test data into the training data. The point of a predictive machine learning model is to make accurate predictions about new data that is *unseen* during the training.</div>
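
To make the warning concrete, here is a simplified sketch of leak-free imputation. It is not the KNN approach used in this notebook: it swaps in plain group medians, learned from the training rows only, and applies them to both sets. The frames, the `GroupMedianAge` column, and the `fill_ages` helper are all invented for illustration.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1],
    "Title":  ["Mr", "Mr", "Mr", "Mrs", "Mrs"],
    "Age":    [20.0, 30.0, np.nan, 40.0, 50.0],
})
test = pd.DataFrame({
    "Sex":    ["male", "female", "male"],
    "Pclass": [3, 1, 1],
    "Title":  ["Mr", "Mrs", "Military"],
    "Age":    [np.nan, np.nan, np.nan],
})

keys = ["Sex", "Pclass", "Title"]

# Statistics are learned from the training rows only
group_medians = (train.groupby(keys)["Age"].median()
                      .rename("GroupMedianAge").reset_index())
overall_median = train["Age"].median()

def fill_ages(frame):
    """Fill missing ages with the training group median, then the overall training median."""
    merged = frame.merge(group_medians, on=keys, how="left")
    filled = merged["Age"].fillna(merged["GroupMedianAge"]).fillna(overall_median)
    return frame.assign(Age=filled.values)

train_filled = fill_ages(train)
test_filled = fill_ages(test)
print(test_filled)
```

The third test passenger belongs to a group that never appears in the training data, so he receives the overall training median, which is the fallback described above.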

In [None]:
# Function to round approximations of missing ages to nearest 0.5             
def round_age(age):
    return round(age * 2) / 2


# Function that fills in missing ages
def age_filler(incomplete_data, grouped_data, Training=True, size=0):
    # The combined data has duplicate index labels (train and test both start at 0),
    # so reset the index to give each passenger a unique row label
    if not Training:
        incomplete_data = incomplete_data.reset_index()
    
    for row in grouped_data:
        try:
            # This is a data frame that has the ages and fares of passengers sharing the same gender, class, and title
            subset_incomplete_data = incomplete_data.loc[(incomplete_data["Sex"]==row[0][0]) 
                                                        & (incomplete_data["Pclass"]==row[0][1]) 
                                                        & (incomplete_data["Title"]==row[0][2]), 
                                                        (["Age", "Fare"])]
        # Skip to next group if no values are found
        except KeyError:
            continue
            
        # List of global indexes of passengers in subset with missing ages
        missing_ages_global_index = subset_incomplete_data.index[subset_incomplete_data["Age"].isnull()].tolist()
        missing_ages_global_index.sort()

        # Array of local indexes of passengers in subset with missing ages
        missing_ages_local_index = np.where(subset_incomplete_data["Age"].isnull())[0]
        missing_ages_local_index.sort()

        # Get number of passengers in the subset, use this number to calculate number of KNN neighbours 
        subset_incomplete_data_size = subset_incomplete_data.shape[0]
        
        try:
            # Use KNN to fill in missing ages based on similarities between passengers' fares
            # Returns an unindexed but ordered numpy array with the Age and Fare columns
            # (newer fancyimpute releases expose this method as fit_transform instead of complete)
            subset_complete_data = KNN(k=ceil(subset_incomplete_data_size*0.05)).complete(subset_incomplete_data)
        # Handles case when there are no missing values
        except ValueError:
            continue
            
        # Iterate over estimated ages in subset and place them back in the original dataset
        # Use indexes to match values from subset to original data set
        for passenger_local_index, passenger_global_index in zip(missing_ages_local_index, missing_ages_global_index):
            passenger_estimated_age = round_age(subset_complete_data[passenger_local_index][0])
            incomplete_data.at[passenger_global_index, "Age"] = passenger_estimated_age
    # Handles test set
    if not Training:
        return incomplete_data.iloc[size:]
    # Handles training set
    else:
        return incomplete_data

train_data = combined_data.head(891)

# Estimate missing ages in training set, pass in a copy so that original data set is not changed
train_data_filled = age_filler(train_data.copy(), grouped_train_data)

# Estimate missing ages in test set
test_data_filled = age_filler(combined_data.copy(), grouped_train_data, Training = False, size = 891)

# Combining training and test set after filling in all ages
# (sort=True keeps the alphabetical column order shown in the output below)
combined_data = pd.concat([train_data_filled, test_data_filled], sort=True).reset_index()

In [9]:
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
level_0        1309 non-null int64
Age            1309 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Ticket         1309 non-null object
Title          1309 non-null object
index          418 non-null float64
dtypes: float64(3), int64(5), object(6)
memory usage: 143.2+ KB


### Processing Fare and Embarked features

In this data set we have one missing Fare value and two missing Embarked values, so let's just fill them in directly.
<br><br>
First: let's look at the passenger with a missing fare

In [10]:
missing_fare_index = combined_data.index[combined_data["Fare"].isnull()]
combined_data.iloc[missing_fare_index]

Unnamed: 0,level_0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Ticket,Title,index
1043,1043,60.5,,S,,"Storey, Mr. Thomas",0,1044,3,male,0,3701,Mr,152.0


This passenger is a male in 3rd class whose title is Mr.
Let's find the subset of passengers who also have the same gender, class, and title as this passenger and then assign him the median fare of that subset.

In [11]:
subset = combined_data.loc[(combined_data["Sex"]=="male")
                          & (combined_data["Pclass"]==3)
                          & (combined_data["Title"]=="Mr")]

# Fill empty fare with median
combined_data["Fare"] = combined_data["Fare"].fillna(float(subset["Fare"].median()))

In [12]:
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
level_0        1309 non-null int64
Age            1309 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Ticket         1309 non-null object
Title          1309 non-null object
index          418 non-null float64
dtypes: float64(3), int64(5), object(6)
memory usage: 143.2+ KB


Now we do a similar thing for the missing Embarked values, but we replace missing values with the mode instead.
We also make sure we do not leak data from the test set to the training set.

In [13]:
missing_embarked_index = combined_data.index[combined_data["Embarked"].isnull()]
missing_embarked_rows = combined_data.iloc[missing_embarked_index]
display(missing_embarked_rows)

Unnamed: 0,level_0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Ticket,Title,index
61,61,38.0,B28,,80.0,"Icard, Miss. Amelie",0,62,1,female,0,113572,Ms,
829,829,62.0,B28,,80.0,"Stone, Mrs. George Nelson (Martha Evelyn)",0,830,1,female,0,113572,Mrs,


In [14]:
# Use only the training rows so that nothing leaks from the test set
train_rows = combined_data.head(891)
for index, row in missing_embarked_rows.iterrows():
    subset = train_rows.loc[(train_rows["Sex"]==row["Sex"])
                            & (train_rows["Pclass"]==row["Pclass"])
                            & (train_rows["Title"]==row["Title"])]
    # Fill empty fields with mode
    combined_data.at[index, "Embarked"] = str(subset["Embarked"].mode().iloc[0])

In [15]:
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
level_0        1309 non-null int64
Age            1309 non-null float64
Cabin          295 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Ticket         1309 non-null object
Title          1309 non-null object
index          418 non-null float64
dtypes: float64(3), int64(5), object(6)
memory usage: 143.2+ KB


Fare and Embarked now have no missing values.

### Creating Family_Size

Now we are going to create the **Family_Size** feature by adding **SibSp**, **Parch**, and 1 (for the passenger themselves). This makes sense because **Family_Size** might be a better predictor than either of the other features separately: families tend to stick together, which probably affected their chances of survival. We then drop the **SibSp** and **Parch** columns because the information they contain is summarized by the new **Family_Size** feature.

In [16]:
combined_data["Family_Size"] = combined_data["SibSp"] + combined_data["Parch"] + 1

# Drop all columns we don't need
combined_data_dropped = combined_data.drop(columns=["Name", "SibSp", "Parch", "Cabin", "Ticket", "PassengerId", "index", "level_0"])

train_data_final = combined_data_dropped.head(891).copy().reset_index(drop=True)
display(train_data_final.head())
test_data_final = combined_data_dropped.iloc[891:].copy()
display(test_data_final.head())

Unnamed: 0,Age,Embarked,Fare,Pclass,Sex,Title,Family_Size
0,22.0,S,7.25,3,male,Mr,2
1,38.0,C,71.2833,1,female,Mrs,2
2,26.0,S,7.925,3,female,Ms,1
3,35.0,S,53.1,1,female,Mrs,2
4,35.0,S,8.05,3,male,Mr,1


Unnamed: 0,Age,Embarked,Fare,Pclass,Sex,Title,Family_Size
891,34.5,Q,7.8292,3,male,Mr,1
892,47.0,S,7.0,3,female,Mrs,2
893,62.0,Q,9.6875,2,male,Mr,1
894,27.0,S,8.6625,3,male,Mr,1
895,22.0,S,12.2875,3,female,Mrs,3


## Training the algorithm and fine-tuning the model

In [19]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


0,1
H2O cluster uptime:,17 secs
H2O cluster version:,3.10.4.8
H2O cluster version age:,6 months and 12 days !!!
H2O cluster name:,H2O_from_python_Example_am1iva
H2O cluster total nodes:,1
H2O cluster free memory:,237.2 Mb
H2O cluster total cores:,2
H2O cluster allowed cores:,2
H2O cluster status:,"accepting new members, healthy"
H2O connection url:,http://localhost:54321


We convert the data we are working with into H2O dataframes so that the H2O algorithms can work with them. We also append the survival data back to the training set.

In [22]:
# We append the survival data back to the training set
train_data_final_surv = train_data_final.join(targets) 

#train_data_final_surv["Survived"] = train_data_final_surv["Survived"].astype("category")

train_data_final_surv.head()

Unnamed: 0,Age,Embarked,Fare,Pclass,Sex,Title,Family_Size,Survived
0,22.0,S,7.25,3,male,Mr,2,0
1,38.0,C,71.2833,1,female,Mrs,2,1
2,26.0,S,7.925,3,female,Ms,1,1
3,35.0,S,53.1,1,female,Mrs,2,1
4,35.0,S,8.05,3,male,Mr,1,0


In [23]:
# Convert to H2O dataframes
h2o_train_data = h2o.H2OFrame(train_data_final_surv)
h2o_test_data = h2o.H2OFrame(test_data_final)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [29]:
# We make sure the survival values (1 and 0) are treated as categorical variables
h2o_train_data["Survived"] = h2o_train_data["Survived"].asfactor() 
h2o_train_data.types

{'Age': 'real',
 'Embarked': 'enum',
 'Family_Size': 'int',
 'Fare': 'real',
 'Pclass': 'int',
 'Sex': 'enum',
 'Survived': 'enum',
 'Title': 'enum'}

In [37]:
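# Save data frames so we can load them again without going through the whole analysis
# (uncomment the two download lines the first time, so that the CSV files exist)
#h2o.download_csv(h2o_train_data,"h2o_train_data.csv")
#h2o.download_csv(h2o_test_data,"h2o_test_data.csv")
h2o_train_data = h2o.import_file("h2o_train_data.csv")
h2o_test_data = h2o.import_file("h2o_test_data.csv")
# A reloaded CSV parses Survived as a number again, so convert it back to a categorical
h2o_train_data["Survived"] = h2o_train_data["Survived"].asfactor()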

In [32]:
# Get names of columns that contain the features we want to train the algorithm on
feature_names = h2o_train_data.col_names[0:-1]
# Get name of column that has targets
targets_name = h2o_train_data.col_names[-1]
print(targets_name)

Survived


In [33]:
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [84]:
# Now let's create a random forest instance
RF = H2ORandomForestEstimator(ntrees=400,
                              nfolds=10,
                              binomial_double_trees=True,
                              stopping_metric = "AUC",
                              stopping_tolerance = 0,
                              stopping_rounds = 15)
                              
# Other settings we experimented with:
#   stopping_tolerance=0.001 (0.1% after 9 trees), stopping_rounds=3, score_tree_interval=3
#   model_id="RF_200t_6nf_s2", ntrees=200, nfolds=6, seed=2
#   binomial_double_trees=True
#   stopping_metric="misclassification", stopping_rounds=3, stopping_tolerance=0.02
# Sampling parameters to explore (each takes a value between 0 and 1):
#   sample_rate, col_sample_rate_per_tree

Here `nfolds` refers to the number of folds we use for cross-validation. This basically allows us to train multiple models on different subsets of the data to get a better sense of the model's accuracy.
<br>
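
To illustrate what the folds look like, here is a minimal pure-Python sketch that splits 891 row indexes into 10 folds. It uses contiguous folds for simplicity; H2O's own fold assignment may differ (for example, it can assign rows to folds at random), and `k_fold_indices` is a hypothetical helper, not an H2O function.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs, one per fold, using contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size

folds = list(k_fold_indices(n_rows=891, n_folds=10))
for fold_number, (train_idx, valid_idx) in enumerate(folds, start=1):
    print(fold_number, len(train_idx), len(valid_idx))
```

Every row is used for validation exactly once, and each model is scored only on rows it did not see during training.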

There are dozens of extra parameters we can adjust here. We can choose the number of trees, the depth of each tree, the number of variables considered at each split, and so on. There is a very useful class called `H2OGridSearch` that trains different models with different parameters and allows us to pick the best ones.

In [None]:
# First we define the parameters we want to try
hyper_params = {"max_depth": [20, 40, 60],
                "min_rows": [1, 2, 3, 4],
                "sample_rate": [0.5, 0.6, 0.7, 0.8],
                "mtries": [1, 2, 3, 4],
                "col_sample_rate_per_tree": [0.8, 0.9, 1]}

# Then we create a grid search instance
from h2o.grid.grid_search import H2OGridSearch
grid_search = H2OGridSearch(RF, hyper_params)
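
Before launching a full grid search, it helps to know how many parameter combinations it covers. This sketch enumerates the Cartesian product of the values above with `itertools.product`; it only counts combinations and does not call H2O.

```python
from itertools import product

hyper_params = {"max_depth": [20, 40, 60],
                "min_rows": [1, 2, 3, 4],
                "sample_rate": [0.5, 0.6, 0.7, 0.8],
                "mtries": [1, 2, 3, 4],
                "col_sample_rate_per_tree": [0.8, 0.9, 1]}

# Build one dictionary per combination of parameter values
names = list(hyper_params)
combinations = [dict(zip(names, values))
                for values in product(*hyper_params.values())]

print(len(combinations))
print(combinations[0])
```

With 3 × 4 × 4 × 4 × 3 = 576 combinations, each cross-validated, an exhaustive search is expensive, so trimming the lists or sampling a subset of combinations may be worth considering.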


In [85]:
RF.train(feature_names, targets_name, training_frame=h2o_train_data)


In [75]:
RF
#.model_performance(h2o_train_data)

Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  DRF_model_python_1512328784826_5994


ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.127435504337922
RMSE: 0.35698109801209643
LogLoss: 0.4221182466499098
Mean Per-Class Error: 0.18353944971718916
AUC: 0.8755179539620149
Gini: 0.7510359079240299
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.6042296603436913: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,508.0,41.0,0.0747,(41.0/549.0)
1,100.0,242.0,0.2924,(100.0/342.0)
Total,608.0,283.0,0.1582,(141.0/891.0)


Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.6042297,0.7744,149.0
max f2,0.1795706,0.8108108,272.0
max f0point5,0.7670006,0.8278956,106.0
max accuracy,0.6042297,0.8417508,149.0
max precision,0.9998995,1.0,0.0
max recall,0.0094535,1.0,398.0
max specificity,0.9998995,1.0,0.0
max absolute_mcc,0.6042297,0.6611558,149.0
max min_per_class_accuracy,0.3350648,0.8040936,218.0


Gains/Lift Table: Avg response rate: 38.38 %



0,1,2,3,4,5,6,7,8,9,10,11
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0123457,0.9984817,2.6052632,2.6052632,1.0,1.0,0.0321637,0.0321637,160.5263158,160.5263158
,2,0.0213244,0.9980995,2.6052632,2.6052632,1.0,1.0,0.0233918,0.0555556,160.5263158,160.5263158
,3,0.0303030,0.9970430,2.6052632,2.6052632,1.0,1.0,0.0233918,0.0789474,160.5263158,160.5263158
,4,0.0404040,0.9960337,2.6052632,2.6052632,1.0,1.0,0.0263158,0.1052632,160.5263158,160.5263158
,5,0.0516274,0.9947166,2.6052632,2.6052632,1.0,1.0,0.0292398,0.1345029,160.5263158,160.5263158
,6,0.1010101,0.9809199,2.6052632,2.6052632,1.0,1.0,0.1286550,0.2631579,160.5263158,160.5263158
,7,0.1503928,0.9543144,2.6052632,2.6052632,1.0,1.0,0.1286550,0.3918129,160.5263158,160.5263158
,8,0.2008979,0.9084406,2.6052632,2.6052632,1.0,1.0,0.1315789,0.5233918,160.5263158,160.5263158
,9,0.3007856,0.7100722,2.5759905,2.5955420,0.9887640,0.9962687,0.2573099,0.7807018,157.5990538,159.5542027




ModelMetricsBinomial: drf
** Reported on cross-validation data. **

MSE: 0.13093557001117437
RMSE: 0.3618502038291182
LogLoss: 0.4399389772996845
Mean Per-Class Error: 0.18756590930879113
AUC: 0.8690788142183024
Gini: 0.7381576284366047
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.6109921179219233: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,510.0,39.0,0.071,(39.0/549.0)
1,104.0,238.0,0.3041,(104.0/342.0)
Total,614.0,277.0,0.1605,(143.0/891.0)


Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.6109921,0.7689822,147.0
max f2,0.1737167,0.8094981,274.0
max f0point5,0.6363673,0.8226950,140.0
max accuracy,0.6186200,0.8395062,145.0
max precision,1.0,1.0,0.0
max recall,0.0092021,1.0,396.0
max specificity,1.0,1.0,0.0
max absolute_mcc,0.6186200,0.6566955,145.0
max min_per_class_accuracy,0.3368762,0.7982456,216.0


Gains/Lift Table: Avg response rate: 38.38 %



0,1,2,3,4,5,6,7,8,9,10,11
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0101010,0.9981662,2.6052632,2.6052632,1.0,1.0,0.0263158,0.0263158,160.5263158,160.5263158
,2,0.0202020,0.9976224,2.0263158,2.3157895,0.7777778,0.8888889,0.0204678,0.0467836,102.6315789,131.5789474
,3,0.0303030,0.9962904,2.6052632,2.4122807,1.0,0.9259259,0.0263158,0.0730994,160.5263158,141.2280702
,4,0.0404040,0.9948718,2.6052632,2.4605263,1.0,0.9444444,0.0263158,0.0994152,160.5263158,146.0526316
,5,0.0505051,0.9926929,2.6052632,2.4894737,1.0,0.9555556,0.0263158,0.1257310,160.5263158,148.9473684
,6,0.1010101,0.9666982,2.4315789,2.4605263,0.9333333,0.9444444,0.1228070,0.2485380,143.1578947,146.0526316
,7,0.1503928,0.9250716,2.3684211,2.4302828,0.9090909,0.9328358,0.1169591,0.3654971,136.8421053,143.0282797
,8,0.2008979,0.8597475,2.4315789,2.4306086,0.9333333,0.9329609,0.1228070,0.4883041,143.1578947,143.0608645
,9,0.3007856,0.6321633,1.9319929,2.2650236,0.7415730,0.8694030,0.1929825,0.6812865,93.1992904,126.5023566



Cross-Validation Metrics Summary: 


0,1,2,3,4,5,6,7,8,9,10
,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid,cv_6_valid,cv_7_valid,cv_8_valid
accuracy,0.8543699,0.0220131,0.8558559,0.7786886,0.872549,0.8773585,0.8504673,0.8867925,0.8632479,0.85
auc,0.8645101,0.0337858,0.7782832,0.7989506,0.8652829,0.9048753,0.8802239,0.9096539,0.8631746,0.9156367
err,0.1456301,0.0220131,0.1441441,0.2213115,0.1274510,0.1226415,0.1495327,0.1132076,0.1367521,0.15
err_count,16.375,3.141208,16.0,27.0,13.0,13.0,16.0,12.0,16.0,18.0
f0point5,0.8228095,0.0295986,0.7916667,0.7635468,0.8682635,0.8564815,0.7932692,0.875576,0.8529412,0.7807309
f1,0.7949241,0.0421868,0.7037037,0.6966292,0.8169014,0.8505747,0.8048781,0.8636364,0.7837838,0.8392857
f2,0.77388,0.0664850,0.6333333,0.6404959,0.7712766,0.8447488,0.8168317,0.8520179,0.725,0.9073359
lift_top_group,2.6438289,0.2426308,3.46875,2.3921568,2.6153846,2.409091,2.675,2.3555555,2.7857144,2.4489796
logloss,0.4384234,0.0555803,0.4872983,0.6086873,0.4794729,0.3785191,0.4091382,0.3736669,0.4175506,0.3530543


Scoring History: 


0,1,2,3,4,5,6,7,8
,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_lift,training_classification_error
,2017-12-03 17:21:01,57.818 sec,0.0,,,,,
,2017-12-03 17:21:01,57.861 sec,1.0,0.4235991,5.0897550,0.8123059,2.4691673,0.1724138
,2017-12-03 17:21:01,57.909 sec,2.0,0.4212999,4.9424512,0.8117206,2.5813617,0.1858238
,2017-12-03 17:21:01,57.957 sec,3.0,0.4174501,4.4697644,0.8104056,2.6052632,0.1859756
,2017-12-03 17:21:01,58.004 sec,4.0,0.4166071,4.2762428,0.8136201,2.6052632,0.1898396
---,---,---,---,---,---,---,---,---
,2017-12-03 17:21:04,1 min 1.171 sec,34.0,0.3634469,0.6808998,0.8719841,2.6052632,0.1705948
,2017-12-03 17:21:04,1 min 1.347 sec,35.0,0.3639197,0.6486683,0.8713743,2.6052632,0.1739618
,2017-12-03 17:21:04,1 min 1.518 sec,36.0,0.3633722,0.6483883,0.8712332,2.6052632,0.1739618



See the whole table with table.as_data_frame()
Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
Title,16446.0625000,1.0,0.2444879
Fare,12855.4248047,0.7816719,0.1911093
Age,12655.5585938,0.7695191,0.1881381
Sex,11858.7734375,0.7210707,0.1762930
Pclass,6632.9677734,0.4033165,0.0986060
Family_Size,4902.5878906,0.2981010,0.0728821
Embarked,1916.0163574,0.1165030,0.0284836
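
As a worked check on the report above, the headline metrics can be recomputed from the confusion matrix reported on the training data (508 and 41 in the first row, 100 and 242 in the second). The variable names below are our own.

```python
# Counts from the training-data confusion matrix (rows = actual, columns = predicted)
true_negatives, false_positives = 508, 41
false_negatives, true_positives = 100, 242

total = true_negatives + false_positives + false_negatives + true_positives

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_positives + true_negatives) / total

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The F1 score and accuracy computed this way agree with the `max f1` (0.7744) and `max accuracy` (0.8417508) rows of the training metrics table, which are both reported at the same threshold as the confusion matrix.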




Now we finally make our predictions on the test set and prepare them for submission.

In [76]:
predictions = RF.predict(h2o_test_data)

drf prediction progress: |████████████████████████████████████████████████| 100%


In [39]:
predictions.head()

predict,p0,p1
0,0.878779,0.121221
0,0.705082,0.294918
0,0.706405,0.293595
0,0.467629,0.532371
1,0.216161,0.783839
0,0.929172,0.070828
0,0.567109,0.432891
0,0.955731,0.0442692
1,0.15317,0.84683
0,0.947986,0.0520137

In [77]:
# Convert H2O dataframe back to pandas dataframe
predictions = predictions.as_data_frame()

predictions = predictions.drop(columns=["p0", "p1"])

# Create a PassengerId column
predictions.insert(0, "PassengerId", range(892, 1310))

# Rename predict column to Survived 
predictions.rename(columns={"predict": "Survived"}, inplace=True)

predictions.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1


In [78]:
# Export predictions to csv file
predictions.to_csv("predictions.csv", index=False)
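
Before uploading, a quick sanity check of the file can catch formatting slips. The sketch below assumes the layout produced above (a `PassengerId` column running from 892 and a 0/1 `Survived` column, 418 rows); `check_submission` is a hypothetical helper that works on the CSV text.

```python
import csv
import io

def check_submission(csv_text, expected_rows=418, first_id=892):
    """Return a list of problems found in a submission CSV given as text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for offset, row in enumerate(rows):
        if set(row) != {"PassengerId", "Survived"}:
            problems.append("columns must be PassengerId and Survived")
            break
        if row["PassengerId"] != str(first_id + offset):
            problems.append(f"unexpected PassengerId on row {offset}")
            break
        if row["Survived"] not in ("0", "1"):
            problems.append(f"Survived must be 0 or 1 on row {offset}")
            break
    return problems

# Small example with three rows instead of the full 418
sample = "PassengerId,Survived\n892,0\n893,0\n894,0\n"
print(check_submission(sample, expected_rows=3))
```

For the real file, the text could be read with `open("predictions.csv").read()` and passed in with the default arguments.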

In [56]:
h2o.cluster().shutdown()

H2O session _sid_ad35 closed.
