# Titanic: Machine Learning From Disaster
### by Sung Ahn and Abdul Saleh

<hr>
## Introduction
In this project, we use random forests and (gradient boosting) machine learning algorithms to predict who survived the sinking of the RMS Titanic. On our journey to achieving this goal, we go through the whole data science process from understanding the problem and getting the data to fine-tuning our models and visualizing our results. 

The Titanic dataset is perhaps the most widely analyzed dataset of all time. There exists a wealth of incredible tutorials online exploring different approaches to analyzing this dataset. So in our own analysis, we draw on the experiences of the huge community of amazing people who have already attempted this problem and shared their conclusions online. We would especially like to thank [Manav Sehgal](https://www.kaggle.com/startupsci/titanic-data-science-solutions), [Jeff Delaney](https://www.kaggle.com/jeffd23/scikit-learn-ml-from-start-to-finish), and [Ahmed Besbes](https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html), without whose insight this project would not have become a reality.

<hr>
## Outline 
1. Understanding the problem
2. Getting the data
3. Exploring the data 
4. Picking a machine learning algorithm 
5. Preparing the data for machine learning algorithms
6. Training algorithm and fine-tuning model
7. Visualizing results and presenting solution 


<hr>
## Understanding the problem
Before we dive into the data analysis and algorithms, we first ask ourselves: is this even a problem that can be solved with machine learning? <br>
Luckily for us, lots of books have been written about the sinking of the Titanic. So before we look at the data, we do some background reading and discover that some patterns might exist, most notably: 

1. Women and children generally got first priority on the life boats 
    - This tells us that we should look for 
2. 
    -
3. There was a lot of confusion during the sinking of the ship and people chose to stay on the Titanic for arbitrary reasons. 
    - There are definitely anomalies in this dataset because it is clear that some people 
    through we conclused th

Aha, so it seems like there are some patterns that can help us figure out who survived and who didn't. This looks like a great machine learning problem!

<hr>
## Getting the data
Kaggle, a platform for data science competitions, has kindly compiled a dataset that is perfect for our needs and put it on their [website](https://www.kaggle.com/c/titanic/data) for budding machine learning enthusiasts to use. The training dataset tells us who survived and who didn't so we can use it to train our model. The test set doesn't tell us the fate of the passengers - that's what we're supposed to predict!

After downloading the datasets, we import the data and a few libraries that we will use later on.  

In [None]:
# for data analysis and wrangling
import pandas as pd
import numpy as np
from fancyimpute import KNN
from math import ceil

# for visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# for machine learning
#from sklearn.ensemble import RandomForestClassifier

# import data 
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

<hr>
## Exploring the data

Now let's take a look at the data:

In [None]:
display(train_data.head())
print('_'*125)
display(test_data.head())

#### What do **Pclass**, **SibSp**, and **Parch** mean? <br>
According to Kaggle, **Pclass** = passenger class, **SibSp** = # of siblings/spouses aboard the Titanic, **Parch** = # of parents/children aboard the Titanic. 
<br>

#### What are the important feature types? 
- Categorical features: **Survived, Sex, Embarked, Pclass**
- Numerical features:
  - Discrete: **SibSp, Parch**
  - Continuous : **Fare, Age**
- Alphanumeric features: **Cabin, Ticket**


In [None]:
# to find out size of data
print("training data dimensions:", train_data.shape)

In [None]:
# What are the data types? Are there missing values?
train_data.info()
print('_'*125)
test_data.info()

#### What are the missing values? 
- From training set:
    - 687 missing **Cabin** values
    - 177 missing **Age** values
    - 2 missing **Embarked** values
- From test set:
    - 327 missing **Cabin** values
    - 86 missing **Age** values
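
The counts above can be read off `info()`, but a direct tally per column is easier to scan. Here is a minimal sketch, assuming a small made-up frame (`toy`) that stands in for `train_data` or `test_data`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the notebook this would be train_data or test_data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing entries per column and keep only the columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Running the same two lines on the real data should reproduce the numbers listed above.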

In [None]:
# Summarize integer and float type features
train_data.describe()
# Show more details about specific features
#data_train.describe(percentiles=[], include=["Pclass"])

#### What do numerical features tell us?
- The sample's survival rate is ~38%, which is a bit higher than the actual 32%
- Most passengers on board were in 3rd class, while fewer than 25% were in 1st class
- About 75% of passengers were 38 or younger and the mean age was about 30
- More than 75% of passengers did not travel with their kids or their parents
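
Each of these observations comes from a single summary statistic. The sketch below shows how they could be computed directly, using a tiny made-up frame (`toy`) rather than the real data, so the numbers here are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic set
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 0],
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Parch": [0, 0, 0, 0, 0, 1, 2, 0],
})

# Survived is coded 0/1, so its mean is the survival rate
survival_rate = toy["Survived"].mean()

# 75th percentile of age: three quarters of passengers are at or below it
age_q75 = toy["Age"].quantile(0.75)

# Share of passengers travelling without parents or children
share_no_parch = (toy["Parch"] == 0).mean()

print(survival_rate, age_q75, share_no_parch)
```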


In [None]:
# Summarize object type features
train_data.describe(include=["O"])

### What this tells us

Hmm, it looks like we have a few missing Embarked values, some missing ages, and lots of missing cabin numbers. We will have to figure out ways to deal with those later on. 

<hr>
## Picking an algorithm

<hr>
## Preparing the data for machine learning algorithms
First we start by combining the training dataset and the test dataset so that we can edit them both together and ensure they end up in the same format. 


In [None]:
# Store the survival data
targets = train_data.Survived

# Drop survival data so training set and test set have the same shape and can be combined
train_data_dropped = train_data.drop(columns=["Survived"])

# Combine train and test data (DataFrame.append was removed in pandas 2.0)
combined_data = pd.concat([train_data_dropped, test_data])

combined_data.shape

The training data had 891 entries and the test data had 418 entries. $418 + 891 = 1309$ entries, so this looks good!
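
One detail worth keeping in mind when stacking two frames is that each keeps its own index labels, so the combined frame repeats labels unless the index is reset. The sketch below illustrates this on two made-up frames (`train_toy`, `test_toy`) and shows how the rows can be split apart again by position.

```python
import pandas as pd

# Hypothetical miniature versions of the train and test sets
train_toy = pd.DataFrame({"PassengerId": [1, 2, 3],
                          "Survived": [0, 1, 1],
                          "Fare": [7.25, 71.28, 7.92]})
test_toy = pd.DataFrame({"PassengerId": [4, 5], "Fare": [8.05, 8.46]})

# Set the labels aside, then stack the feature columns
labels = train_toy["Survived"]
features = train_toy.drop(columns=["Survived"])
combined_toy = pd.concat([features, test_toy])

# Index labels repeat (0, 1, 2, 0, 1), so split by position rather than by label
n_train = len(features)
train_again = combined_toy.iloc[:n_train]
test_again = combined_toy.iloc[n_train:]
print(combined_toy.shape, list(combined_toy.index))
```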

### Extracting passenger titles: 

If you look back at the data you will notice that passenger titles are always preceded by a comma and followed by a period. So we can create a function that splits the Name value at the comma and at the period to get the title.
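
To make the two splits concrete, here is a small worked example on three names in the dataset's `Surname, Title. Given names` format. The helper name `extract_title` is ours and only serves the illustration.

```python
def extract_title(name):
    # "Braund, Mr. Owen Harris" -> " Mr. Owen Harris" -> " Mr" -> "Mr"
    after_comma = name.split(",")[1]
    before_period = after_comma.split(".")[0]
    return before_period.strip()

sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]

titles = [extract_title(name) for name in sample_names]
print(titles)  # ['Mr', 'Miss', 'the Countess']
```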

In [None]:
# Extract titles from names and place them in a new column
combined_data["Title"] = combined_data["Name"].map(lambda name: name.split(",")[1].split(".")[0].strip())

# Drop names column because we no longer need it
combined_data.drop(columns=["Name"], inplace=True)

# Show all different values in the Title column
combined_data.Title.value_counts()

In [None]:
#back up copy
back_up2 = combined_data.copy()

Looks like it worked!
<br>
We can also group similar titles together to simplify our model and reduce the risk of "overfitting." Overfitting happens when a model is so complex that it predicts outcomes for the training data very well but predicts outcomes for unseen data poorly. For example, it would be nice if our model knew how to predict outcomes for countesses, but chances are there weren't many countesses on the Titanic, so it is better to group royalty and nobility together so that our model generalizes better when looking at new data.
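
One thing to watch for when grouping with a dictionary: `Series.map` turns any title that is missing from the dictionary into `NaN`. The sketch below, on a made-up list of titles and a shortened dictionary, shows one way to spot such titles; the `"Other"` fallback is our assumption, not something the notebook does.

```python
import pandas as pd

# Hypothetical sample of raw titles and a shortened grouping dictionary
titles = pd.Series(["Mr", "Miss", "the Countess", "Sir", "Dona"])
grouping = {
    "Mr": "Mr",
    "Miss": "Ms",
    "the Countess": "Royal or Noble",
    "Sir": "Royal or Noble",
}

grouped = titles.map(grouping)

# Titles that were not in the dictionary come back as NaN
unmapped = titles[grouped.isnull()].unique().tolist()
print("unmapped titles:", unmapped)

# Assumed fallback: place anything unmapped in a catch-all group
grouped_filled = grouped.fillna("Other")
print(grouped_filled.value_counts())
```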

In [None]:
Title_dict = {
    "Mr":"Mr",
    "Miss":"Ms",
    "Mrs":"Mrs",
    "Master":"Master",
    "Dr":"Other",
    "Rev":"Other",
    "Col":"Military",
    "Ms":"Ms",
    "Mlle":"Ms",
    "Major":"Military",
    "Don":"Royal or Noble",
    "the Countess":"Royal or Noble",
    "Lady":"Royal or Noble",
    "Dona":"Royal or Noble",  # the dataset spells this title "Dona"
    "Sir":"Royal or Noble",
    "Mme":"Mrs",
    "Jonkheer":"Royal or Noble",
    "Capt":"Other"
}

# Group titles using mappings in dictionary above
combined_data["Title"] = combined_data["Title"].map(Title_dict) 

combined_data["Title"].value_counts()

### Processing passenger ages:

If you recall from the data exploration step, there were 177 and 86 missing age values in the training set and the test set respectively. We know that age is an important factor in determining survival, so we need to come up with a way to fill in the missing ages.

Here we are going to group people by their gender, class, and title and then use these groupings to determine the missing values.
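
As a simpler point of comparison for this grouping idea, the sketch below fills each missing age with the median age of the passenger's group. It runs on a made-up frame (`toy`) and uses plain medians rather than the KNN estimate used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups, each with one missing age
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Title": ["Mr", "Mr", "Mr", "Mrs", "Mrs", "Mrs"],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median age of each (Sex, Pclass, Title) group, aligned row by row with toy
group_median = toy.groupby(["Sex", "Pclass", "Title"])["Age"].transform("median")

# Keep known ages and fill only the gaps with the group's median
toy["Age_filled"] = toy["Age"].fillna(group_median)
print(toy)
```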

In [None]:
# Select train data and group by Sex, class, and title in that order
grouped_train_data = combined_data.head(891).groupby(["Sex","Pclass", "Title"])

# Select test data and group by Sex, class, and title in that order
grouped_test_data = combined_data.iloc[891:].groupby(["Sex", "Pclass", "Title"])

# Find and display medians 
display(grouped_train_data.median(numeric_only=True))
print('_'*125)
display(grouped_test_data.median(numeric_only=True))

So the function we want to create to fill in the missing ages first checks the passenger's gender, then their class, then their title, and uses that information to determine what age to give them. If we haven't seen that title before, we should just plug in the median age.
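
Within each group, the code in this notebook estimates a missing age from passengers with similar fares. The sketch below is a deliberately simplified stand-in for that idea, written with NumPy only: it averages the ages of the `k` passengers whose fares are closest. The function name `knn_fill_age` and the sample values are ours, and fancyimpute's `KNN` differs in its details.

```python
import numpy as np

def knn_fill_age(ages, fares, k=2):
    """Fill NaN ages with the mean age of the k passengers with the closest fares."""
    ages = np.asarray(ages, dtype=float)
    fares = np.asarray(fares, dtype=float)
    filled = ages.copy()
    known = ~np.isnan(ages)
    for i in np.where(~known)[0]:
        # Distance between this passenger's fare and every fare with a known age
        distances = np.abs(fares[known] - fares[i])
        nearest = np.argsort(distances, kind="stable")[:k]
        filled[i] = ages[known][nearest].mean()
    return filled

# Hypothetical group of four passengers; the third has no recorded age
ages = [22.0, 24.0, np.nan, 60.0]
fares = [7.0, 8.0, 7.5, 80.0]
filled = knn_fill_age(ages, fares, k=2)
print(filled)
```

The passenger with the missing age paid a fare close to the two cheapest tickets, so the estimate is driven by those two ages rather than by the much older passenger with the expensive ticket.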

 <div class="alert alert-block alert-warning">Note that we have to be super careful not to introduce any information from the test data into the training data. The point of a predictive machine learning model is to make accurate predictions about new data that is *unseen* during the training.</div>
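
One way to respect this rule is to compute every statistic on the training rows only and then apply it to the test rows. The sketch below does this with per-title median ages on made-up frames (`train_toy`, `test_toy`), falling back to the overall training median for a title that never appears in training. It uses medians for brevity; the same discipline applies to any imputation method.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train and test sets
train_toy = pd.DataFrame({"Title": ["Mr", "Mr", "Mrs", "Mrs"],
                          "Age": [20.0, 30.0, 40.0, np.nan]})
test_toy = pd.DataFrame({"Title": ["Mr", "Mrs", "Dr"],
                         "Age": [np.nan, np.nan, np.nan]})

# Statistics come from the training data only
train_medians = train_toy.groupby("Title")["Age"].median()
overall_median = train_toy["Age"].median()

# Look up each test passenger's title; unseen titles get the overall median
fallback = test_toy["Title"].map(train_medians).fillna(overall_median)
test_toy["Age"] = test_toy["Age"].fillna(fallback)
print(test_toy)
```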

In [None]:
# Function to round approximations of missing ages to nearest 0.5             
def round_age(age):
    return round(age * 2) / 2


# Function that fills in missing ages
def age_filler(incomplete_data, grouped_data, Training=True, size=0):
    # Give every passenger a unique index label; the combined data repeats labels from the train and test sets
    if Training != True:
        incomplete_data = incomplete_data.reset_index(drop=True)
    
    for row in grouped_data:
        try:
            # This is a data frame that has passengers sharing the same gender, class, and title
            subset_incomplete_data = incomplete_data.loc[(incomplete_data["Sex"]==row[0][0]) 
                                                        & (incomplete_data["Pclass"]==row[0][1]) 
                                                        & (incomplete_data["Title"]==row[0][2])]
        # Skip to next row if no values are found
        except KeyError:
            continue
            
        # Extract age and fare columns
        subset_incomplete_data = subset_incomplete_data[["Age", "Fare"]]

        # Array of global indexes of passengers in subset with missing ages
        missing_ages_global_index = subset_incomplete_data.index[subset_incomplete_data["Age"].isnull()].tolist()
        missing_ages_global_index.sort()

        # Array of local indexes of passengers in subset with missing ages
        missing_ages_local_index = np.where(subset_incomplete_data["Age"].isnull())[0]
        missing_ages_local_index.sort()

        # Get number of passengers in subset_train_data, use this number to calculate number of KNN neighbours 
        subset_incomplete_data_size = subset_incomplete_data.shape[0]
        
        try:
            # Use KNN to fill in missing ages based on similarities between passengers' fares
            # Returns an unindexed but ordered numpy array of [Age, Fare] rows
            # (newer fancyimpute releases use fit_transform; older ones called this method complete)
            subset_complete_data = KNN(k=ceil(subset_incomplete_data_size*0.05)).fit_transform(subset_incomplete_data)
        # Handles case when there are no missing values
        except ValueError:
            continue
            
        counter = 0
        # Iterate over estimated ages in subset and place them back in the original dataset
        # Use indexes to match values from subset to original data set
        for passenger_local_index in missing_ages_local_index:
            passenger_estimated_age = round_age(subset_complete_data[passenger_local_index][0])
            passenger_global_index = missing_ages_global_index[counter]
            counter+=1
            # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell by label
            incomplete_data.at[passenger_global_index, "Age"] = passenger_estimated_age
            completed_data = incomplete_data
    # Handles test set
    if Training != True:
        return completed_data.iloc[size:] 
    # Handles training set
    else:
        return completed_data

train_data = combined_data.head(891).copy()

# Estimate missing ages in training set
train_data_filled = age_filler(train_data, grouped_train_data)

# Estimate missing ages in test set
test_data_filled = age_filler(combined_data.copy(), grouped_train_data, Training = False, size = 891)

# Combining training and test set, now with estimates for all ages
combined_data = pd.concat([train_data_filled, test_data_filled]).reset_index(drop=True)

In [None]:
combined_data.info()

In [None]:
combined_data = pd.concat([train_data_filled, test_data_filled]).reset_index(drop=True)
display(combined_data[888:893])

### Processing Fare and Embarked features

In this data set we have one missing fare value and two missing Embarked values, so let's just fill them in directly.
<br><br>
First: let's look at the passenger with a missing fare

In [None]:
missing_fare_index = combined_data.index[combined_data["Fare"].isnull()]
combined_data.iloc[missing_fare_index]

This passenger is a male in 3rd class whose title is Mr.
Let's find the subset of passengers who also have the same gender, class, and title as this passenger and then assign him the median fare of that subset.

In [None]:
subset = combined_data.loc[(combined_data["Sex"]=="male")
                          & (combined_data["Pclass"]==3)
                          & (combined_data["Title"]=="Mr")]

# Fill empty fare with median (assign the result back instead of filling a column in place)
combined_data["Fare"] = combined_data["Fare"].fillna(subset["Fare"].median())

In [None]:
combined_data.info()

Now we do a similar thing for the missing Embarked values, but we replace missing values with the mode instead.
We also make sure we do not leak data from the test set to the training set.
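
A small detail when using the mode: `Series.mode()` returns a Series (there can be ties), so a single value has to be picked from it before assigning. The sketch below shows this on a made-up frame (`toy`) in which the first three rows play the role of the training set.

```python
import pandas as pd

# Hypothetical toy data; the last row stands in for a passenger with no port recorded
toy = pd.DataFrame({"Pclass": [1, 1, 1, 3],
                    "Embarked": ["S", "C", "S", None]})

# Only the first three rows are treated as training data here
train_part = toy.iloc[:3]

# mode() returns a Series, so take its first entry as the most common port
mode_series = train_part["Embarked"].mode()
most_common_port = mode_series.iloc[0]

# Fill only the missing entries
toy.loc[toy["Embarked"].isnull(), "Embarked"] = most_common_port
print(toy)
```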

In [None]:
missing_embarked_index = combined_data.index[combined_data["Embarked"].isnull()]
missing_embarked_rows = combined_data.iloc[missing_embarked_index]
display(missing_embarked_rows)

In [None]:
# Only the training rows are used to find the most common port, to avoid leaking test data
train_rows = combined_data.head(891)
for index, row in missing_embarked_rows.iterrows():
    subset = train_rows.loc[(train_rows["Sex"]==row["Sex"])
                           & (train_rows["Pclass"]==row["Pclass"])
                           & (train_rows["Title"]==row["Title"])]
    # mode() returns a Series, so take its first entry
    combined_data.at[index, "Embarked"] = subset["Embarked"].mode().iloc[0]

In [None]:
combined_data.info()

Fare and Embarked now have no missing values.

### Creating Family_Size

Now we are going to create the **Family_Size** feature by adding **SibSp** and **Parch** and 1 (for the passenger themselves). This makes sense because a **Family_Size** feature might be a better predictor than either of the other features separately. Families tend to stick together, which probably affected their chances of survival. We then drop the **SibSp** and **Parch** columns because the information they contain is summarized by the new **Family_Size** feature.

In [None]:
combined_data["Family_Size"] = combined_data["SibSp"] + combined_data["Parch"] + 1

# "index" is only present if an earlier reset_index() kept the old index as a column
combined_data_dropped = combined_data.drop(columns=["SibSp", "Parch", "Cabin", "Ticket", "PassengerId", "index"], errors="ignore")

train_set_final = combined_data_dropped.head(891).copy()
display(train_set_final.head())
test_set_final = combined_data_dropped.iloc[891:].copy().reset_index(drop=True)
display(test_set_final.head())
display(test_set_final.head())

## Training algorithm and fine-tuning model

In [None]:
train_set_final.info()

In [None]:
import h2o
h2o.init()

We convert the data we are working with into H2O dataframes so that the H2O algorithms can work with them. 

In [None]:
h2o_train_data = h2o.H2OFrame(train_set_final)
h2o_test_data = h2o.H2OFrame(test_set_final)

# This is the survival data we put aside earlier
targets_frame = targets.to_frame();
h2o_targets = h2o.H2OFrame(targets_frame)

h2o_targets.shape
h2o_train_data.shape
# Combine survival data and training data again
#h2o_train_data_survival = h2o_train_data.cbind(h2o_targets)
#display(h2o_train_data_survival)

In [None]:
h2o_train_data.describe()
#display(h2o_test_data)
#display(targets_frame)

In [None]:
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [None]:
RF = H2ORandomForestEstimator(ntrees=200, nfolds=6, seed=2)

In [None]:
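# Attach the labels as a categorical column so that H2O trains a classifier
h2o_train_full = h2o_train_data.cbind(h2o_targets.asfactor())
RF.train(x=h2o_train_data.columns, y="Survived", training_frame=h2o_train_full)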

In [None]:
#print(type(targets))
#targets = targets.to_frame().reset_index()
#targets = targets.rename(columns={0:'list'})
#targets.index.name = 'index'
#display(targets)
train_data_surv = pd.read_csv("train.csv")
targets = train_data_surv.Survived

In [None]:
#targets.index.name = 'index'
#targets.reset_index()
#targets.to_frame()
#display(targets)
# targets is a Series with no "index" column to drop; just make sure its index is a clean 0..n-1 range
targets = targets.reset_index(drop=True)
display(targets)

In [None]:
h2o_targets = h2o.H2OFrame(targets_frame)
display(h2o_targets)