# Titanic: Machine Learning From Disaster
### by Sung Anh and Abdul Saleh

<hr>
## Introduction
In this project, we use random forest and gradient boosting machine learning algorithms to predict who survived the sinking of the RMS Titanic. Along the way, we go through the whole data science process, from understanding the problem and getting the data to fine-tuning our models and visualizing our results.

The Titanic dataset is perhaps the most widely analyzed dataset of all time. There is a wealth of incredible tutorials online exploring different approaches to analyzing this dataset. So in our own analysis, we draw on the experiences of the huge community of amazing people who have already attempted this problem and shared their conclusions online. We would especially like to thank [Manav Sehgal](https://www.kaggle.com/startupsci/titanic-data-science-solutions), [Jeff Delaney](https://www.kaggle.com/jeffd23/scikit-learn-ml-from-start-to-finish), and [Ahmed Besbes](https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html), without whose insights this project would not have become a reality.

<hr>
## Outline 
1. Understanding the problem
2. Getting the data
3. Exploring the data 
4. Picking a machine learning algorithm 
5. Preparing the data for machine learning algorithms
6. Training algorithm and fine-tuning model
7. Visualizing results and presenting solution 


<hr>
## Understanding the problem
Before we dive into the data analysis and algorithms, we first ask ourselves: is this even a problem that can be solved with machine learning? <br>
Luckily for us, lots of books have been written about the sinking of the Titanic. So before we look at the data, we do some background reading and discover that some patterns might exist, most notably: 

1. Women and children generally got first priority on the lifeboats.
    - This tells us that we should look at each passenger's sex and age.
2. There was a lot of confusion during the sinking of the ship, and some people chose to stay on the Titanic for arbitrary reasons.
    - There are definitely anomalies in this dataset, so we should not expect any pattern to hold for every passenger.

Aha, so it seems like there are some patterns that can help us figure out who survived and who didn't. This looks like a great machine learning problem!

<hr>
## Getting the data
Kaggle, a platform for data science competitions, has kindly compiled a dataset that is perfect for our needs and put it on their [website](https://www.kaggle.com/c/titanic/data) for budding machine learning enthusiasts to use. The training dataset tells us who survived and who didn't so we can use it to train our model. The test set doesn't tell us the fate of the passengers - that's what we're supposed to predict!

After downloading the datasets, we import the data and a few libraries that we will use later on.  

In [8]:
# for data analysis
import pandas as pd
import numpy as np

# for visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# for machine learning
from sklearn.ensemble import RandomForestClassifier

# import data 
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

<hr>
## Exploring the data

Now let's take a look at the data:

In [27]:
display(train_data)
print('_'*125)
display(test_data)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


_____________________________________________________________________________________________________________________________


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.2250,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0000,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.1500,,S


#### What do **Pclass**, **SibSp**, and **Parch** mean? <br>
According to Kaggle, **Pclass** = passenger class, **SibSp** = # of siblings/spouses aboard the Titanic, **Parch** = # of parents/children aboard the Titanic.
<br>

#### What are the important feature types? 
- Categorical features: **Survived, Sex, Embarked, Pclass**
- Numerical features:
  - Discrete: **SibSp, Parch**
  - Continuous : **Fare, Age**
- Alphanumeric features: **Cabin, Ticket**


In [10]:
# to find out size of data
print("training data dimensions:", train_data.shape)

training data dimensions: (891, 12)


In [11]:
# What are the data types? Are there missing values?
train_data.info()
print('_'*125)
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
_____________________________________________________________________________________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp     

#### What are the missing values? 
- From training set:
    - 687 missing **Cabin** values
    - 177 missing **Age** values
    - 2 missing **Embarked** values
- From test set:
    - 327 missing **Cabin** values
    - 86 missing **Age** values

In [29]:
# Summarize integer and float type features
train_data.describe()
# Show more details about a specific feature
# train_data["Pclass"].describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


#### What do the numerical features tell us?
- The sample's survival rate is ~38%, which is a bit higher than the actual 32%
- Most passengers on board were in 3rd class, while fewer than 25% were in 1st class
- About 75% of passengers were 38 years old or younger, and the mean age was about 30
- At least 75% of passengers did not travel with their kids or their parents
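
The observations above are read off the quartiles, but the underlying shares can also be computed exactly. Here is a small sketch under the assumption of a made-up `toy` frame (the values are illustrative, not taken from the dataset); on the real data the same expressions would be applied to `train_data`.

```python
import pandas as pd

# Toy stand-in for train_data; the values are made up for illustration
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 2, 3],
    "Parch": [0, 0, 2, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate
survival_rate = toy["Survived"].mean()

# Share of passengers in each class
class_shares = toy["Pclass"].value_counts(normalize=True)

# Share of passengers travelling without parents or children
no_parch_share = (toy["Parch"] == 0).mean()

print(survival_rate, no_parch_share)
print(class_shares)
```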


In [12]:
# Summarize object type features
train_data.describe(include=["O"])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"de Mulder, Mr. Theodore",male,1601,C23 C25 C27,S
freq,1,577,7,4,644


### What this tells us

Hmm, it looks like we have some missing ages and lots of missing cabin numbers. We will have to figure out ways to deal with those later on.

<hr>
## Preparing the data for machine learning algorithms
First we start by combining the training dataset and the test dataset so that we can edit them both together and ensure they end up in the same format. 


In [15]:
# Keep an independent backup of the original training data
backup_copy = train_data.copy()
# Store the survival data
targets = train_data.Survived

# Drop survival data so training set and test set have the same columns and can be combined
train_data_dropped = train_data.drop(columns=["Survived"])

# Combine train and test data (DataFrame.append was removed in pandas 2.0)
combined_data = pd.concat([train_data_dropped, test_data])

combined_data.shape

(1309, 11)

The training data had 891 entries and the test data had 418 entries. $418 + 891 = 1309$ entries, so this looks good!
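
Because the training rows come first in the combined frame, the two parts can later be recovered by position. The sketch below illustrates this on two tiny made-up frames (`train_toy` and `test_toy` are hypothetical stand-ins, and the values are invented); it uses `pd.concat`, which is available across pandas versions.

```python
import pandas as pd

# Tiny made-up stand-ins for train_data and test_data
train_toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
})
test_toy = pd.DataFrame({"PassengerId": [4, 5], "Age": [34.5, 47.0]})

# Set the labels aside, then stack the two frames with matching columns
targets_toy = train_toy["Survived"]
combined_toy = pd.concat([train_toy.drop(columns=["Survived"]), test_toy])

# Split back by position: the first len(train_toy) rows are the training rows
n_train = len(train_toy)
train_part = combined_toy.iloc[:n_train]
test_part = combined_toy.iloc[n_train:]

print(combined_toy.shape, train_part.shape, test_part.shape)
```

Note that the original row labels are kept, so index values repeat in the combined frame; positional slicing with `iloc` avoids any ambiguity from that.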

### Extracting passenger titles: 

In [20]:
combined_data.sample(10)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
510,511,3,"Daly, Mr. Eugene Patrick",male,29.0,0,0,382651,7.75,,Q
125,126,3,"Nicola-Yarred, Master. Elias",male,12.0,1,0,2651,11.2417,,C
312,1204,3,"Sadowitz, Mr. Harry",male,,0,0,LP 1588,7.575,,S
453,454,1,"Goldenberg, Mr. Samuel L",male,49.0,1,0,17453,89.1042,C92,C
158,159,3,"Smiljanic, Mr. Mile",male,,0,0,315037,8.6625,,S
107,108,3,"Moss, Mr. Albert Johan",male,,0,0,312991,7.775,,S
330,1222,2,"Davies, Mrs. John Morgan (Elizabeth Agnes Mary...",female,48.0,0,2,C.A. 33112,36.75,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
131,1023,1,"Gracie, Col. Archibald IV",male,53.0,0,0,113780,28.5,C51,C
276,1168,2,"Parker, Mr. Clifford Richard",male,28.0,0,0,SC 14888,10.5,,S


Note in the above sample that passenger titles are always preceded by a comma and followed by a period. So we can create a function that splits the Name value at the comma and at the period to get the title.
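
To see the splitting logic on individual names before applying it to the whole column, here is a short sketch. The helper name `extract_title` is ours and purely illustrative; the names are taken from the sample above.

```python
def extract_title(name):
    """Return the text between the first comma and the following period."""
    return name.split(",")[1].split(".")[0].strip()

print(extract_title("Daly, Mr. Eugene Patrick"))      # Mr
print(extract_title("Nicola-Yarred, Master. Elias"))  # Master
print(extract_title("Gracie, Col. Archibald IV"))     # Col
```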

In [26]:
# Extract titles from names and place them in a new column
combined_data["Title"] = combined_data["Name"].map(lambda name: name.split(",")[1].split(".")[0].strip())

# Drop names column because we no longer need it
combined_data.drop(columns=["Name"], inplace=True)

# Show all different values in the Title column
combined_data.Title.value_counts()

Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Mlle              2
Major             2
Ms                2
Dona              1
Mme               1
the Countess      1
Jonkheer          1
Capt              1
Don               1
Lady              1
Sir               1
Name: Title, dtype: int64

Looks like it worked!
<br>
Here we don't need to worry about "binning" similar titles together because tree-based models like the random forest classifier can decide for themselves which groupings are most useful for predicting outcomes.

### Processing passenger ages:

If you recall from the data exploration step, there were 177 and 86 missing age values in the training set and the test set, respectively. We know that age is an important factor in determining survival, so we need to come up with a way to fill in the missing ages.

Here we are going to group people by their sex, class, and title and then use these groupings to determine the missing values. This approach was used by Ahmed Besbes [here](https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html).

In [33]:
# Select train data (first 891 rows) and group by sex, class, and title in that order
grouped_train_data = combined_data.iloc[:891].groupby(["Sex", "Pclass", "Title"])

# Select test data (remaining rows) and group by sex, class, and title in that order
grouped_test_data = combined_data.iloc[891:].groupby(["Sex", "Pclass", "Title"])

# Find and display medians of the numeric columns
display(grouped_train_data.median(numeric_only=True))
print('_'*125)
display(grouped_test_data.median(numeric_only=True))

Sex,Pclass,Title,PassengerId,Age,SibSp,Parch,Fare
female,1,Dr,797.0,49.0,0.0,0.0,25.9292
female,1,Lady,557.0,48.0,1.0,0.0,39.6
female,1,Miss,349.5,30.0,0.0,0.0,91.75
female,1,Mlle,676.5,24.0,0.0,0.0,59.4021
female,1,Mme,370.0,24.0,0.0,0.0,69.3
female,1,Mrs,506.5,41.5,1.0,0.0,79.425
female,1,the Countess,760.0,33.0,0.0,0.0,86.5
female,2,Miss,437.5,24.0,0.0,0.0,13.0
female,2,Mrs,438.0,32.0,1.0,0.0,26.0
female,2,Ms,444.0,28.0,0.0,0.0,13.0


_____________________________________________________________________________________________________________________________


Sex,Pclass,Title,PassengerId,Age,SibSp,Parch,Fare
female,1,Dona,1306.0,39.0,0.0,0.0,108.9
female,1,Miss,1074.0,32.0,0.0,0.0,158.20835
female,1,Mrs,1076.0,48.0,1.0,0.0,63.3583
female,2,Miss,1121.0,19.5,1.0,1.0,24.5
female,2,Mrs,1123.5,29.0,0.0,0.0,26.0
female,3,Miss,1090.5,22.0,0.0,0.0,7.8792
female,3,Mrs,1051.0,28.0,1.0,1.0,14.4542
female,3,Ms,980.0,,0.0,0.0,7.75
male,1,Col,1058.5,50.0,0.5,0.0,128.0125
male,1,Dr,1185.0,53.0,1.0,1.0,81.8583


So the function we want to create to fill in the missing ages first checks the passenger's sex, then their class, then their title, and uses that information to determine what age to give them. If we haven't seen that title before, we should just plug in the overall median age.

 <div class="alert alert-block alert-warning">Note that we have to be super careful not to introduce any information from the training data into the test data. The point of a predictive machine learning model is to make accurate predictions about new data that is *unseen* during the training.</div>
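
The filling logic described above can be sketched as follows. This is only one possible way to write it, not a finished implementation: the function name `fill_age_sketch`, the toy frames, and the choice of the training set's overall median as the fallback are all assumptions made for illustration. The key point is that both the group medians and the fallback are computed from training rows only and then reused for the test rows.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the training and test portions of combined_data
toy_train = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male"],
    "Pclass": [3, 3, 1, 1, 3],
    "Title": ["Mr", "Mr", "Mrs", "Mrs", "Mr"],
    "Age": [20.0, 30.0, 40.0, np.nan, np.nan],
})
toy_test = pd.DataFrame({
    "Sex": ["male", "female"],
    "Pclass": [3, 1],
    "Title": ["Mr", "Dona"],
    "Age": [np.nan, np.nan],
})

# Median age per (Sex, Pclass, Title) group, computed from training rows only.
# Groups whose ages are all missing are dropped so they use the fallback instead.
group_medians = (
    toy_train.groupby(["Sex", "Pclass", "Title"])["Age"].median().dropna().to_dict()
)
# Assumed fallback: overall median age of the training rows
fallback_age = toy_train["Age"].median()


def fill_age_sketch(row, medians, fallback):
    """Return the known age, else the group median, else the fallback."""
    if pd.notnull(row["Age"]):
        return row["Age"]
    key = (row["Sex"], row["Pclass"], row["Title"])
    return medians.get(key, fallback)


# Apply the same training-derived values to both parts
toy_train["Age"] = toy_train.apply(
    fill_age_sketch, axis=1, args=(group_medians, fallback_age)
)
toy_test["Age"] = toy_test.apply(
    fill_age_sketch, axis=1, args=(group_medians, fallback_age)
)

print(toy_train["Age"].tolist())
print(toy_test["Age"].tolist())
```

In the toy test frame, the `Mr` passenger receives the median of the matching training group, while `Dona`, a title that does not appear in the toy training rows, receives the fallback.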

In [None]:
# function that fills in missing ages
def age_filler(row, grouped_median):
    
    
