# Titanic: Machine Learning From Disaster
### by Sung Ahn and Abdul Saleh

<hr>
## Introduction
In this project, we use random forests and (gradient boosting) machine learning algorithms to predict who survived the sinking of the RMS Titanic. On our journey to achieving this goal, we go through the whole data science process from understanding the problem and getting the data to fine-tuning our models and visualizing our results. 

The Titanic dataset is perhaps the most widely analyzed dataset of all time. There exists a wealth of incredible tutorials online exploring different approaches to analyzing this dataset. So in our own analysis, we draw on the experiences of the huge community of amazing people who have already attempted this problem and shared their conclusions online. We would especially like to thank [Manav Sehgal](https://www.kaggle.com/startupsci/titanic-data-science-solutions), [Jeff Delaney](https://www.kaggle.com/jeffd23/scikit-learn-ml-from-start-to-finish), and [Ahmed Besbes](https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html) whom without their insight this project would not have become a reality.    

<hr>
## Outline 
1. Understanding the problem
2. Getting the data
3. Exploring the data 
4. Picking a machine learning algorithm 
5. Preparing the data for machine learning algorithms
6. Training algorithm and fine-tuning model
7. Visualizing results and presenting solution 


<hr>
## Understanding the problem
Before we dive into the data analysis and algorithms, we first ask ourselves: is this even a problem that can be solved with machine learning? <br>
Luckily for us, lots of books have been written about the sinking of the Titanic. So before we look at the data, we do some background reading and discover that some patterns might exist, most notably: 

1. Women and children generally got first priority on the life boats 
    - This tells us that we should look for 
2. 
    -
3. There was a lot of confusion during the sinking of the ship and people chose to stay on the Titanic for arbitrary reasons. 
    - There are definitley anomalies in this dataset because it is clear that some people 
    through we conclused th

Aha, so it seems like a there are some patterns that can help us figure out who survived and who didn't. This looks like a great machine learning problem!

<hr>
## Getting the data
Kaggle, a platform for data science competitions, has kindly compiled a dataset that is perfect for our needs and put it on their [website](https://www.kaggle.com/c/titanic/data) for budding machine learning enthusiasts to use. The training dataset tells us who survived and who didn't so we can use it to train our model. The test set doesn't tell us the fate of the passengers - that's what we're supposed to predict!

After downloading the datasets, we import the data and a few libraries that we will use later on.  

In [1]:
# for data analysis
import pandas as pd
import numpy as np

# for visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# for machine learning
from sklearn.ensemble import RandomForestClassifier

# import data 
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

<hr>
## Exploring the data

Now let's take a look at the data:

In [3]:
display(train_data.head())
print('_'*125)
display(test_data.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


_____________________________________________________________________________________________________________________________


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


#### What do **Pclass**, **SibSp**, and **Parch** mean? <br>
According to Kaggle, **Pclass** = passenger class, **Sibsp** = # of siblings/spouses aboard the Titanic, **Parch** = # of parents/children aboard the Titanic. 
<br>

#### What are the important feature types? 
- Categorical features: **Survived, Sex, Embarked, Pclass**
- Numerical features:
  - Discrete: **SibSp, Parch**
  - Continuous : **Fare, Age**
- Alphanumeric features: **Cabin, Ticket**


In [4]:
# to find out size of data
print("training data dimensions:", train_data.shape)

training data dimensions: (891, 12)


In [5]:
# What are the data types? Are there missing values?
train_data.info()
print('_'*125)
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
_____________________________________________________________________________________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp     

#### What are the missing values? 
- From training set:
    - 687 missing **Cabin** values
    - 177 missing **Age** values
    - 2 missing **Embarked** values
- From test set:
    - 327 missing **Cabin** values
    - 86 missing **Age** values

In [None]:
# Summarize integer and float type features
train_data.describe()
# Show more details about specific features
#data_train.describe(percentiles=[], include=["Pclass"])
#data_train.describe(percentiles=[], include=["Pclass"])
#data_train.describe(percentiles=[], include=["Pclass"])
#data_train.describe(percentiles=[], include=["Pclass"])
#data_train.describe(percentiles=[], include=["Pclass"])"""

#### What to numerical features tell us?
- The sample's survival rate is ~38% which is a bit higher than the actual 32%
- Most passengers on board were in 3rd class, while less that 25% where in 1st class
- More than 75% of passengers were less that 38 years old and the mean age was 30 
- More than 75% of passengers did not travel with their kids or their parents


In [4]:
# Summarize object type features
train_data.describe(include=["O"])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Levy, Mr. Rene Jacques",male,CA. 2343,B96 B98,S
freq,1,577,7,4,644


### What this tells us

Hmm, it looks like we have some missing em ages and lots of missing cabin numbers. We will have to figure out ways to deal with those later on. 

<hr>
## Preparing the data for machine learning algorithms
First we start by combining the training dataset and the test dataset so that we can edit them both together and ensure they end up in the same format. 


In [108]:
backup_copy = train_data
# Store the survival data
targets = train_data.Survived

# Drop survival data so training set and test set have the same shape and can be combined
train_data_dropped = train_data.drop(["Survived"], 1)

# Combine train and test data
combined_data = train_data_dropped.append(test_data)

combined_data.shape

(1309, 11)

The training data had 891 entries and the test data had 418 entries. $418 + 891 = 1309$ entires, so this looks good!

### Extracting passenger titles: 

If you look back at the data you will notice that passenger titles are always preceded by a comma and followed by a period. So we can create a function that splits the Name value at the comma and at the period to get the title.

In [126]:
# Extract titles from names and place them in a new column
combined_data["Title"] = combined_data["Name"].map(lambda name: name.split(",")[1].split(".")[0].strip())

# Drop names column because we no longer need it
combined_data.drop(["Name"], 1, inplace=True)

# Show all different values in the Title column
combined_data.Title.value_counts()


Mr              757
Miss            260
Mrs             197
Master           61
Dr                8
Rev               8
Col               4
Ms                2
Mlle              2
Major             2
Don               1
the Countess      1
Lady              1
Dona              1
Sir               1
Mme               1
Jonkheer          1
Capt              1
Name: Title, dtype: int64

In [8]:
#back up copy
back_up2 = combined_data.copy()

Looks like it worked!
<br>
Here we don't need to worry about "binning" similar titles together because tree based models like the random forest classifier can decide for themselves which binning is most useful for predicting outcomes. 

### Processing passenger ages:

If you recall from the data exploration step, there were about 177 and 86 missing age values from the training set and the data set respectively. We know that age is an important factor in determining survival so we need to come up with a way to fill in the missing ages.

Here we are going to group people by their gender, class, and title and then use these groupings to determine the missing values.

In [28]:
# Select train data and group by Sex, class, and title in that order
grouped_train_data = combined_data.head(891).groupby(["Sex","Pclass", "Title"])

# Select test data and group by Sex, class, and title in that order
grouped_test_data = combined_data.iloc[891:].groupby(["Sex", "Pclass", "Title"])

# Find and display medians 
display(grouped_train_data.median())
print('_'*125)
display(grouped_test_data.median())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,PassengerId,Age,SibSp,Parch,Fare
Sex,Pclass,Title,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
female,1,Dr,797.0,49.0,0.0,0.0,25.9292
female,1,Lady,557.0,48.0,1.0,0.0,39.6
female,1,Miss,349.5,30.0,0.0,0.0,91.75
female,1,Mlle,676.5,24.0,0.0,0.0,59.4021
female,1,Mme,370.0,24.0,0.0,0.0,69.3
female,1,Mrs,506.5,41.5,1.0,0.0,79.425
female,1,the Countess,760.0,33.0,0.0,0.0,86.5
female,2,Miss,437.5,24.0,0.0,0.0,13.0
female,2,Mrs,438.0,32.0,1.0,0.0,26.0
female,2,Ms,444.0,28.0,0.0,0.0,13.0


_____________________________________________________________________________________________________________________________


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,PassengerId,Age,SibSp,Parch,Fare
Sex,Pclass,Title,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
female,1,Dona,1306.0,39.0,0.0,0.0,108.9
female,1,Miss,1074.0,32.0,0.0,0.0,158.20835
female,1,Mrs,1076.0,48.0,1.0,0.0,63.3583
female,2,Miss,1121.0,19.5,1.0,1.0,24.5
female,2,Mrs,1123.5,29.0,0.0,0.0,26.0
female,3,Miss,1090.5,22.0,0.0,0.0,7.8792
female,3,Mrs,1051.0,28.0,1.0,1.0,14.4542
female,3,Ms,980.0,,0.0,0.0,7.75
male,1,Col,1058.5,50.0,0.5,0.0,128.0125
male,1,Dr,1185.0,53.0,1.0,1.0,81.8583


So the function we want to create to fill in the missing ages first checks the passenger's age, then their class, then their title and uses that info to determine what age to give them. If we haven't seen that title before, we should just plug in the median age.

 <div class="alert alert-block alert-warning">Note that we have to be super careful not to introduce any information from the test data into the training data. The point of a predictive machine learning model is to make accurate predictions about new data that is *unseen* during the training.</div>

In [12]:
from fancyimpute import KNN

In [106]:
from math import ceil
from fancyimpute import KNN


# Function to round approximations of missing ages to nearest 0.5             
def round_age(age):
    return round(age * 2) / 2


# function that fills in missing ages, true if filling training data, false otherwise
def age_filler(incomplete_data, Training=True, *size):
    
    for Sex in grouped_train_data["Sex"]:
        for Pclass in grouped_train_data["Pclass"]: 
            for Title in grouped_train_data["Title"]:
                
                # This is a data frame the has passengers sharing the same gender, class, and title
                subset_incomplete_data = incomplete_data.loc[(incomplete_data["Sex"]==Sex) 
                                                             & (incomplete_data["Pclass"]==Pclass) 
                                                             & (incomplete_data["Title"]==Title)]
                
                # Extract age and fare columns
                subset_incomplete_data = subset_incomplete_data[["Age", "Fare"]]
                
                # Array of global indexes of passengers in subset with missing ages
                missing_ages_global_index = subset_incomplete_data.index[subset_incomplete_data["Age"].isnull()].tolist()
                missing_ages_global_index.sort()
                
                # Array of local indexes of passengers in subset with missing ages
                missing_ages_local_index = np.where(subset_incomplete_data.isnull())[0]
                missing_ages_local_index.sort()
                
                # Get number of passengers in subset_train_data, use this number to calculate number of KNN neighbours 
                subset_incomplete_data_size = sub_train_data.shape()[0]

                # Use KNN to fill in missing ages based on similarities between passegers' fares
                # Returns a numpy array of ages
                subset_complete_data = KNN(k=ceil(subset_incomplete_data_size*0.05)).complete(subset_incomplete_data)
                
                # Condition to ensure that the training set is used to estimate ages of test set, but not the other way around
                if Training != True:
                    # This counts the number of missing ages in subset from training set
                    counter = 0
                    
                    # Remove indeces inside of training set because we do not want to influence the training set
                    for passenger_global_index in missing_ages_global_index:
                        if passenger_global_index <= size:
                            missing_ages_global_index.remove(passenger_global_index)
                            counter+=1
                    # Remove local indeces referring to missing ages in training set
                    missing_ages_local_index[counter:]
                    
                counter = 0
                # Iterate over estimates ages in subset and place them back in the original dataset
                # Use indexes to match values from subset to original data set
                for passenger_local_index in missing_ages_local_index:
                    passenger_estimated_age = round_age(subset_complete_data[passenger_local_index][0])
                    passenger_global_index = missing_ages_global_index[counter]
                    counter+=1                    
                    incomplete_data["Age"].iloc[passenger_global_index] = passenger_estimated_age
    return incomplete_data

                
 

In [None]:
               
# function that finds median ages
#def median_age_finder(row, grouped_median_table):
#    return grouped_median_table.loc["Sex", "Class", "Title"]


#Here we are going to split up the data by sex, class, and title and then we will use fancyimputer to give 
# each passenger with a missing age the age of the person that is most similar to him/her using KNN


#Store ids of people with missing ages so ids can be replaces 
                
                #display(subset_complete_data)
                # Round estimated ages to nearest 0.5 years
                #for passenger in subset_complete_data:
                
                # Extract ages from array of completed data and fill missing ages in original dataset
                #for local_index in missing_ages_local_index:
                    # Find 
                
                
                
                #print(sub_train_filled)
            
#for Sex in grouped_train_data["Sex"]:
#    for Pclass in grouped_train_data["Pclass"]: 
#        for Title in grouped_train_data["Title"]:  
            

            
# This is a data frame the has passengers sharing the same gender, class, and title
#sub_train_data = combined_data.loc[combined_data["Pclass"]==1]   
#combined_data.loc[:,["Sex", "Title"]]

#sub_train_data = sub_train_data.loc[sub_train_data["Pclass"]==1]
#sub_train_data = sub_train_data.loc[sub_train_data["Title"]=="Mr"] 
#sub_train_data = sub_train_data.loc[]

#sub_train_data = combined_data.loc[(combined_data["Sex"]=="male") & (combined_data["Pclass"]==1) & (combined_data["Title"]=="Mr")]
#sub_train_data = sub_train_data[["Age", "Fare"]]

# Get number of passengers in sub_train_data
#sub_train_data_size = sub_train_data.shape[0]
#print(sub_train_data_size)
#sub_train_data.info()

# Use KNN method to fill in missing ages based on similarities between passegers' fares
#sub_train_data_filled_2 = KNN(k=ceil(sub_train_data_size*0.05)).complete(sub_train_data)
#display(sub_train_data_


# fill in missing values in test data based on medians found in train data

#def age_filler()

#from fancyimpute import KNN
# let's try KNN
#my_trial = combined_data.head(891)[["Sex", "Pclass", "Title", "Fare", "Age"]]
#display(my_trial)
#my_trial_filled = KNN(k=10).complete(my_trial)

In [132]:
#for i in range(sub_train_data.shape[0]):
#    if sub_train_data["Age"].isnull():
#    print(i)

#sub_train_data["Age"].isnull()


#sub_train_data.iterrows()
#np.where(pd.isnull(sub_train_data))
#index = np.where(sub_train_data.isnull())[0].tolist()

#local_index_list = []
#for local_index in sub_train_data["Age"].isnull(): 
#    local_index_list += local_index




ages_global_index = sub_train_data.index[sub_train_data["Age"].isnull()].tolist()
ages_global_index.sort()

index = np.where(sub_train_data.isnull())[0]
index.sort()

display(ages_global_index)
display(index)
display(sub_train_data_filled_2)

c = 0
for passenger_local_index in index: #in local index
    passenger_estimated_age = sub_train_data_filled_2[passenger_local_index][0]
    passenger_global_index = ages_global_index[c]
    c+=1
    combined_data["Age"].iloc[passenger_global_index] = passenger_estimated_age

display(combined_data.iloc[41:67])


[41,
 55,
 64,
 146,
 148,
 168,
 185,
 191,
 205,
 266,
 270,
 284,
 290,
 295,
 298,
 351,
 475,
 507,
 527,
 557,
 602,
 633,
 711,
 740,
 793,
 815,
 839]

array([  6,   8,  20,  23,  31,  33,  34,  35,  40,  55,  61,  64,  69,
        73,  79,  90,  94,  99, 101, 102, 112, 125, 126, 131, 132, 140, 143], dtype=int64)

array([[  54.        ,   51.8625    ],
       [  28.        ,   35.5       ],
       [  19.        ,  263.        ],
       [  28.        ,   82.1708    ],
       [  42.        ,   52.        ],
       [  65.        ,   61.9792    ],
       [  36.50003126,   35.5       ],
       [  45.        ,   83.475     ],
       [  35.74663693,   27.7208    ],
       [  28.        ,   47.1       ],
       [  46.        ,   61.175     ],
       [  71.        ,   34.6542    ],
       [  23.        ,   63.3583    ],
       [  21.        ,   77.2875    ],
       [  47.        ,   52.        ],
       [  24.        ,  247.5208    ],
       [  54.        ,   77.2875    ],
       [  37.        ,   53.1       ],
       [  24.        ,   79.2       ],
       [  51.        ,   61.3792    ],
       [  44.01026402,   25.925     ],
       [  61.        ,   33.5       ],
       [  56.        ,   30.6958    ],
       [  54.9999311 ,   50.        ],
       [  45.        ,   26.55      ],
       [  40.        ,   

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
41,42,2,female,36.500031,1,0,11668,21.0,,S,Mrs
42,43,3,male,,0,0,349253,7.8958,,C,Mr
43,44,2,female,3.0,1,2,SC/Paris 2123,41.5792,,C,Miss
44,45,3,female,19.0,0,0,330958,7.8792,,Q,Miss
45,46,3,male,,0,0,S.C./A.4. 23567,8.05,,S,Mr
46,47,3,male,,1,0,370371,15.5,,Q,Mr
47,48,3,female,,0,0,14311,7.75,,Q,Miss
48,49,3,male,,2,0,2662,21.6792,,C,Mr
49,50,3,female,18.0,1,0,349237,17.8,,S,Mrs
50,51,3,male,7.0,4,1,3101295,39.6875,,S,Master


In [82]:
display(sub_train_data)

Unnamed: 0,Age,Fare
6,54.0,51.8625
23,28.0,35.5000
27,19.0,263.0000
34,28.0,82.1708
35,42.0,52.0000
54,65.0,61.9792
55,,35.5000
62,45.0,83.4750
64,,27.7208
83,28.0,47.1000


In [1]:
%store -r

In [2]:
display(grouped_test_data.median())

NameError: name 'grouped_test_data' is not defined