# Machine Learning tutorial: Predicting Titanic passenger survival 


This tutorial is adapted from https://www.kaggle.com/omarelgabry/titanic/a-journey-through-titanic. It was forked and modified by MSc. Benjamin Tovar (https://www.linkedin.com/in/benjamintovarcis/) in November 2015.

## Introduction

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Source: https://www.kaggle.com/c/titanic

## Data structure and description

<code>
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
</code>

## Tutorial Goals

This tutorial serves as an introduction to scikit-learn, an awesome machine learning library for Python: http://scikit-learn.org/stable/index.html

### Awesome tutorials and more information:
http://www.astroml.org/sklearn_tutorial/ 

### Prepare workspace

In [70]:
# Import libraries
# pandas
import pandas as pd
from pandas import Series,DataFrame
# numpy
import numpy as np
# Machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

### Load data

In [71]:
# load the train and test CSV files as DataFrames
titanic_df = pd.read_csv("data/train.csv", dtype={"Age": np.float64})
test_df    = pd.read_csv("data/test.csv", dtype={"Age": np.float64})

# preview the data
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


### Drop columns

In [72]:
# drop columns that won't be useful in analysis and prediction
# (PassengerId is kept in test_df because the submission file needs it)
titanic_df = titanic_df.drop(['PassengerId','Name','Ticket'], axis=1)
test_df    = test_df.drop(['Name','Ticket'], axis=1)

### Processing a meta-feature (artificial feature)

In [73]:
# Sex

# As already known, children (age < ~16) aboard seem to have a high chance of survival.
# So, we can classify passengers as male, female, or child
def get_person(passenger):
    age,sex = passenger
    return 'child' if age < 16 else sex
    
titanic_df['Person'] = titanic_df[['Age','Sex']].apply(get_person,axis=1)
test_df['Person']    = test_df[['Age','Sex']].apply(get_person,axis=1)

# create dummy variables for the Person column
# NOTE: get_dummies orders its columns alphabetically (child, female, male),
# so the new names must follow that same order
person_dummies_titanic  = pd.get_dummies(titanic_df['Person'])
person_dummies_titanic.columns = ['Child','Female','Male']
# optionally drop Male, the group with the lowest survival rate, as the baseline category
# person_dummies_titanic.drop(['Male'], axis=1, inplace=True)

person_dummies_test  = pd.get_dummies(test_df['Person'])
person_dummies_test.columns = ['Child','Female','Male']
# person_dummies_test.drop(['Male'], axis=1, inplace=True)

titanic_df = titanic_df.join(person_dummies_titanic)
test_df    = test_df.join(person_dummies_test)
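
The column order produced by `pd.get_dummies` matters when the columns are renamed afterwards. The following is a minimal sketch on a hypothetical four-row toy frame (not rows from `train.csv`), assuming a pandas version where `get_dummies` accepts `dtype`. It also shows that a passenger with a missing age keeps their sex label, because a comparison with NaN evaluates to False.

```python
import numpy as np
import pandas as pd

# Toy passengers (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 4.0, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

def get_person(passenger):
    age, sex = passenger
    # NaN < 16 is False, so passengers with unknown age keep their sex label
    return "child" if age < 16 else sex

toy["Person"] = toy[["Age", "Sex"]].apply(get_person, axis=1)

# get_dummies orders the new columns by sorted category value
dummies = pd.get_dummies(toy["Person"], dtype=int)
print(list(dummies.columns))

# Any renaming must follow that same order
dummies.columns = ["Child", "Female", "Male"]
print(toy.join(dummies))
```

Checking the joined frame against the original `Sex` and `Age` columns is a quick way to confirm that each indicator column carries the label it should.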

### Exploring likelihood of survival given gender and age -> `Person` feature

In [74]:
# Count the number of persons for each category (child, adult female or adult male)
family_survived = titanic_df[["Person", "Survived"]].groupby(['Person'],as_index=False)
family_survived.count()

Unnamed: 0,Person,Survived
0,child,83
1,female,271
2,male,537


In [75]:
# Count the number of persons that survived given each category
family_survived.Survived.sum()

Unnamed: 0,Person,Survived
0,child,49
1,female,205
2,male,88


In [76]:
# Count the proportion in % of persons that survived vs count for each category
family_survived_perc = family_survived.Survived.sum().Survived / family_survived.Survived.count().Survived * 100
family_survived_perc_df = pd.DataFrame(family_survived_perc)
family_survived_perc_df.columns = ["Survived %"]
family_survived_perc_df.index = ["Child","Female","Male"]
family_survived_perc_df

Unnamed: 0,Survived %
Child,59.036145
Female,75.645756
Male,16.387337
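
Because `Survived` is coded as 0/1, the survival percentage per group is simply the group mean times 100. A minimal sketch of that shortcut, using a hypothetical toy frame rather than the real training data:

```python
import pandas as pd

# Toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "Person":   ["child", "child", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 0, 1, 0],
})

grouped = toy.groupby("Person")["Survived"]

# The mean of a 0/1 column equals survivors / total
survived_pct = grouped.mean() * 100

# Same result as the explicit sum / count computation
explicit_pct = grouped.sum() / grouped.count() * 100

print(survived_pct)
```

The same one-liner should work for any grouping column, including `Pclass` in the task that follows.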


## Task 1: Perform the same analysis for Class of passenger: `Pclass` feature

In [77]:
# Compute the total count of persons belonging to each class
pclass_survived = titanic_df[["Pclass", "Survived"]].groupby(['Pclass'],as_index=False)
pclass_survived.count()
# OUTPUT MUST BE:
#  Pclass Survived
#  1       216
#  2       184
#  3       491

Unnamed: 0,Pclass,Survived
0,1,216
1,2,184
2,3,491


In [78]:
# Compute the count of persons that survived belonging to each class 
pclass_survived.Survived.sum()
# OUTPUT MUST BE:
#  Pclass Survived
#  1       136
#  2       87
#  3       119

Unnamed: 0,Pclass,Survived
0,1,136
1,2,87
2,3,119


In [79]:
# Compute the % of persons that survived versus the total for each class
pclass_survived_perc = pclass_survived.Survived.sum().Survived / pclass_survived.Survived.count().Survived * 100
pclass_survived_perc_df = pd.DataFrame(pclass_survived_perc)
pclass_survived_perc_df.columns = ["Survived %"]
pclass_survived_perc_df.index = ["1st","2nd","3rd"]
pclass_survived_perc_df
# OUTPUT MUST BE:
#       Survived %
#  1st  62.962963
#  2nd  47.282609
#  3rd  24.236253

Unnamed: 0,Survived %
1st,62.962963
2nd,47.282609
3rd,24.236253


### Task 1 evaluation:

In [80]:
# Evaluate answer!
# take the 3rd-class percentage as a scalar (third row of the table)
x = pclass_survived_perc_df["Survived %"].iloc[2]
if round(x) == 24:
    print("Success!, Task 1 completed and one step closer to chilaquiles")
else:
    print("Sorry, try again or there won't be chilaquiles for ya!")

Success!, Task 1 completed and one step closer to chilaquiles


## Constructing the classifier (Preparing data)

### Drop features

In [81]:
# set features to drop
features_to_drop = ["Person","Sex","Cabin","Embarked"]
# drop features for train and test
titanic_df.drop(features_to_drop,axis=1,inplace=True)
test_df.drop(features_to_drop,axis=1,inplace=True)

In [82]:
# Explore train set
titanic_df.head(n=5)

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Child,Female,Male
0,0,3,22,1,0,7.25,0,0,1
1,1,1,38,1,0,71.2833,0,1,0
2,1,3,26,0,0,7.925,0,1,0
3,1,1,35,1,0,53.1,0,1,0
4,0,3,35,0,0,8.05,0,0,1


## Define training and testing sets

In [83]:
# Train set (dropping rows with NULL values)
X_train = titanic_df.dropna()
# set train set labels (event of survival: 1, event of no-survival: 0)
Y_train = X_train["Survived"]
# drop labels from training set, given that the label is the dependent feature
# and we want to predict the label given the predictors (independent features)
X_train = X_train.drop("Survived",axis=1)

# Test set (dropping rows with NULL values)
# PassengerId is an identifier, not a predictor, so remove it from the features;
# it is still available in test_df for the submission file
X_test  = test_df.drop("PassengerId",axis=1).dropna().copy()
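
Dropping rows with missing values shrinks the data, and the dropped test rows get no prediction at all. The sketch below uses a hypothetical toy frame to show how many rows survive `dropna()` and that the remaining feature rows still line up with their `PassengerId` values through the shared index.

```python
import numpy as np
import pandas as pd

# Toy test set (hypothetical values, not taken from test.csv)
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Age":  [34.5, np.nan, 62.0, 27.0],
    "Fare": [7.83, 7.00, np.nan, 8.66],
})

# Feature rows that remain after removing the identifier and incomplete rows
X_toy = toy_test.drop("PassengerId", axis=1).dropna()

# Identifiers of the same rows, recovered from the original frame
ids = toy_test.dropna()["PassengerId"]

print(len(toy_test), "rows before,", len(X_toy), "rows after dropna()")
print(ids.tolist())
```

This alignment holds only as long as `PassengerId` itself has no missing values, which is an assumption worth checking on real data.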

### Explore training set

In [84]:
X_train.head(n=5)

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Child,Female,Male
0,3,22,1,0,7.25,0,0,1
1,1,38,1,0,71.2833,0,1,0
2,3,26,0,0,7.925,0,1,0
3,1,35,1,0,53.1,0,1,0
4,3,35,0,0,8.05,0,0,1


### Explore test set

In [85]:
X_test.head(n=5)

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Child,Female,Male
0,3,34.5,0,0,7.8292,0,0,1
1,3,47.0,1,0,7.0,0,1,0
2,2,62.0,0,0,9.6875,0,0,1
3,3,27.0,0,0,8.6625,0,0,1
4,3,22.0,1,1,12.2875,0,1,0


## Constructing the classifier (Machine Learning models)

### Random Forest 

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (default).

In [86]:
# create a classifier object initialized with our parameters
random_forest = RandomForestClassifier(n_estimators=100)
# fit labels to predictors (in other words, use predictors to infer label)
random_forest.fit(X_train, Y_train)
# Make predictions on predictors from test set
# a prediction has a value of 1 when the model predicts that a person will survive
# and 0 otherwise
Y_pred_rf = random_forest.predict(X_test)
# compute the mean accuracy on the training set itself
# (an optimistic figure, not a measure of how well the model generalizes)
random_forest.score(X_train, Y_train)

0.98599439775910369
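
A score computed on the same rows the model was trained on tends to be optimistic. One common way to get a more honest estimate is to hold out part of the labelled data. The sketch below is an illustration only: it uses synthetic data from `make_classification` as a stand-in for the Titanic features, and it assumes a scikit-learn version that provides `sklearn.model_selection`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features (assumption: 8 numeric columns)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Keep 25% of the labelled rows aside; the model never sees them during fit
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_fit, y_fit)

train_acc = forest.score(X_fit, y_fit)  # accuracy on rows used for fitting
val_acc = forest.score(X_val, y_val)    # accuracy on held-out rows
print("training accuracy:  ", train_acc)
print("validation accuracy:", val_acc)
```

The gap between the two numbers gives a rough idea of how much the training score overstates performance.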

### Nearest Neighbors Classification

http://scikit-learn.org/stable/modules/neighbors.html

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.

scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.

The k-neighbors classification in KNeighborsClassifier is the more commonly used of the two techniques. The optimal choice of the value k is highly data-dependent: in general a larger k suppresses the effects of noise, but makes the classification boundaries less distinct.
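
The majority vote described above can be sketched in a few lines of NumPy. This is a simplified illustration with a hypothetical helper `knn_predict` and toy points, not the implementation used by `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and keep the most frequent one
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters (hypothetical data)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5])))
```

Because the vote relies on raw distances, features on large scales (such as `Fare`) can dominate features on small scales (such as `Pclass`), which is one reason scaling is often considered before using k-NN.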

In [87]:
# create a classifier object initialized with our parameters
knn = KNeighborsClassifier(n_neighbors = 3)
# fit labels to predictors (in other words, use predictors to infer label)
knn.fit(X_train, Y_train)
# Make predictions on predictors from test set
# a prediction has a value of 1 when the model predicts that a person will survive
# and 0 otherwise
Y_pred_knn = knn.predict(X_test)
# compute the mean accuracy on the training set itself
# (an optimistic figure, not a measure of how well the model generalizes)
knn.score(X_train, Y_train)

0.83193277310924374

### Support Vector Machines

http://scikit-learn.org/stable/modules/svm.html

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.

The advantages of support vector machines are:

- Effective in high dimensional spaces.
- Still effective in cases where number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

- If the number of features is much greater than the number of samples, the method is likely to give poor performance.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
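
Two of the points above, the choice of kernel and the use of only a subset of training points, can be inspected directly on a fitted model. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features (assumption: 8 numeric columns)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel)
    model.fit(X, y)
    # support_vectors_ holds the training points the decision function depends on
    n_support = model.support_vectors_.shape[0]
    print(kernel, "kernel:", n_support, "support vectors out of", X.shape[0],
          "| training accuracy:", model.score(X, y))
```

Comparing kernels this way is only a first look; a held-out set would be needed to say which one generalizes better.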

In [88]:
# create a classifier object initialized with our parameters
# SVC -> Support Vector Classification
svc = SVC()
# fit labels to predictors (in other words, use predictors to infer label)
svc.fit(X_train, Y_train)
# Make predictions on predictors from test set
# a prediction has a value of 1 when the model predicts that a person will survive
# and 0 otherwise
Y_pred_svc = svc.predict(X_test)
# compute the mean accuracy on the training set itself
# (an optimistic figure, not a measure of how well the model generalizes)
svc.score(X_train, Y_train)

0.90756302521008403

### Compute Benchmark score

In [89]:
# benchmark score: fraction of the 3 models that predict survival
# (0.0 = none of them, 1.0 = all three)
benchmark_score = (Y_pred_rf + Y_pred_knn + Y_pred_svc) / float(3)
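
The averaged score can also be turned into a single 0/1 prediction by majority vote. A minimal sketch with hypothetical prediction arrays (the names `pred_rf`, `pred_knn` and `pred_svc` are placeholders, not the notebook's variables):

```python
import numpy as np

# Hypothetical 0/1 predictions from three models for five passengers
pred_rf  = np.array([0, 1, 1, 0, 1])
pred_knn = np.array([0, 1, 0, 1, 1])
pred_svc = np.array([0, 1, 1, 1, 0])

# Fraction of models that predict survival for each passenger
agreement = (pred_rf + pred_knn + pred_svc) / 3.0

# With three voters, a fraction of at least 0.5 means two or more votes for survival
majority = (agreement >= 0.5).astype(int)

print(agreement)
print(majority)
```

With an odd number of models there are no ties, so the threshold of 0.5 is unambiguous.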

## Export prediction output

In [90]:
# construct data frame
submission = pd.DataFrame({
        "1_PassengerId": test_df.dropna().PassengerId,
        "2_Survived_(RandomForest)": Y_pred_rf,
        "3_Survived_(KNN)": Y_pred_knn,
        "4_Survived_(SVM)": Y_pred_svc,
        "5_Benchmark_score": benchmark_score
        
    })
# export to CSV
submission.to_csv('titanic_prediction_benchmark.csv', index=False)

In [91]:
# Show output in notebook
submission.head(n=10)

Unnamed: 0,1_PassengerId,2_Survived_(RandomForest),3_Survived_(KNN),4_Survived_(SVM),5_Benchmark_score
0,892,0,0,0,0.0
1,893,0,0,0,0.0
2,894,1,1,1,1.0
3,895,1,1,0,0.666667
4,896,0,0,0,0.0
5,897,0,1,1,0.666667
6,898,0,0,0,0.0
7,899,0,1,1,0.666667
8,900,1,0,1,0.666667
9,901,0,1,1,0.666667
