# Titanic Survival Prediction

This notebook contains an approach to the problem of predicting whether a passenger of the Titanic survived the sinking, according to the dataset posted by Kaggle as one of its [training challenges](https://www.kaggle.com/c/titanic). It shall be used as a testbed for practical machine learning approaches in data preprocessing, model training and result evaluation. As such, the author's aim is to make this notebook as clear and didactic as possible.

The notebook shall also be used for Python 3 practice: most of my projects have used Python 2, and as a consequence some of the code may be archaic or not follow the optimal approach. However, that should be fixed with time.

## TO-DO
Describe the data.

## Preprocessing

In most real-world(-ish) situations, it is possible that the data sets don't have all the values for all the features registered. Even worse, it is possible that some values are mistakenly recorded or even tampered with. Those errors and omissions can impact the result of our model; therefore, one must have a way to handle such values.

It is also possible that some data is registered correctly, but not in a form that can be adequately understood by the model one wants to use. For example, a model might not be able to understand categorical values unless they are somehow encoded numerically. Those data transformations must also be performed before a model can be fit to the data.

Preprocessing the data is a task that can benefit from knowledge about the problem's domain.

This notebook shall use the `pandas` library for dealing with the dataset.

In [4]:
import pandas as pd

In [53]:
df_tr = pd.read_csv('./data/train.csv')

In [54]:
df_tr.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [55]:
df_tr.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


As one can see from the `count` row of the description above, some values of the `Age` field are missing: only 714 of the 891 rows have one.
A trivial approach for this problem would be to drop the rows without an `Age` value, but,
given that 177 rows (about 20% of the set) fit the condition, this might not be the best strategy.
Instead, one could fill the missing values with the column average.
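
The two options can be compared on a toy frame. The sketch below uses made-up values (not the Titanic data), and the names `toy`, `dropped`, and `filled` are introduced only for illustration; it shows the trade-off between losing rows and keeping them with an estimated value.

```python
import pandas as pd

# Toy data, invented for illustration: two of the five ages are missing.
toy = pd.DataFrame({'Age': [20.0, None, 40.0, None, 30.0]})

# Option 1: drop the rows without an Age value (loses 2 of 5 rows).
dropped = toy.dropna(subset=['Age'])

# Option 2: keep every row and fill the gaps with the column average.
filled = toy['Age'].fillna(toy['Age'].mean())

print(len(dropped))     # number of rows that survive the drop
print(filled.tolist())  # ages after mean-filling
```

Mean-filling keeps all rows, at the price of giving every incomplete row exactly the same age.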

However, as the line below shows, passengers from different classes and sexes have different age profiles.
The same happens when one takes into account the port where each passenger embarked,
but here one has to take into account the size of the resulting segments: 
since some of them have so few passengers, replacing the missing values might introduce distortions.

In [56]:
df_tr.groupby(['Embarked', 'Pclass', 'Sex']).apply(lambda x: x['Age'].mean())

Embarked  Pclass  Sex   
C         1       female    36.052632
                  male      40.111111
          2       female    19.142857
                  male      25.937500
          3       female    14.062500
                  male      25.016800
Q         1       female    33.000000
                  male      44.000000
          2       female    30.000000
                  male      57.000000
          3       female    22.850000
                  male      28.142857
S         1       female    32.704545
                  male      41.897188
          2       female    29.719697
                  male      30.875889
          3       female    23.223684
                  male      26.574766
dtype: float64

In [57]:
df_tr.groupby(['Embarked', 'Pclass']).count()

| Embarked | Pclass | PassengerId | Survived | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C | 1 | 85 | 85 | 85 | 85 | 74 | 85 | 85 | 85 | 85 | 66 |
| C | 2 | 17 | 17 | 17 | 17 | 15 | 17 | 17 | 17 | 17 | 2 |
| C | 3 | 66 | 66 | 66 | 66 | 41 | 66 | 66 | 66 | 66 | 1 |
| Q | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Q | 2 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 1 |
| Q | 3 | 72 | 72 | 72 | 72 | 24 | 72 | 72 | 72 | 72 | 1 |
| S | 1 | 127 | 127 | 127 | 127 | 108 | 127 | 127 | 127 | 127 | 106 |
| S | 2 | 164 | 164 | 164 | 164 | 156 | 164 | 164 | 164 | 164 | 13 |
| S | 3 | 353 | 353 | 353 | 353 | 290 | 353 | 353 | 353 | 353 | 10 |


There is another popular approach for filling missing values: 
from a brief inspection of the `Name` column, one can see that all names include honorifics.
Since those titles reflect a person's age and ticket class, they may be used to fill out the blanks in the `Age` column.
To do so, a new `Title` column is created, taking the first part of the name after the comma and then considering that
the title ends with a dot.

While some titles occur at most a handful of times, those "rare" titles usually have a defined `Age` value, which
means they won't interfere with the filling process.
As a consequence, the title-based filling, taking into account the passenger's class, seems more appropriate than the previous approach.

In [58]:
df_tr['Title'] = df_tr['Name'].map(lambda x: x.split(', ')[1].split('.')[0])
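
To make the parsing rule concrete, the following sketch applies the same split to two names taken from the first rows of the training set. The helper name `extract_title` is introduced here only for illustration.

```python
def extract_title(name):
    """Return the honorific between the first ', ' and the following '.'."""
    # 'Braund, Mr. Owen Harris' -> ['Braund', 'Mr. Owen Harris'] -> 'Mr'
    return name.split(', ')[1].split('.')[0]

print(extract_title('Braund, Mr. Owen Harris'))  # Mr
print(extract_title('Heikkinen, Miss. Laina'))   # Miss
```

Note that this rule assumes every name contains a comma followed by a space; a name without one would raise an `IndexError`.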

In [59]:
df_tr.groupby(['Title', 'Pclass']).count()

| Title | Pclass | PassengerId | Survived | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Capt | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Col | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 2 |
| Don | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| Dr | 1 | 5 | 5 | 5 | 5 | 4 | 5 | 5 | 5 | 5 | 3 | 5 |
| Dr | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 2 |
| Jonkheer | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| Lady | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Major | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Master | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Master | 2 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 3 | 9 |


In [60]:
df_tr['Age_filled'] = df_tr.groupby(['Title', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.mean()))
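
One caveat of the group-wise fill: if every `Age` in a (`Title`, `Pclass`) group is missing, the group mean is itself undefined and the gap survives. The sketch below reproduces this on invented data and adds a global-mean fallback, the same idea applied to the test set further on. The names `toy`, `by_group`, and `with_fallback` are illustrative.

```python
import pandas as pd

# Invented data: the single 'Dr' passenger has no age, so that group's mean is NaN.
toy = pd.DataFrame({
    'Title':  ['Mr', 'Mr', 'Mr', 'Dr'],
    'Pclass': [3, 3, 3, 1],
    'Age':    [20.0, None, 30.0, None],
})

# Fill each gap with the mean of its (Title, Pclass) group.
by_group = toy.groupby(['Title', 'Pclass'])['Age'].transform(
    lambda x: x.fillna(x.mean()))

# Fallback: any gap that survives gets the overall mean of the filled column.
with_fallback = by_group.fillna(by_group.mean())

print(by_group.tolist())
print(with_fallback.tolist())
```

Checking `df_tr['Age_filled'].isna().sum()` after the fill is a cheap way to find out whether such a fallback is needed.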

Many classifiers deal only with numerical values. 
However, some of the columns are expressed as categorical values, that is, they may assume one of a
handful of discrete values (such as `Sex`, coded as either `male` or `female`).
That happens even with some numerical values (such as `Pclass`, which ranges from 1 to 3).

To use such values with those classifiers, we must encode the categorical variables in a way that does not attach
spurious meaning to them (for example, it makes no sense to do arithmetic with `Pclass` numbers).
A common solution, which we will use, is to create dummy variables, that is,
create a new column for each possible value of the categorical variable and fill it with 1 when the variable has
the corresponding value and 0 otherwise.

This approach might be too expensive when dealing with many categorical columns and/or columns which may take one of
many values.
In our first approach, we will dummify only the `Sex`, `Embarked` and `Pclass` values, which can be done by adding a handful of
new columns.

In [61]:
df_tr = pd.get_dummies(df_tr, columns=['Pclass', 'Sex', 'Embarked'])

In [62]:
df_tr.head()

| | PassengerId | Survived | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Title | Age_filled | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | Mr | 22.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | Mrs | 38.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 3 | 1 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | Miss | 26.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | Mrs | 35.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 5 | 0 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.05 | | Mr | 35.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |


The same preprocessing should be applied to the data on our test set. 
Unlike the training data, the test set has a row with no `Fare` value.
To fill it, we use the same logic employed for the `Age` field, but this time we fill it in-place.

In [98]:
df_test = pd.read_csv('./data/test.csv')

In [99]:
df_test['Title'] = df_test['Name'].map(lambda x: x.split(', ')[1].split('.')[0])

In [100]:
df_test['Age_filled'] = df_test.groupby(['Title', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.mean()))
df_test['Age_filled'] = df_test['Age_filled'].fillna(df_test['Age_filled'].mean())

In [101]:
df_test['Fare'] = df_test.groupby(['Title', 'Pclass'])['Fare'].transform(lambda x: x.fillna(x.mean()))
df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].mean())

In [102]:
df_test = pd.get_dummies(df_test, columns=['Pclass', 'Sex', 'Embarked'])
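
Since the same steps are applied to both sets, they could be collected in one function. The sketch below is one possible arrangement, not the notebook's actual code: `preprocess` is a hypothetical helper, the frames are invented, and the `reindex` call is an added safeguard for the case where a category is absent from one set, which would otherwise leave the dummy columns mismatched.

```python
import pandas as pd

def preprocess(df):
    """Apply the Title / Age_filled / dummy steps to a copy of df."""
    df = df.copy()
    df['Title'] = df['Name'].map(lambda x: x.split(', ')[1].split('.')[0])
    df['Age_filled'] = df.groupby(['Title', 'Pclass'])['Age'].transform(
        lambda x: x.fillna(x.mean()))
    # Global-mean fallback for groups whose ages are all missing.
    df['Age_filled'] = df['Age_filled'].fillna(df['Age_filled'].mean())
    return pd.get_dummies(df, columns=['Pclass', 'Sex', 'Embarked'])

# Invented rows for illustration; the toy test set has no 'Q' port.
train = pd.DataFrame({
    'Name': ['A, Mr. X', 'B, Mrs. Y', 'C, Mr. Z'],
    'Pclass': [3, 1, 3], 'Sex': ['male', 'female', 'male'],
    'Age': [22.0, 38.0, None], 'Embarked': ['S', 'C', 'Q'],
})
test = pd.DataFrame({
    'Name': ['D, Mr. W', 'E, Miss. V'],
    'Pclass': [3, 1], 'Sex': ['male', 'female'],
    'Age': [None, 19.0], 'Embarked': ['S', 'C'],
})

train_p = preprocess(train)
# Give the test frame exactly the training columns; missing dummies become 0.
test_p = preprocess(test).reindex(columns=train_p.columns, fill_value=0)

print(list(test_p.columns) == list(train_p.columns))
```

With the real data, one would reindex only over the feature columns, since `Survived` exists only in the training set.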

# Model Training

After transforming the original data, one can start testing models, using the `Survived` field as the prediction target.
Our first attempt will use a random forest classifier, a method that trains an ensemble of decision trees,
each on a bootstrap sample of the rows and with a random subset of the features considered at each split,
and classifies a row according to the majority decision of the trees.
After that, other models shall be implemented in order to find the most adequate for our purposes.

All those models will be trained on a subset of the features available in the training set:
* the `PassengerId` is irrelevant for training
* the `Survived` field is used as the output, and, therefore, cannot also be an input
* the `Age` field will be replaced by its `Age_filled` variant
* the `Sex`, `Embarked`, and `Pclass` fields will be replaced by their dummified variants
* the `Name`, `Title`, `Ticket`, and `Cabin` fields are not yet incorporated into the model

When possible, the `scikit-learn` implementation of the relevant models will be used.

In [63]:
excluded_cols = ['PassengerId', 'Survived', 'Age', 'Sex', 'Embarked', 'Pclass', 'Name', 'Title', 'Ticket', 'Cabin']
cols = [x for x in df_tr.columns if x not in excluded_cols]

In [64]:
cols

['SibSp',
 'Parch',
 'Fare',
 'Age_filled',
 'Pclass_1',
 'Pclass_2',
 'Pclass_3',
 'Sex_female',
 'Sex_male',
 'Embarked_C',
 'Embarked_Q',
 'Embarked_S']

## Random Forest

### Parameter Calibration

A random forest is controlled by a series of parameters, both for the forest itself and for each of its trees.
In order to choose adequate values, we shall use a grid search with 10-fold cross-validation, evaluating model
performance over each training/validation set in order to find optimal parameter values.
Since the grid search fits the model at each point of the grid, once per fold, one must choose carefully which parameters
are to be tested and which values are to be tested for each one.

Our first relevant parameter is `n_estimators`, that is, the number of trees in the forest.
Since our feature list `cols` has 12 features, we have decided to test three possible values: 6, 12, and 24.
After that, one must decide the `criterion` used to evaluate whether a node split is worth it.
For classification trees, one may use either the Gini impurity or the information gain from the additional node
as criteria, and we shall test both possibilities.
The depth of each tree is controlled by the `max_depth` parameter, and we shall test whether the model performs better
with a maximum depth of 6 or with no specified maximum depth.
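
The cost of this search is easy to estimate: the grid has 3 × 2 × 2 = 12 points, and each is fitted once per fold. The sketch below enumerates the combinations with the standard library only; the dictionary mirrors the values described above, and the variable names are illustrative.

```python
from itertools import product

# Candidate values described above.
grid = {
    'n_estimators': [6, 12, 24],
    'criterion': ['gini', 'entropy'],
    'max_depth': [6, None],
}
n_folds = 10

# Every combination of one value per parameter is one point of the grid.
points = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each point is fitted once per fold during cross-validation.
n_fits = len(points) * n_folds
print(len(points), n_fits)  # 12 120
```

Adding a fourth parameter with three candidate values would triple this count, which is why the grid is kept small.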

It would be possible to test the `max_features` parameter, which controls the number of features taken
into account when searching for the best split, but the default value (`'auto'`, that is, the square root of the number of features) is enough for our purposes.

In [18]:
from sklearn.ensemble import RandomForestClassifier
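# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV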

In [21]:
rf_grid = {
    'n_estimators': [6, 12, 24],
    'criterion': ['gini', 'entropy'],
    'max_depth': [6, None]
}

In [47]:
rf_model = RandomForestClassifier()

In [48]:
rf_cv = GridSearchCV(rf_model, rf_grid, cv=10)

In [65]:
# .ix was removed in pandas 1.0; .loc performs the same label-based selection here
rf_cv.fit(df_tr.loc[:, cols], df_tr.loc[:, 'Survived'])

GridSearchCV(cv=10, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_depth': [6, None], 'n_estimators': [6, 12, 24], 'criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [66]:
rf_cv.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=6, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=12, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [67]:
rf_cv.best_score_

0.83389450056116721
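
Besides the best estimator and its score, the fitted search object exposes the winning parameter values and the per-candidate scores. The sketch below runs the same kind of search on synthetic data from `make_classification` (an assumption, so that the block runs without the Titanic files) and uses the `sklearn.model_selection` module of recent scikit-learn versions; `random_state=0` is added only for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (not the Titanic set).
X, y = make_classification(n_samples=200, n_features=12, random_state=0)

grid = {
    'n_estimators': [6, 12, 24],
    'criterion': ['gini', 'entropy'],
    'max_depth': [6, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=10)
search.fit(X, y)

# Winning parameter combination and its mean cross-validated accuracy.
print(search.best_params_)
print(search.best_score_)

# One mean validation score per grid point (12 in total).
print(len(search.cv_results_['mean_test_score']))
```

Comparing the entries of `cv_results_['mean_test_score']` shows how close the runner-up combinations are, which helps judge whether the chosen parameters matter much.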

After finding the best parameters for our model, one should see how it fares when attempting to predict
the data from the test set.
Then, we save the predictions to a CSV file.

In [105]:
# .ix was removed in pandas 1.0; .loc performs the same label-based selection here
df_test['Survived'] = rf_cv.predict(df_test.loc[:, cols])

In [110]:
df_test[['PassengerId', 'Survived']].to_csv('./output/rf.csv', index=False)

The predictions from this model obtained a score of 0.80861.
While this is enough to land a top 15% placement, it can still be improved, and that's the next goal here.
