We know the following, from the competition descrition.

**VARIABLE DESCRIPTIONS**:

- survival: Survival
    (0 = No; 1 = Yes)
- pclass: Passenger Class
    (1 = 1st; 2 = 2nd; 3 = 3rd)
- name: Name
- sex: Sex
- age: Age
- sibsp: Number of Siblings/Spouses Aboard
- parch: Number of Parents/Children Aboard
- ticket: Ticket Number
- fare: Passenger Fare
- cabin: Cabin
- embarked: Port of Embarkation
     (C = Cherbourg; Q = Queenstown; S = Southampton)

**SPECIAL NOTES**:

- Pclass is a proxy for socio-economic status (SES) -  1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
- Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5
- With respect to the family relation variables (i.e. `sibsp` and `parch`)
some relations were ignored.  The following are the definitions used
for `sibsp` and `parch`.

  - Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
  - Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
  - Parent:   Mother or Father of Passenger Aboard Titanic
  - Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore `parch=0` for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.

# Notes

We don't have many data, so will use cross validation instead of a separate Validation set. This will give us a score not useful as a generalization error, but will use it anyway for selection (as it should be).

In [160]:
import pandas as pd
import numpy as np

test = pd.read_csv('../input/test.csv')
train = pd.read_csv('../input/train.csv')

## Simplifying final scoring via a downloaded version of survivors

Data comes from [http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets](), where some other data sets are available.

In [161]:
# There is an all empty record at the end
compiled_data = pd.read_csv('../input/titanic3.csv').dropna(how='all')

# The quotes in the names are wrong for `test.csv` and `train.csv`.
#  In general we don't care, so they are modified on the fly for the generation of the output test set
test_for_merger = test.copy()
test_for_merger['Name'] = test_for_merger['Name'].apply(lambda x: x.replace('"',''))
compiled_data['Name'] = compiled_data['name'].apply(lambda x: x.replace('"',''))

# We use both `Name` and `Ticket` to merge, because some passenger have duplicated names (we were getting wrong lengths before)
y_test = test_for_merger.merge(compiled_data, left_on=['Name', 'Ticket'], 
                    right_on=['Name', 'ticket'], how='left').rename(columns={'survived': 'Survived'})['Survived'].astype('int')

In [162]:
def generalization_error(prediction):
    print(((y_test - prediction) == 0).value_counts(normalize=True))

We will use the above data just to run the generalization error calculation, not for training, not for anything else.

In order to easily generate the CSV submission file when it is time to do so, we define the function below.

In [163]:
def csv_from_prediction(prediction, filename='submission.csv'):
    submission = pd.DataFrame(data={'PassengerId': test['PassengerId'], 'Survived': prediction.astype(int)})

    # This is what we do if we don't use 'index=False' below
    #submission.set_index('PassengerId', drop=True, inplace=True)
    submission.to_csv(filename, index=False)

Let's get our bases covered:

In [164]:
y_train = train['Survived']
X_train = train.drop('Survived', axis=1)
X_test = test.copy()

In [165]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 76.6+ KB


In [166]:
X_train['Embarked'].value_counts()
# We can see 'S' is the most common value

S    644
C    168
Q     77
Name: Embarked, dtype: int64

### Titles are collected from both sets, and ages set from global means

In [167]:
import re
titles = X_train['Name'].append(X_test['Name']).apply(lambda x: re.match('.*,([^\.]+)\..*', x)[1].strip())
title_equivalences = {'Don': 'Mr', 'Dona': 'Mrs', 'Mlle': 'Miss', 'Mme': 'Mrs', 'Jonkheer': 'Lady', 'the Countess': 'Lady'}

X_train['Title'] = X_train['Name'].apply(lambda x: re.match('.*,([^\.]+)\..*', x)[1].strip())
X_test['Title'] = X_test['Name'].apply(lambda x: re.match('.*,([^\.]+)\..*', x)[1].strip())
for k,v in title_equivalences.items():
    X_train.loc[X_train['Title'] == k, 'Title'] = v
    X_test.loc[X_test['Title'] == k, 'Title'] = v
    
#title_mapping = {v: k for k, v in enumerate(title_mapping)}  # A Dict
#title_mapping['Mme'] = title_mapping['Mrs']
#title_mapping['Mlle'] = title_mapping['Miss']
#title_mapping['the Countess'] = title_mapping['Lady']
#title_mapping['Don'] = title_mapping['Mr']
#title_mapping['Dona'] = title_mapping['Mrs']
#inverse_title_mapping = {v: k for k, v in title_mapping.items()}  # The inverse Dict

In [168]:
ticket_sizes = (X_train['Ticket'].append(X_test['Ticket'])).value_counts()
        
def data_munge(data):
    useless_fields = ['PassengerId']
    data.drop(useless_fields, axis=1, inplace=True, errors='ignore')
    
    # This is the most common value
    data['Embarked'].fillna('S', inplace=True)
    
    data = pd.get_dummies(data, columns=['Sex', 'Embarked', 'Pclass']) #.drop(['Sex_male', 'Embarked_C', 'Pclass_3'], axis=1)
    data['GroupSize'] = data['Ticket'].apply(lambda x: ticket_sizes[x])
    data['NameLength'] = data['Name'].apply(lambda x: len(x))
    data['FamilyOnBoard'] = data['SibSp'] + data['Parch']
        
    # Titles and ages were already compiled globally, but the `Title` was not dummified (and won't be)
    #data = pd.get_dummies(data, columns=['Title'])

    data['Kid'] = np.zeros(np.shape(data['Title']))
    data.loc[data['Title'] == 'Master', 'Kid'] = 1
    data.loc[data['Title'] == 'Miss', 'Kid'] = 1

    data['MumWithKid'] = np.zeros(np.shape(data['GroupSize']))
    data.loc[(data['Parch'] == 1) & (data['GroupSize'] == 3) & 
               (data['Sex_female'] == 1) & (data['Kid'] == 0), 'MumWithKid'] = 1
    
    data['CabinFirstLetter'] = data['Cabin'].apply(lambda x: x[0] if type(x)=='str' else '')
    data = pd.get_dummies(data, columns=['CabinFirstLetter']).drop(['Cabin', 'CabinFirstLetter_'], axis=1)
    
    data['Fare'] = data['Fare'].fillna(train['Fare'].mean())
    
    for k in titles.unique():
        mean_title_age = X_train['Age'].append(X_test['Age']).loc[X_train['Title'].append(X_test['Title'])==k].mean()
        data.loc[(data['Title'] == k) & data['Age'].isnull(), 'Age'] = mean_title_age
    return data

In [169]:
X_train = data_munge(X_train)

# Simplest attempts (baselines)

A random assignment should give a 50% accuracy.

In [170]:
random_prediction = np.random.randint(0,2,len(y_train))
print("{:.2f}% of accuracy".format(1- abs(y_train - random_prediction).sum()/len(y_train)))
((y_train - random_prediction) == 0).value_counts(normalize=True)

0.52% of accuracy


True     0.517396
False    0.482604
Name: Survived, dtype: float64

A majority assignment should be better because the results are unbalanced, and that would be the benchmark to beat.

In [171]:
y_train.value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

In [172]:
majority_prediction = np.full(len(y_train),0)
print("{:.2f}% of accuracy".format(1- abs(y_train - majority_prediction).sum()/len(y_train)))
((y_train - majority_prediction) == 0).value_counts(normalize=True)

0.62% of accuracy


True     0.616162
False    0.383838
Name: Survived, dtype: float64

# Process Test the same way

In [173]:
X_test = data_munge(X_test)

In [174]:
X_test.columns.values

array(['Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Title',
       'Sex_female', 'Embarked_Q', 'Embarked_S', 'Pclass_1', 'Pclass_2',
       'GroupSize', 'NameLength', 'FamilyOnBoard', 'Kid', 'MumWithKid'], dtype=object)

In [175]:
for i in np.setdiff1d(X_train.columns.values, X_test.columns.values):
    X_test[i] = 0

# Proper learning going on

In [176]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.tree import export_graphviz
import graphviz

In [177]:
features_ignored = ['Name', 'Title', 'Ticket', 'Parch', 'SibSp']

In [178]:
print(cross_val_score(DecisionTreeClassifier(), X_train.drop(features_ignored, axis=1), y_train))
cross_val_score(RandomForestClassifier(n_estimators=10), X_train.drop(features_ignored, axis=1), y_train).mean()

[ 0.73737374  0.75420875  0.76094276]


0.7991021324354658

In [179]:
cross_val_score(RandomForestClassifier(n_estimators=100, min_samples_split=19, min_samples_leaf=2),
                X_train.drop(features_ignored, axis=1), y_train).mean()

0.82379349046015715

In [180]:
model = RandomForestClassifier(n_estimators=100, min_samples_split=19, min_samples_leaf=2)
model.fit(X_train.drop(features_ignored, axis=1), y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=2,
            min_samples_split=19, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [181]:
prediction = model.predict(X_test.drop(features_ignored, axis=1))
generalization_error(prediction)
# I used to keep this result, but now it is quite bad (and it improved just by cleaning the notebook um??)

True     0.779904
False    0.220096
Name: Survived, dtype: float64


In [182]:
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train.drop(features_ignored, axis=1), y_train)
prediction = model.predict(X_test.drop(features_ignored, axis=1))
generalization_error(prediction)

True     0.763158
False    0.236842
Name: Survived, dtype: float64


In [183]:
#tree_dot = export_graphviz(model.estimators_[1], out_file=None, feature_names=X_test.drop(['Name', 'Ticket', 'Cabin'], axis=1).columns, filled=True)
#graphviz.Source(tree_dot, format="png")

In [184]:
%%time
param_grid = { 'n_estimators': np.arange(20,200,20),
               'min_samples_leaf': np.arange(1, 5),
               'min_samples_split' : np.arange(4, 18,2),
               'max_depth': np.arange(4,6, 1)}

param_grid = { 'n_estimators': [80],
               'min_samples_leaf': [3],
               'min_samples_split' : [6],
               'max_depth': [5]}
grid = GridSearchCV(RandomForestClassifier(warm_start=True, n_jobs=-1), param_grid=param_grid, cv=5)
grid.fit(X_train.drop(features_ignored, axis=1), y_train)

CPU times: user 784 ms, sys: 130 ms, total: 915 ms
Wall time: 1.98 s


In [185]:
grid.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=3,
            min_samples_split=6, min_weight_fraction_leaf=0.0,
            n_estimators=80, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0, warm_start=True)

In [186]:
grid.best_score_

0.82603815937149272

In [187]:
prediction = grid.predict(X_train.drop(features_ignored, axis=1))
((y_train - prediction) == 0).value_counts(normalize=True)

True     0.845118
False    0.154882
Name: Survived, dtype: float64

In [188]:
prediction = grid.predict(X_test.drop(features_ignored, axis=1))
generalization_error(prediction)

True     0.76555
False    0.23445
Name: Survived, dtype: float64


Let's refit with the parameters identified:

In [189]:
rfc = RandomForestClassifier(warm_start=True, n_jobs=-1,
                             n_estimators=50,
                             max_depth=4, max_features=5 ,
                             min_samples_leaf=1, min_samples_split=6)
rfc.fit(X_train.drop(features_ignored, axis=1), y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=4, max_features=5, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=6, min_weight_fraction_leaf=0.0,
            n_estimators=50, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0, warm_start=True)

In [190]:
prediction = rfc.predict(X_train.drop(features_ignored, axis=1))
print(((y_train - prediction) == 0).value_counts(normalize=True))

True     0.839506
False    0.160494
Name: Survived, dtype: float64


In [191]:
#prediction = grid.predict(X_test.drop(features_ignored, axis=1))
#csv_from_prediction(prediction, filename='submission_tree_3.csv')

prediction = rfc.predict(X_test.drop(features_ignored, axis=1))
csv_from_prediction(prediction, filename='submission_tree_4.csv')
generalization_error(prediction)

True     0.767943
False    0.232057
Name: Survived, dtype: float64


In [None]:
feature_importances = pd.DataFrame()
for est in rfc.estimators_:
    feature_importances = feature_importances.append(pd.Series(est.feature_importances_, 
                                        index= X_test.drop(features_ignored, axis=1).columns),
                             ignore_index=True)
feature_importances.mean().sort_values(ascending=False)

# A visualization from a previous Random Forest Classifier

In [None]:
tree = export_graphviz(model.estimators_[11], out_file=None,
                       feature_names=X_test.drop(features_ignored, axis=1).columns, filled=True)
graphviz.Source(tree, format="png")

In [None]:
feature_importances = pd.DataFrame()
for est in model.estimators_:
    feature_importances = feature_importances.append(pd.Series(est.feature_importances_, 
                                        index= X_test.drop(features_ignored, axis=1).columns),
                             ignore_index=True)

In [None]:
feature_importances.mean().sort_values(ascending=False)