# Titanic Survival Prediction
Author: Georg Brandmayr

This notebook predicts the survival of Titanic passengers. A training dataset `train.csv` with 918 passengers, including their survival status, is used to learn a prediction function. A test set `test.csv` contains 418 passengers whose survival is to be predicted.

Your results must be reproducible. Please **don't overlook the rules for `random_state`** in the body of this notebook to obtain full credit.

In [None]:
import pandas as pd
print('pandas', pd.__version__)
import numpy as np
print('numpy', np.__version__)
import seaborn as sns
from pathlib import Path


## Obtain and explore

In [None]:
data_path = Path.cwd()

df = pd.read_csv(data_path/'train.csv', index_col=0)
df[:3]

In [None]:
df.Survived.value_counts().plot.pie(autopct='%.1f%%', 
                                    wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'}, 
                                    textprops=dict(size='x-large', color='white', fontweight='bold')).set_title('Survived');

In [None]:
df.groupby(['Pclass']).Survived.agg(Survivor_ratio='mean', Passengers='size')

In [None]:
#df.set_index('Survived', append=True)['Age'].unstack().boxplot()
# pass the frame by keyword so the call does not depend on the positional order
sns.boxplot(data=df, x='Survived', y='Age');

## Model on prior probability


In [None]:
import sklearn
print('sklearn', sklearn.__version__)
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold


In [None]:
# a pessimist can never be disappointed
class Pessimist(BaseEstimator, ClassifierMixin):
    """Baseline that predicts 0 for every passenger, whatever the features."""
    def fit(self, X, y):
        # nothing to learn: the prediction ignores the data
        return self
    def predict(self, X):
        # one zero per row of X
        return np.zeros(len(X), dtype='int8')
    
model = Pessimist()
model

In [None]:
y_train = df['Survived']
X_train = df[df.columns[1:]]

yh = model.predict(X_train)

score = accuracy_score(y_train, yh)
score

## Test model
The test targets are not available. To obtain the test score, the predictions for the test data must be submitted.

In [None]:
# load the test data
X_test = pd.read_csv(data_path/'test.csv', index_col=0)
X_test

In [None]:
yh = model.predict(X_test)
yh = pd.Series(yh, X_test.index, name='Survived')

Save the result as CSV for submission.

In [None]:
yh.to_csv(data_path/'submission_test.csv')
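
To check what the submission file contains before uploading, the following minimal sketch writes a small series to an in-memory buffer and reads it back. The passenger numbers and the index name `PassengerId` are assumptions for illustration; the required layout is defined by the submission system, not by this sketch.

```python
import io
import pandas as pd

# Hypothetical predictions for three passengers; the index name is an assumption
demo_index = pd.Index([1001, 1002, 1003], name='PassengerId')
demo_predictions = pd.Series([0, 1, 0], index=demo_index, name='Survived')

# Series.to_csv writes the index and a header row by default
buffer = io.StringIO()
demo_predictions.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm that the layout survives a round trip
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)['Survived']
```

The same round trip can be applied to the real file by reading it with `pd.read_csv(..., index_col=0)` and comparing the index with `X_test.index`.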

After submitting `submission_test.csv` online, the test score was published under our ID. It is 0.622.

In [None]:
scores = pd.DataFrame(dict(train=score, 
                          test=0.622), 
                     index=['Pessimist'])
scores

Nice, the test score agrees well with the training score.
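
The accuracy of a constant prediction is simply the share of that class in the data, which is why the training and test scores are so close. The sketch below illustrates this on synthetic labels (the 62/38 split is an assumption for illustration, not the Titanic data), using scikit-learn's `DummyClassifier` as a stand-in for the `Pessimist`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels (assumption): 62 zeros and 38 ones
y_demo = np.array([0] * 62 + [1] * 38)
X_demo = np.zeros((len(y_demo), 1))  # the baseline ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))

# The accuracy equals the relative frequency of the majority class
majority_share = np.bincount(y_demo).max() / len(y_demo)
print(baseline_accuracy, majority_share)
```

Note that `DummyClassifier` learns the majority class from the data, whereas the `Pessimist` is hard-coded to predict 0; the two coincide only when 0 is the majority class.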

## Model on data
Results must be reproducible, i.e., multiple runs must produce the same result.

All `random_state` parameters (for models, data splitting, etc.) must be fixed - for reproducibility - based on your **random ID in `RandomID.csv`**. 

In [None]:
random_id = 1
# either the id itself or a RandomState object may be used, choose one variant
random_state = random_id
#random_state = np.random.RandomState(random_id)
#random_state = None
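
As a quick sanity check of what a fixed seed buys, the sketch below fits the same randomized model twice with the same integer `random_state` and compares the outputs. The synthetic data, the helper `fit_and_predict_proba`, and `demo_seed` are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption for illustration, not the Titanic set)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

def fit_and_predict_proba(seed):
    """Fit a small forest with the given seed and return class probabilities."""
    forest = RandomForestClassifier(n_estimators=20, random_state=seed)
    return forest.fit(X_demo, y_demo).predict_proba(X_demo)

demo_seed = 1  # stands in for the random ID
run_a = fit_and_predict_proba(demo_seed)
run_b = fit_and_predict_proba(demo_seed)

# Two fits with the same integer seed give identical outputs
same_seed_identical = np.array_equal(run_a, run_b)
print('identical with the same seed:', same_seed_identical)
```

An integer seed restarts the random sequence at every call, so each call is reproducible on its own. A shared `RandomState` object is consumed as it is used, so results then depend on the order in which cells are run.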

Let's create another model, based on the data.

In [None]:
features = ['Pclass']#, 'Age']

In [None]:
y_train = df['Survived']
X_train = df[features]
X_train

In [None]:
model = DecisionTreeClassifier(random_state=random_state)
model

Let's train the model. We evaluate it on the train set to assess overfitting.

In [None]:
model.fit(X_train, y_train)
score = accuracy_score(y_train, model.predict(X_train))
scores.loc['DT', 'train'] = score
score

Cross-validation can be used as a proxy for test performance.

In [None]:
cv = 7
s = cross_val_score(model, X_train, y_train, cv=cv)
m = s.mean()
sd = s.std()
scores.loc['DT', 'cv'] = m
# assume a Gaussian dist. of the fold scores
sem = sd/(len(s)**.5)  # standard error of the mean
l = m - 1.96*sem
u = m + 1.96*sem
print(f'CV score = {m:.3f}±{sd:.3f}, 95% CI [{l:.3f}, {u:.3f}], folds:', s)
#model.fit(X_train, y_train)
scores


For comparison:

In [None]:
scores.loc['Pessimist', 'cv'] = cross_val_score(Pessimist(), X_train, y_train, cv=cv).mean()
scores
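
With an integer `cv` and a classifier, `cross_val_score` uses a `StratifiedKFold` split without shuffling, so the folds above are deterministic. If shuffled folds are wanted, the splitter needs its own fixed `random_state`. The following is a minimal sketch on synthetic data; the data, `demo_seed`, and the use of the sample standard deviation (`ddof=1`, unlike the `s.std()` above) are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): one ordinal feature with values 1..3
rng = np.random.RandomState(0)
X_demo = rng.randint(1, 4, size=(210, 1))
y_demo = (rng.rand(210) < 0.7 - 0.15 * X_demo[:, 0]).astype(int)

demo_seed = 1  # stands in for the random ID
splitter = StratifiedKFold(n_splits=7, shuffle=True, random_state=demo_seed)
tree = DecisionTreeClassifier(random_state=demo_seed)

fold_scores = cross_val_score(tree, X_demo, y_demo, cv=splitter)
mean_score = fold_scores.mean()

# Standard error of the mean from the sample standard deviation (ddof=1)
standard_error = fold_scores.std(ddof=1) / np.sqrt(len(fold_scores))

# Normal approximation; with only 7 folds this interval is a rough guide
ci_low = mean_score - 1.96 * standard_error
ci_high = mean_score + 1.96 * standard_error
print(f'CV accuracy = {mean_score:.3f}, approx. 95% CI [{ci_low:.3f}, {ci_high:.3f}]')
```

Because the splitter is seeded with an integer, repeating the call yields the same folds and therefore the same scores.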

In [None]:
ax = scores[['train', 'cv', 'test']].T.plot.bar()
ax.grid()
x, dx = 2, .2
ax.hlines(y=0.775, xmin=x - dx, xmax=x + dx, linestyle='--', color='k')
ax.set_ylim([.55, 0.85])
ax.set_title('Accuracy comparison');
ax.legend(['Minimum', *scores.index]);  # optionally add loc='lower right'

The data-based model improves on the Pessimist in cross-validation, and no extreme overfitting occurs on the training set.

How will it perform on the test set? 

In [None]:
yh = model.predict(X_test[features])
yh = pd.Series(yh, X_test.index, name='Survived')
# save submission
yh.to_csv(data_path/'submission_test.csv') 

Ready to submit the predictions! Upload the submission file.