# Titanic: Machine Learning from Disaster

**Outline**

* [Read Data](#read)
* [EDA](#eda)
* [Preprocessing](#preprocess)
* [Model and Score](#model)
* [Prediction](#predict)
* [Reference](#reference)

---

In [286]:
%load_ext watermark
import os
import pandas as pd
import numpy as np
import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

In [297]:
%watermark -a 'Johnny' -d -t -v -p pandas,numpy,sklearn,watermark

Johnny 2017-12-17 14:06:48 

CPython 3.6.2
IPython 6.2.1

pandas 0.20.3
numpy 1.13.1
sklearn 0.19.1
watermark 1.5.0


## <a id="read">Read Data</a>

In [6]:
# read train data
data_dir = os.path.join('..', 'data')
path_train = os.path.join(data_dir, 'train.csv')
train = pd.read_csv(path_train)
train.head()

# read test data
path_test = os.path.join(data_dir, 'test.csv')
test = pd.read_csv(path_test)
test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [7]:
train.shape

(891, 12)

## <a id="eda">EDA</a>

* **survival**: Survival, [0 = No, 1 = Yes]

* **pclass**: Ticket class,	[1 = 1st, 2 = 2nd, 3 = 3rd]
    * A proxy for socio-economic status (SES)
        * 1st = Upper
        * 2nd = Middle
        * 3rd = Lower
* **sex**: Sex	
* **Age**: Age in years. Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
* **sibsp**: Number of siblings / spouses aboard the Titanic. 
    * Sibling = brother, sister, stepbrother, stepsister
    * Spouse = husband, wife (mistresses and fiancés were ignored)	
* **parch**: Number of parents / children aboard the Titanic	
    * Parent = mother, father
    * Child = daughter, son, stepdaughter, stepson
    * Note: Some children travelled only with a nanny, therefore parch=0 for them.
* **ticket**: Ticket number	
* **fare**: Passenger fare	
* **cabin**: Cabin number	
* **embarked**: Port of Embarkation, [C = Cherbourg, Q = Queenstown, S = Southampton]

**How many survivors are in our training data?**

We see that the two classes (survived or not) in the training data are not severely imbalanced, so we don't need to deal with class imbalance in this problem.
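
The balance can also be checked numerically by looking at the class counts and proportions. The sketch below uses a small made-up `survived` series as a stand-in for `train['Survived']`, so the numbers are illustrative only and are not the real Kaggle figures.

```python
import pandas as pd

# Hypothetical stand-in for train['Survived']; not the real Kaggle data
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1], name='Survived')

# Absolute counts and proportions per class
counts = survived.value_counts()
proportions = survived.value_counts(normalize=True)

print(counts)
print(proportions)
```

If one class made up only a few percent of the rows, resampling or class weights would be worth considering; proportions in the range shown here usually do not require that.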

<img src="pic/Survived.png" style="width: 600px;height: 450px;"/>



**What is the survival rate across different features?**

In all of the following plots, the survival rate differs somewhat across the values of each feature. Therefore, we decide to include all of these features when building our model.
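
The quantity behind each plot is the survival rate per group, which is the mean of the 0/1 `Survived` column within that group. The sketch below computes it on a made-up `toy` DataFrame (a hypothetical stand-in that only borrows the column names of the training set), so the resulting rates are illustrative.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Kaggle training set
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'Pclass': [3, 1, 3, 1, 3, 2],
    'Survived': [0, 1, 1, 1, 0, 1],
})

# The mean of a 0/1 column within each group is the survival rate of that group
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
rate_by_pclass = toy.groupby('Pclass')['Survived'].mean()

print(rate_by_sex)
print(rate_by_pclass)
```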

> **Age**

<img src="pic/Age vs Survival.png" style="width: 600px;height: 450px;"/>


> **Sex**

<img src="pic/Sex vs Survival.png" style="width: 600px;height: 450px;"/>


> **Pclass**

<img src="pic/Pclass vs Survival.png" style="width: 600px;height: 450px;"/>

> **SibSp**

<img src="pic/Sibsp vs Survival.png" style="width: 600px;height: 450px;"/>

> **Parch**

<img src="pic/Parch vs Survival.png" style="width: 600px;height: 450px;"/>

> **Fare**

<img src="pic/Fare vs Survival.png" style="width: 600px;height: 450px;"/>

> **Cabin**

<img src="pic/Cabin vs Survival.png" style="width: 600px;height: 450px;"/>

## <a id="preprocess">Preprocessing</a>

Since sklearn's ensemble models don't accept categorical features as input, we first need to transform all the categorical features into numeric values using a label encoder. We could also use one-hot encoding, but we save that for later.
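
To make the difference between the two encodings concrete, here is a minimal sketch on a made-up `Sex` column (the `toy` DataFrame is hypothetical). Label encoding maps each category to one integer, while one-hot encoding creates one indicator column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column; the notebook itself encodes 'Pclass' and 'Sex'
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# Label encoding: one integer per category (classes are sorted, so female=0, male=1)
le = LabelEncoder()
label_encoded = le.fit_transform(toy['Sex'])

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy['Sex'], prefix='Sex')

print(label_encoded)
print(one_hot)
```

Label encoding imposes an arbitrary order on the categories. Tree-based models can usually split around that, which is one reason it is a reasonable first choice here.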

**Label Encoder**

In [66]:
def preprocessor(train, test, response, sel_feature, cat_feature):
    """
    Preprocess the train and test datasets and return them.

    1. Select the features listed in `sel_feature`.
    2. Label encode the features listed in `cat_feature`.
    3. Impute missing values in `Age` and `Fare` with the median.

    Parameters
    ----------
    train : pandas DataFrame
        Training data, including the response column.

    test : pandas DataFrame
        Test data.

    response : str
        Name of the response column in `train`.

    sel_feature : list
        Features needed for later model training.

    cat_feature : list
        Categorical features that need to be label encoded.
    """

    # create response data
    y_train = train[response]

    # combine data to do label encoding; copy to avoid SettingWithCopyWarning
    X_train = train[sel_feature].copy()
    X_train['isTrain'] = 1
    X_test = test[sel_feature].copy()
    X_test['isTrain'] = 0
    X_all = pd.concat([X_train, X_test])

    # label encode on the combined data so train and test share the same codes
    le = LabelEncoder()
    for col in cat_feature:
        X_all[col] = le.fit_transform(X_all[col])

    # impute missing values in the selected features with the median
    X_all['Age'] = X_all['Age'].fillna(X_all['Age'].median())
    X_all['Fare'] = X_all['Fare'].fillna(X_all['Fare'].median())

    # separate train and test after label encoding and drop the helper column
    X_train = X_all[X_all['isTrain'] == 1].drop('isTrain', axis=1)
    X_test = X_all[X_all['isTrain'] == 0].drop('isTrain', axis=1)

    return X_train, y_train, X_test


In [67]:
response = 'Survived'
sel_feature = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
cat_feature = ['Pclass', 'Sex']

In [68]:
X_train, y_train, X_test = preprocessor(train, test, response, sel_feature, cat_feature)


## <a id="model">Model and Score</a>

In this section, we build models using bagging, random forest, and gradient boosting, treating the response variable (survived or not) as categorical, i.e., as a classification problem. After fitting a base model, we use GridSearchCV to tune its parameters.

To guard against overfitting, we use 10-fold cross-validation to estimate the test error and use that estimate as the criterion for selecting our best model. Note that the model with the best CV score is not necessarily the best model, since the CV score is only an estimate of the test error. The actual test error can only be obtained by scoring predictions for the actual test dataset on Kaggle.
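
As a minimal sketch of the 10-fold procedure described above: each fold is held out once, the model is trained on the remaining nine folds, and the held-out accuracies are averaged. The data here is synthetic (generated with `make_classification` as a stand-in for the preprocessed Titanic features), and the model settings are chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the preprocessed features and response (an assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# 10 contiguous folds; each is held out once while the model trains on the rest
kfold = KFold(n_splits=10)
model = RandomForestClassifier(n_estimators=20, random_state=7)
scores = cross_val_score(model, X, y, cv=kfold)

# The mean held-out accuracy is the CV estimate; the spread shows how noisy it is
print(scores.mean(), scores.std())
```

The standard deviation across folds is a useful reminder that small differences in mean CV score between two models may not be meaningful.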

In [220]:
def parameter_tuning(model, X_train, y_train, param_grid):
    """
    Tune a tree-based model using GridSearchCV and return the model refit with the best parameters.

    Parameters
    ----------
    model : sklearn ensemble tree model
        The model whose hyperparameters we want to tune.

    X_train : pandas DataFrame
        Preprocessed training data. All columns should be numeric.

    y_train : pandas Series
        Response values for the training data.

    param_grid : dict
        The parameters to tune for the corresponding model, with the values to try.

    Note
    ----
    * We pass a KFold splitter to GridSearchCV so that the CV score is consistent with the score
      we get from the other functions (fit_bagging, fit_randomforest and fit_gbm), which pass the
      same kind of splitter to model_selection.cross_val_score.
      The folds are not shuffled, so the splits are deterministic and random_state is not needed.
    """

#     if 'n_estimators' in param_grid:
#         model.set_params(warm_start=True)

    kfold = model_selection.KFold(n_splits=10)
    gs_model = GridSearchCV(model, param_grid, cv=kfold)
    gs_model.fit(X_train, y_train)

    # best hyperparameter setting
    print('best parameters:{}'.format(gs_model.best_params_))
    print('best score:{}'.format(gs_model.best_score_))

    # refit the model on the full training data with the best parameters
    model.set_params(**gs_model.best_params_)
    model.fit(X_train, y_train)

    return model
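
As a side note, `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`), so the tuned model is also available as `best_estimator_`. The sketch below shows this on synthetic data; the grid values and the 5-fold split are assumptions chosen to keep it quick, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the preprocessed training data (an assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Small illustrative grid; not the grid used in the notebook
param_grid = {'max_depth': [2, 4, None]}
gs = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=7),
                  param_grid, cv=KFold(n_splits=5))
gs.fit(X, y)

# With refit=True (the default), best_estimator_ is already fitted on all of X, y
best_model = gs.best_estimator_
print(gs.best_params_, gs.best_score_)
```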

**Bagging**

* **Fit a base model using default parameters**

In [210]:
def fit_bagging(X_train, y_train):
    """Bagged Decision Trees for Classification"""
    seed = 7
    # folds are not shuffled, so they are deterministic and need no random_state
    kfold = model_selection.KFold(n_splits=10)
    cart = DecisionTreeClassifier()
    model = BaggingClassifier(base_estimator=cart, random_state=seed)
    results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold)
    print(results.mean())
    
    model.fit(X_train, y_train)
    
    return(model)

In [211]:
fit_bagging(X_train, y_train)

0.817153558052


BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False,
         random_state=7, verbose=0, warm_start=False)

* **Parameter Tuning**

In [191]:
seed = 7
num_trees=100
cart = DecisionTreeClassifier()
bag = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=seed)

In [201]:
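# note: floats are fractions, while the integer 1 means a single sample / a single feature
# (use 1.0 to mean all samples / all features)
param_grid_bag_1 = {
    'max_samples': [0.1, 0.3, 0.5, 0.7, 0.9, 1],
    'max_features': [0.3, 0.5, 0.7, 0.9, 1]
}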

In [202]:
bag_2 = parameter_tuning(bag, X_train, y_train, param_grid_bag_1)

best parameters:{'max_features': 0.9, 'max_samples': 0.1}
best score:0.8395061728395061


In [249]:
param_grid_bag_2 = {
    'max_samples': [i/100 for i in range(5, 16, 1)],
    'max_features': [i/100 for i in range(70, 90, 1)]
}

In [250]:
bag_3 = parameter_tuning(bag_2, X_train, y_train, param_grid_bag_2)

best parameters:{'max_features': 0.84, 'max_samples': 0.1}
best score:0.8395061728395061




In [251]:
param_grid_bag_3 = {
    'n_estimators': [10, 20, 50, 100, 101, 110, 200]
}

In [252]:
bag_4 = parameter_tuning(bag_3, X_train, y_train, param_grid_bag_3)

best parameters:{'n_estimators': 100}
best score:0.8395061728395061




**Random Forest**

* **Fit a base model using default parameters**

In [180]:
def fit_randomforest(X_train, y_train):
    """Random Forest for Classification"""
    seed = 7
    # folds are not shuffled, so they are deterministic and need no random_state
    kfold = model_selection.KFold(n_splits=10)
    model = RandomForestClassifier(random_state=seed)
    results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold)
    print(results.mean())
    
    model.fit(X_train, y_train)
    
    return model

In [181]:
fit_randomforest(X_train, y_train)

0.808152309613


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=7, verbose=0, warm_start=False)

* **Parameter Tuning**

In [182]:
seed=7
num_trees=100
rf = RandomForestClassifier(n_estimators=num_trees, random_state=seed)

In [183]:
param_grid_rf_1 = {
    'max_depth': [None, 4, 6, 8, 10],
    'min_samples_leaf': [1, 3, 5, 7, 9],
    'max_features': ['auto', 'log2', None]
                  }

In [184]:
rf_2 = parameter_tuning(rf, X_train, y_train, param_grid_rf_1)

best parameters:{'max_depth': 8, 'max_features': None, 'min_samples_leaf': 1}
best score:0.8428731762065096


In [185]:
param_grid_rf_2 = {'max_depth': [6, 7, 8, 9, 10, None]}

In [186]:
rf_3 = parameter_tuning(rf_2, X_train, y_train, param_grid_rf_2)

best parameters:{'max_depth': 8}
best score:0.8428731762065096


In [187]:
param_grid_rf_3 = {'n_estimators': [100, 200, 300, 400, 500]}

In [188]:
rf_4 = parameter_tuning(rf_3, X_train, y_train, param_grid_rf_3)

best parameters:{'n_estimators': 100}
best score:0.8428731762065096




**Gradient Boosting Machine for Classification**

* **Fit a base model using default parameters**

In [76]:
def fit_gbm(X_train, y_train):
    """Gradient Boosting Machine for Classification"""
    seed = 7   
    # folds are not shuffled, so they are deterministic and need no random_state
    kfold = model_selection.KFold(n_splits=10)
    model = GradientBoostingClassifier(random_state=seed)
    results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold)
    print(results.mean())
    
    model.fit(X_train, y_train)
    
    return(model)

In [149]:
gbm_base = fit_gbm(X_train, y_train)

0.839550561798


In [173]:
gbm_base

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=7, subsample=1.0, verbose=0,
              warm_start=False)

* **Parameter Tuning**
    * Pick `n_estimators` as large as (computationally) possible (e.g. 3000)
    * Tune `max_depth`, `learning_rate`, `min_samples_leaf`, and `max_features` via grid search
    * Increase `n_estimators` even more and tune `learning_rate` again, holding the other parameters fixed
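
The staged strategy above can be sketched as follows. This is a scaled-down illustration under stated assumptions: the data is synthetic, the numbers of trees are kept small so it runs in seconds, and the grids are hypothetical rather than the ones used in the cells below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the preprocessed training data (an assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
kfold = KFold(n_splits=5)

# Step 1: fix a number of trees (kept small here so the sketch runs quickly)
gbm = GradientBoostingClassifier(n_estimators=50, random_state=7)

# Step 2: tune the tree-shape parameters with the number of trees held fixed
step2 = GridSearchCV(gbm, {'max_depth': [2, 3], 'min_samples_leaf': [1, 5]}, cv=kfold)
step2.fit(X, y)

# Step 3: increase the number of trees and re-tune the learning rate,
# holding the parameters found in step 2 fixed
gbm_tuned = GradientBoostingClassifier(n_estimators=100, random_state=7,
                                       **step2.best_params_)
step3 = GridSearchCV(gbm_tuned, {'learning_rate': [0.05, 0.1]}, cv=kfold)
step3.fit(X, y)

print(step2.best_params_, step3.best_params_)
```

Tuning in stages keeps each grid small; the trade-off is that interactions between parameters tuned in different stages are not explored.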


In [114]:
seed=7
num_trees=100
gbm = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)

In [169]:
param_grid_gbm_1 = {'max_depth': [3,  5,  7,  9, 11],                    
                    'min_samples_leaf': [1, 3, 5, 7, 9], 
                    'max_features': ['auto', 'log2', None, 0.1, 0.3, 0.5, 1],
                    'subsample': [0.1, 0.5, 1]
                   }

In [172]:
gbm_2 = parameter_tuning(gbm, X_train, y_train, param_grid_gbm_1)

best parameters:{'max_depth': 5, 'max_features': 'auto', 'min_samples_leaf': 1, 'subsample': 1}
best score:0.8451178451178452


In [174]:
param_grid_gbm_2 = {'max_depth': [3,4,5,6,7]}

In [175]:
gbm_3 = parameter_tuning(gbm_2, X_train, y_train, param_grid_gbm_2)

best parameters:{'max_depth': 4}
best score:0.8462401795735129


In [176]:
param_grid_gbm_3 = {
                    'n_estimators': [100, 200, 300, 400, 500, 1000],
                    'learning_rate': [0.001, 0.01, 0.05, 0.1]
                   }

In [177]:
gbm_4 = parameter_tuning(gbm_3, X_train, y_train, param_grid_gbm_3)

best parameters:{'learning_rate': 0.1, 'n_estimators': 100}
best score:0.8462401795735129


## <a id="predict">Prediction</a>

Generate the submission file for Kaggle using the best model from the previous section.

In [280]:
def make_prediction(model, X_test):
    """
    Generate a prediction DataFrame in the format required for a Kaggle submission,
    using the input model and test data.

    Note: PassengerId is taken from the global `test` DataFrame read earlier.
    """
    predict_y = model.predict(X_test)
    submission = pd.DataFrame({'PassengerId': test.PassengerId,
                               'Survived': predict_y})

    return submission


def make_submission(df, title):
    """Save the submission file to disk, prefixing the file name with the current date and time."""
    now = datetime.datetime.now()

    title_with_time = now.strftime("%m%d_%H%M") + "_" + title + ".csv"
    submission_dir = os.path.join('..', 'submission')
    path_submission = os.path.join(submission_dir, title_with_time)
    df.to_csv(path_submission, sep=',', index=False)
    print("File Saved")
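
Before uploading, it can help to confirm that a saved file has exactly the `PassengerId` and `Survived` columns and no index column. The sketch below writes a made-up three-row submission to a temporary directory (instead of `../submission`) and reads it back; the IDs and predictions are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical IDs and predictions; in the notebook these come from
# test.PassengerId and model.predict(X_test)
submission = pd.DataFrame({'PassengerId': [892, 893, 894],
                           'Survived': [0, 1, 0]})

# Write to a temporary directory so the sketch has no side effects on the project
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'example_submission.csv')
submission.to_csv(out_path, index=False)

# Read the file back to confirm it contains only the two expected columns
check = pd.read_csv(out_path)
print(check.columns.tolist())
```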

In [268]:
submission_base = make_prediction(gbm_base, X_test)

In [269]:
submission_base.Survived.value_counts()

0    276
1    142
Name: Survived, dtype: int64

In [281]:
make_submission(submission_base, 'submission_base_gbm')

File Saved


**Generate the submission file for the best model from Bagging**

In [270]:
submission_bag_best = make_prediction(bag_4, X_test)

In [271]:
submission_bag_best.Survived.value_counts()

0    281
1    137
Name: Survived, dtype: int64

In [282]:
make_submission(submission_bag_best, 'submission_bag_best')

File Saved


**Generate the submission file for the best model from Random Forest**

In [273]:
submission_rf_best = make_prediction(rf_4, X_test)

In [274]:
submission_rf_best.Survived.value_counts()

0    280
1    138
Name: Survived, dtype: int64

In [283]:
make_submission(submission_rf_best, 'submission_rf_best')

File Saved


**Generate the submission file for the best model from GBM**

In [277]:
submission_gbm_best = make_prediction(gbm_4, X_test)

In [278]:
submission_gbm_best.Survived.value_counts()

0    277
1    141
Name: Survived, dtype: int64

In [284]:
make_submission(submission_gbm_best, 'submission_gbm_best')

File Saved


## <a id="reference">Reference</a>

* [Sklearn Bagging Classifier Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)
* [Sklearn Random Forest Classifier Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [Sklearn Gradient Boosting Classifier Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
* [Sklearn GridSearchCV Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
* [Sklearn KFold Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)