In this notebook, we introduce a simple method of ensembling (combining) base learning models, specifically the variant known as stacking. In a nutshell, stacking first trains a few basic classifiers at the first level (the base level), then uses another model at the second level to predict the output from those first-level predictions.
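For orientation, the two-level idea can be sketched with scikit-learn's built-in `StackingClassifier` (available from scikit-learn 0.22). This is not the approach taken in this notebook, which builds the stack by hand; the synthetic data from `make_classification`, the choice of base models and the `LogisticRegression` second level are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the Titanic set used in this notebook)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# First level: two base classifiers; second level: a logistic regression
stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('svc', SVC(kernel='linear', C=0.025, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # base-model outputs fed to the second level come from 5-fold cross-validation
)
stack.fit(X, y)
stacked_predictions = stack.predict(X)
```

The `cv=5` argument plays the same role as the out-of-fold procedure written by hand later in this notebook.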

# 1. Loading the libraries and data sets

In [1]:
#load libraries
import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.cross_validation import KFold #legacy module: removed in scikit-learn 0.20 in favour of sklearn.model_selection

#load dataset
train = pd.read_csv(r'C:\Users\LW130003\Documents\GitHub\titanic\train_modified.csv')
test = pd.read_csv(r'C:\Users\LW130003\Documents\GitHub\titanic\test_modified.csv')

train.head()



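| | Pclass | Survived | IsAlone | Embarked_C | Embarked_Q | Embarked_S | Sex_female | Sex_male | AgeBand_(-0.08, 16.0] | AgeBand_(16.0, 32.0] | ... | FareBand_(-0.512, 102.4] | FareBand_(102.4, 204.8] | FareBand_(204.8, 307.2] | FareBand_(307.2, 409.6] | FareBand_(409.6, 512.0] | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Rare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0.0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1.0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 1.0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1.0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 3 | 0.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |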


# 2. Ensemble and Stacking models

We create a helper class to make the code below more convenient.
A class bundles data together with the functions (methods) that operate on it, so the same behaviour can be reused across many objects.

Below we write a class SklearnHelper that wraps the methods (such as train, predict and fit) common to all the Sklearn classifiers. This cuts out redundancy, as we don't need to write the same methods five times to invoke five different classifiers.

In [2]:
#some useful parameters which will come in handy later on
ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0 #for reproducibility
NFOLDS = 5 #set folds for out of fold prediction
kf = KFold(ntrain, n_folds=NFOLDS, random_state=SEED)

#class to extend the Sklearn classifier
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        #copy the parameters so the caller's dict is not modified (also handles params=None)
        params = dict(params or {})
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        return self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def feature_importances(self, x, y):
        #refit on (x, y) and return the importances as a plain list
        return list(self.clf.fit(x, y).feature_importances_)

**`__init__`**: Python's standard name for a class constructor. When you create an object (classifier), you pass in clf (the sklearn classifier class you want), seed (the random seed) and params (a dictionary of parameters for that classifier).
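The constructor relies on two Python features: a class can be passed around like any other value, and `**` unpacks a dictionary into keyword arguments. A minimal sketch of that mechanism follows; `build_classifier` and `user_params` are hypothetical names used only for this illustration.

```python
from sklearn.ensemble import RandomForestClassifier

def build_classifier(clf_class, seed=0, params=None):
    """Instantiate clf_class with params plus a fixed random_state."""
    settings = dict(params or {})      # copy, so the caller's dict stays untouched
    settings['random_state'] = seed
    return clf_class(**settings)       # same as clf_class(n_estimators=..., random_state=...)

user_params = {'n_estimators': 10, 'max_depth': 3}
model = build_classifier(RandomForestClassifier, seed=42, params=user_params)
```

Because the class itself is an argument, the same function works unchanged for any of the classifiers imported at the top of the notebook.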


**Out-of-Fold predictions**
Stacking uses the predictions of the base classifiers as training input for a second-level model. However, one cannot simply train the base models on the full training data, generate predictions on the full test set and then output these for the second-level training. This runs the risk of the base-model predictions already having "seen" the rows they are predicting, and the second-level model then overfitting when fed these predictions. Out-of-fold predictions avoid this: each training row is predicted by a model that was fitted on the other folds only.

In [3]:
def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))
    
    for i, (train_index, test_index) in enumerate(kf):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]
        
        clf.train(x_tr, y_tr)
        
        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i,:] = clf.predict(x_test)
        
    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1,1), oof_test.reshape(-1,1)
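The function above iterates over the legacy `sklearn.cross_validation.KFold` object directly. In scikit-learn 0.20 and later that module no longer exists, and folds are produced by `KFold(n_splits=...).split(X)` from `sklearn.model_selection`. A minimal sketch of the same out-of-fold loop under that newer API is given below; `get_oof_sketch` and the synthetic data are assumptions for illustration, and the model is any object with `fit` and `predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def get_oof_sketch(model, x_train, y_train, x_test, n_folds=5):
    """Return out-of-fold train predictions and fold-averaged test predictions."""
    kfold = KFold(n_splits=n_folds)  # no shuffling, so no random_state is passed
    oof_train = np.zeros(x_train.shape[0])
    oof_test_folds = np.empty((n_folds, x_test.shape[0]))
    for i, (fit_idx, holdout_idx) in enumerate(kfold.split(x_train)):
        model.fit(x_train[fit_idx], y_train[fit_idx])
        # each training row is predicted by a model that never saw it
        oof_train[holdout_idx] = model.predict(x_train[holdout_idx])
        oof_test_folds[i, :] = model.predict(x_test)
    # average the per-fold test predictions into one column
    return oof_train.reshape(-1, 1), oof_test_folds.mean(axis=0).reshape(-1, 1)

# Synthetic stand-in data (assumption): 100 training rows, 20 test rows
X, y = make_classification(n_samples=120, n_features=6, random_state=0)
x_tr, y_tr, x_te = X[:100], y[:100], X[100:]
demo_oof_train, demo_oof_test = get_oof_sketch(
    RandomForestClassifier(n_estimators=20, random_state=0), x_tr, y_tr, x_te)
```

Note that the test column is an average of class labels over the folds, so it can take fractional values between 0 and 1.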

# 3. First-Level Predictions

**Generating our Base First-Level Models**
We prepare 5 learning models as our first-level classifiers:
1. Random Forest Classifier
2. Extra Trees Classifier
3. AdaBoost Classifier
4. Gradient Boosting Classifier
5. Support Vector Machine

**Parameters**
1. **n_jobs** = Number of cores used for the training process. If set to -1, all cores are used.
2. **n_estimators** = Number of classification trees in your learning model (the default was 10 in older scikit-learn releases and is 100 since version 0.22).
3. **max_depth** = Maximum depth of a tree, or how much a node should be expanded. Beware: setting this too high runs the risk of overfitting, as the tree would be grown too deep.
4. **verbose** = Controls whether any text is output during the learning process. A value of 0 suppresses all text while a value of 3 outputs the tree learning process at every iteration.
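To make the warning about max_depth concrete, the sketch below compares a shallow tree with an unrestricted one on noisy synthetic data. It uses a single `DecisionTreeClassifier` rather than the ensembles of this notebook, and the data set and split are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly flipped labels (assumption, not the Titanic set)
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

scores = {}
for depth in (2, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_fit, y_fit)
    # store (accuracy on the data it was fitted on, accuracy on held-out data)
    scores[depth] = (tree.score(X_fit, y_fit), tree.score(X_val, y_val))
    print(depth, scores[depth])
```

A large gap between the two accuracies for the unrestricted tree is the usual symptom of overfitting that the max_depth limit is meant to control.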

In [4]:
#put in our parameters for said classifiers

#random forest parameters
rf_params = {'n_jobs' : -1,
             'n_estimators' : 500,
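             #caution: with warm_start=True a repeated fit() reuses the existing trees (see the warning in the output below)
             'warm_start' : True,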
             #'max_features' : 0.2,
             'max_depth' : 6,
             'min_samples_leaf' : 2,
             'max_features' : 'sqrt',
             'verbose' : 0
}

#extra trees parameters
et_params = {'n_jobs' : -1,
             'n_estimators' : 500,
             #'max_features' : 0.5,
             'max_depth' : 8,
             'min_samples_leaf' : 2,
             'verbose' : 0
}

#adaboost parameters
ada_params = {'n_estimators' : 500,
              'learning_rate' : 0.75
}

#gradient boosting parameters
gb_params = {'n_estimators' : 500,
             #'max_features' : 0.2,
             'max_depth' : 5,
             'min_samples_leaf' : 2,
             'verbose' : 0    
}

#support vector classifier parameters
svc_params = {'kernel' : 'linear',
              'C' : 0.025
}

In [5]:
#create 5 objects that represent our 5 models
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

**Create NumPy arrays out of our train and test sets**

In [6]:
#create NumPy arrays out of train, test and target (Survived) dataframes to feed into our models
y_train = train['Survived'].values #1-D array of labels (Series.ravel() is deprecated in recent pandas)
train = train.drop(['Survived'], axis=1)
x_train = train.values #create an array of the train data
x_test = test.values #create an array of the test data

**Output of the First Level Predictions**

In [7]:
#create our oof train and test predictions. These base results will be used as new features
rf_oof_train, rf_oof_test = get_oof(rf, x_train, y_train, x_test) #random forest
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) #extra tree
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) #adaboost
gb_oof_train, gb_oof_test = get_oof(gb, x_train, y_train, x_test) #gradient boost
svc_oof_train, svc_oof_test = get_oof(svc, x_train, y_train, x_test) #support vector classifier

print('Training is complete')


Warm-start fitting without increasing n_estimators does not fit new trees.



Training is complete


**Feature Importances generated from different classifiers**

In [30]:
rf_feature = rf.feature_importances(x_train, y_train)
et_feature = et.feature_importances(x_train, y_train)
ada_feature = ada.feature_importances(x_train, y_train)
gb_feature = gb.feature_importances(x_train, y_train)

#create dataframe with features
cols = train.columns.values
feature_dataframe = pd.DataFrame({
    'features' : cols,
    'Random Forest feature importances' : rf_feature,
    'Extra Trees feature importances' : et_feature,
    'Adaboost feature importances' : ada_feature,
    'Gradient Boost feature importances' : gb_feature
})
feature_dataframe =  feature_dataframe[['features', 'Random Forest feature importances',
                                       'Extra Trees feature importances',
                                       'Adaboost feature importances',
                                       'Gradient Boost feature importances']]


Warm-start fitting without increasing n_estimators does not fit new trees.
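Fitted tree ensembles in scikit-learn expose a `feature_importances_` attribute, which is what the `feature_importances` helper returns as a list. A minimal sketch on synthetic data (an assumption, not the Titanic features) shows how the values can be inspected and ranked:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, of which 2 carry signal (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_   # one non-negative value per feature
ranking = np.argsort(importances)[::-1]     # feature indices, most important first
print(importances.round(3), ranking)
```

For a random forest the importances are normalised to sum to 1, so they read as relative shares rather than absolute effects.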



**Interactive feature importances via Plotly scatterplots**

In [60]:
#scatter plot
for i in list(range(1,5)):
    trace = go.Scatter(
        x = feature_dataframe['features'].values,
        y = feature_dataframe[feature_dataframe.columns[i]].values,
        mode = 'markers',
        marker = dict(
            sizemode = 'diameter',
            sizeref = 1,
            size = 25,
            #size = feature_dataframe['AdaBoost feature importances'].values,
            #color = np.random.randn(500) #set color equal to a variable
            color = feature_dataframe[feature_dataframe.columns[i]].values,
            colorscale='Portland',
            showscale = True
        ),
        text = feature_dataframe['features'].values
    )
    data = [trace]
    
    layout = go.Layout(
        autosize=True,
        title=feature_dataframe.columns[i]
        #hovermode='closest',
        #xaxis = dict(
        #    title = 'Pop',
        #    ticklen = 5,
        #    zeroline = False,
        #    gridwidth = 2,
        #),
        #yaxis = dict(
        #    title = 'Feature Importance',
        #   ticklen = 5,
        #    gridwidth = 2,
        #),
        #showlegend = False
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='scatter2010')

In [61]:
#create new column containing the average of values
feature_dataframe['mean'] = feature_dataframe.mean(axis=1, numeric_only=True) #axis=1 computes the mean row-wise; numeric_only skips the text column 'features'
feature_dataframe.head(3)

| | features | Random Forest feature importances | Extra Trees feature importances | Adaboost feature importances | Gradient Boost feature importances | mean |
|---|---|---|---|---|---|---|
| 0 | Pclass | 0.176657 | 0.162531 | 0.028 | 0.201791 | 0.142245 |
| 1 | IsAlone | 0.028248 | 0.025395 | 0.018 | 0.194322 | 0.066491 |
| 2 | Embarked_C | 0.027389 | 0.024142 | 0.002 | 0.119896 | 0.043357 |
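The row-wise mean can be checked on a tiny frame. In the sketch below, the column names `model_a` and `model_b` and their values are made up for illustration; `numeric_only=True` makes pandas skip the text column, which recent pandas versions would otherwise refuse to average.

```python
import numpy as np
import pandas as pd

# Toy importances for two hypothetical models (values are made up)
toy = pd.DataFrame({
    'features': ['f1', 'f2'],
    'model_a': [0.2, 0.4],
    'model_b': [0.4, 0.8],
})

# axis=1 averages across the columns of each row; the text column is ignored
toy['mean'] = toy.mean(axis=1, numeric_only=True)
print(toy)
```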


**Plotly Barplot of Average Feature Importance**

In [76]:
#sort once so that bars, labels and colours all share the same order
sorted_features = feature_dataframe.sort_values(by='mean', ascending=False)
y = sorted_features['mean'].values
x = sorted_features['features'].values
data = [go.Bar(
    x=x,
    y=y,
    width=0.5,
    marker=dict(
        color = y, #colour each bar by its own (sorted) mean importance
        colorscale='Portland',
        showscale=True,
        reversescale=False
    ),
    opacity=0.6
)]

layout = go.Layout(
    autosize=True,
    title='Barplot of Mean Feature Importance',
        #hovermode='closest',
        #xaxis = dict(
        #    title = 'Pop',
        #    ticklen = 5,
        #    zeroline = False,
        #    gridwidth = 2,
        #),
        #yaxis = dict(
        #    title = 'Feature Importance',
        #   ticklen = 5,
        #    gridwidth = 2,
        #),
        #showlegend = False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='bar-direct-labels')

# 4. Second-Level Predictions from the First-Level Output

**First-level output as new features**

In [78]:
base_predictions_train = pd.DataFrame({
    'RandomForest' : rf_oof_train.ravel(),
    'ExtraTrees' : et_oof_train.ravel(),
    'AdaBoost' : ada_oof_train.ravel(),
    'GradientBoost' : gb_oof_train.ravel(),
})
base_predictions_train.head()

| | AdaBoost | ExtraTrees | GradientBoost | RandomForest |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |


**Correlation Heatmap of the Second Level Training Set**

In [79]:
data = [go.Heatmap(
    z = base_predictions_train.astype(float).corr().values,
    x = base_predictions_train.columns.values,
    y = base_predictions_train.columns.values,
    colorscale='Portland',
    showscale=True,
    reversescale=True
)]
py.iplot(data,filename='labelled-heatmap')

In [80]:
x_train = np.concatenate((et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate((et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)
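Each out-of-fold result is a column vector of shape (n_rows, 1), so concatenating along `axis=1` places the models side by side, one column per base model. A minimal sketch with made-up predictions for three rows and two hypothetical models:

```python
import numpy as np

# Made-up out-of-fold predictions from two hypothetical base models
model_a_oof = np.array([[0.0], [1.0], [1.0]])
model_b_oof = np.array([[0.0], [1.0], [0.0]])

# axis=1 joins the column vectors horizontally: rows stay rows, models become columns
second_level_features = np.concatenate((model_a_oof, model_b_oof), axis=1)
print(second_level_features.shape)
```

The resulting matrix has one row per passenger and one feature per base model, which is the input format the second-level model expects.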

**Second level learning model via XGBoost**

I won't go into detail on, or tune, XGBoost here. How to tune XGBoost is covered in my other exercise [loan_prediction_iii](https://github.com/LW130003/loan-prediction-iii).

In [82]:
gbm = xgb.XGBClassifier(
    #learning_rate = 0.02,
    n_estimators = 2000,
    max_depth = 4,
    min_child_weight = 2,
    #gamma =1,
    gamma = 0.9,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=-1,
    scale_pos_weight=1)
model = gbm.fit(x_train,y_train)
predictions = model.predict(x_test)