
Blending and **stacking** are two closely related approaches to model ensembling.

In both, several prediction models are combined into a single prediction model, with the goal of improving predictive performance.

Both train several base models on the same training data and then use the outputs of those models as input features for a meta classifier that predicts the final result. They differ in where those outputs come from: stacking typically uses out-of-fold predictions from cross-validation, while blending uses predictions on a separate holdout set.

For example, in binary classification you can train an SVM model and a decision tree. You can then use the outputs of the SVM and the decision tree to train a meta classifier such as logistic regression.
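The SVM, decision tree and logistic regression example can be sketched by hand as follows. This is a minimal illustration on synthetic data (an assumption, not the EEG data used in this notebook), and it is roughly the idea behind scikit-learn's `StackingClassifier`, not a replacement for it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumed for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

base_models = [
    SVC(probability=True, random_state=0),
    DecisionTreeClassifier(max_depth=4, random_state=0),
]

# Out-of-fold probabilities: every row is predicted by a model that did not
# see that row during training, so the meta features are not overly optimistic.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta classifier learns how to combine the base model outputs.
meta_clf = LogisticRegression().fit(meta_features, y)

print(meta_features.shape)  # one column per base model
print(meta_clf.coef_)       # learned weight for each base model output
```

To predict on new data, the base models would be refit on the full training set and their probabilities passed to `meta_clf`.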

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn import metrics



In [2]:
trainset = pd.read_csv("https://raw.githubusercontent.com/sdgroeve/Machine_Learning_course_UGent_D012554_kaggle/master/eeg_train.csv")
testset = pd.read_csv("https://raw.githubusercontent.com/sdgroeve/Machine_Learning_course_UGent_D012554_kaggle/master/eeg_test.csv")

features = trainset.copy()
features.pop('label')
feature_names = list(features.columns)

test_features = testset.copy()
test_features.pop('index')
test_feature_names = list(test_features.columns)
features.describe()

Unnamed: 0,AF3,F7,F3,FC5,T7,P7,O1,02,P8,T8,FC6,F4,F8,AF4
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,4300.157125,4009.27315,4263.86086,4122.616195,4341.60687,4620.06172,4072.15125,4615.2293,4200.893915,4230.573235,4201.58306,4278.445325,4605.169335,4359.85278
std,36.361719,29.853264,20.788323,20.565528,16.691038,18.034865,20.933632,18.391027,17.810272,19.661149,24.397269,19.645651,33.067591,37.074555
min,4197.95,3905.64,4202.56,4058.46,4310.26,4569.74,4032.82,4571.28,4147.69,4158.97,4107.18,4216.41,4454.36,4225.64
25%,4280.51,3990.77,4250.26,4108.72,4331.79,4611.79,4057.44,4604.1,4190.26,4219.49,4189.74,4267.18,4590.6425,4342.05
50%,4293.33,4006.15,4262.56,4121.03,4338.46,4617.95,4069.74,4612.82,4199.49,4228.72,4200.0,4276.41,4603.08,4354.36
75%,4309.74,4023.59,4270.26,4133.46,4347.18,4626.15,4083.59,4623.08,4209.23,4238.97,4211.28,4286.15,4617.95,4371.79
max,4497.44,4152.82,4385.64,4234.36,4452.82,4754.87,4174.87,4731.28,4315.38,4352.31,4325.64,4397.95,4796.92,4538.97


# Creating ensembles from submission files

The most basic and convenient way to ensemble is to combine Kaggle submission CSV files. These methods only need the predictions on the test set, so there is no need to retrain a model. This makes it a quick way to ensemble existing model predictions, which is ideal when teaming up.
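As a minimal sketch of this idea, the snippet below averages the `label` column of two submission files. The file contents are hypothetical and inlined so the example runs as is; in practice they would be read from CSV files on disk (names such as `submission_a.csv` are assumptions):

```python
import io
import pandas as pd

# Two hypothetical submission files with the same 'index,label' layout
# as the submissions written in this notebook.
submission_a = io.StringIO("index,label\n0,0.10\n1,0.80\n2,0.40\n")
submission_b = io.StringIO("index,label\n0,0.30\n1,0.60\n2,0.90\n")

frames = [pd.read_csv(f).set_index("index") for f in (submission_a, submission_b)]

# Align the files on 'index' and average the predicted probabilities.
blended = pd.concat([f["label"] for f in frames], axis=1).mean(axis=1)
blended = blended.rename("label").reset_index()

print(blended)
```

A plain mean is only one option; a weighted mean or an average of ranks are common alternatives when the metric is AUC.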

In [0]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier

estimators = [('SVC',make_pipeline(StandardScaler(),
                                   SVC(C=10,gamma=1, probability=True))),
              ('rf', RandomForestClassifier(criterion='entropy',
                                            n_estimators=250, max_depth=30, 
                                            bootstrap=True)),
               ('boost', XGBClassifier(n_estimators= 900, gamma=0,
                                       max_depth=7, min_child_weight=0,
                                       learning_rate=0.1, subsample=0.85,
                                       colsample_bytree=0.9)),]

clf = StackingClassifier(estimators=estimators, 
                         final_estimator=LogisticRegression(),cv=10)

In [10]:
model = clf
score = cross_val_score(model, features, trainset.label,scoring='roc_auc', cv= 10).mean()
print('AUC score for trainset: '+ str(score))

AUC score for trainset: 0.9681853589835242


Yes, this gives my highest cross-validation score so far, so I'm going to upload predictions from this model:

In [6]:
model.fit(features,trainset.label)
predictions = model.predict_proba(test_features)[:,1]

sample_submission = pd.DataFrame({'index': testset['index'], 'label': predictions})
sample_submission.head()

Unnamed: 0,index,label
0,0,0.02333
1,1,0.191001
2,2,0.023217
3,3,0.981099
4,4,0.972256


In [0]:
filename = "my_prediction_results5.csv"
sample_submission.to_csv(filename,index=False)

I can explore this new technique a little further by adding other base models:

In [0]:
estimators = [('SVC',make_pipeline(StandardScaler(),
                                   SVC(C=10,gamma=1, probability=True))),
              ('rf', RandomForestClassifier(criterion='entropy',
                                            n_estimators=250, max_depth=30, 
                                            bootstrap=True)),
               ('boost', XGBClassifier(n_estimators= 900, gamma=0,
                                       max_depth=7, min_child_weight=0,
                                       learning_rate=0.1, subsample=0.85,
                                       colsample_bytree=0.9)),
              ('regression', make_pipeline(StandardScaler(),
                                           LogisticRegression(C=100)))]

clf = StackingClassifier(estimators=estimators, 
                         final_estimator=LogisticRegression(),cv=10)

In [6]:
model = clf
score = cross_val_score(model, features, trainset.label,scoring='roc_auc', cv= 10).mean()
print('AUC score for trainset: '+ str(score))

AUC score for trainset: 0.9675489759985172


In [0]:
estimators = [('SVC',make_pipeline(StandardScaler(),
                                   SVC(C=10,gamma=1, probability=True))),
              ('rf', RandomForestClassifier(criterion='entropy',
                                            n_estimators=250, max_depth=30, 
                                            bootstrap=True)),
               ('boost', XGBClassifier(n_estimators= 900, gamma=0,
                                       max_depth=7, min_child_weight=0,
                                       learning_rate=0.1, subsample=0.85,
                                       colsample_bytree=0.9)),
              ('regression', make_pipeline(StandardScaler(),
                                           LogisticRegression(C=100)))]

clf = StackingClassifier(estimators=estimators, 
                         final_estimator=SVC(),cv=10)

In [8]:
model = clf
score = cross_val_score(model, features, trainset.label,scoring='roc_auc', cv= 10).mean()
print('AUC score for trainset: '+ str(score))

AUC score for trainset: 0.9312414751405577


In [0]:
from sklearn.neighbors import KNeighborsClassifier

estimators = [('SVC',make_pipeline(StandardScaler(),
                                   SVC(C=10,gamma=1, probability=True))),
              ('rf', RandomForestClassifier(criterion='entropy',
                                            n_estimators=250, max_depth=30, 
                                            bootstrap=True)),
               ('boost', XGBClassifier(n_estimators= 900, gamma=0,
                                       max_depth=7, min_child_weight=0,
                                       learning_rate=0.1, subsample=0.85,
                                       colsample_bytree=0.9)),
              ('neighbors', KNeighborsClassifier(n_neighbors=4,
                                                 weights='distance')),
              ('regression', make_pipeline(StandardScaler(),
                                           LogisticRegression(C=100)))]

Stack = StackingClassifier(estimators=estimators, 
                         final_estimator=LogisticRegression(),cv=10)

In [13]:
model = Stack  # the stacking model defined in the previous cell
score = cross_val_score(model, features, trainset.label,scoring='roc_auc', cv= 10).mean()
print('AUC score for trainset: '+ str(score))

AUC score for trainset: 0.9716507406966123


With the addition of the KNeighborsClassifier, this is now the highest score, so I plan to upload predictions from this model later today.

In [14]:
model.fit(features,trainset.label)
predictions = model.predict_proba(test_features)[:,1]

sample_submission = pd.DataFrame({'index': testset['index'], 'label': predictions})
sample_submission.head()

Unnamed: 0,index,label
0,0,0.019745
1,1,0.437415
2,2,0.020058
3,3,0.984825
4,4,0.979247


In [0]:
# Note: this file is actually the 7th attempt.
# For the 6th attempt, see notebook 9 (voting classifier) and "my_prediction_results6_real.csv".
filename = "my_prediction_results6.csv"
sample_submission.to_csv(filename, index=False)

In [23]:
Stack.final_estimator

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

First, I'm going to see whether I can push this technique a bit further:

In [0]:
from sklearn.tree import DecisionTreeClassifier

estimators = [('SVC',make_pipeline(StandardScaler(),
                                   SVC(C=10,gamma=1, probability=True))),
              ('rf', RandomForestClassifier(criterion='entropy',
                                            n_estimators=250, max_depth=30, 
                                            bootstrap=True)),
               ('boost', XGBClassifier(n_estimators= 900, gamma=0,
                                       max_depth=7, min_child_weight=0,
                                       learning_rate=0.1, subsample=0.85,
                                       colsample_bytree=0.9)),
              ('neighbors', KNeighborsClassifier(n_neighbors=4,
                                                 weights='distance')),
              ('tree', DecisionTreeClassifier(max_depth=8,
                                              criterion = 'entropy')),
              ('regression', make_pipeline(StandardScaler(),
                                           LogisticRegression(C=100)))]

clf = StackingClassifier(estimators=estimators, 
                         final_estimator=LogisticRegression(),cv=10)

In [17]:
model = clf
score = cross_val_score(model, features, trainset.label,scoring='roc_auc', cv= 10).mean()
print('AUC score for trainset: '+ str(score))

AUC score for trainset: 0.9712471840086521


No, this was not a success. Back to the previous model, which scored 0.9716.

Maybe we can try a **voting classifier**, which also includes this stacking model:

In [0]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# Training classifiers
clf1 = make_pipeline(StandardScaler(), SVC(C=10,gamma=1, probability=True))
clf2 = RandomForestClassifier(criterion='entropy', n_estimators=250, 
                              max_depth=30, random_state=0)
clf3 = XGBClassifier(n_estimators= 900, gamma=0, max_depth=7, 
                     min_child_weight=0, learning_rate=0.1, 
                     subsample=0.85, colsample_bytree=0.9)
clf4 = KNeighborsClassifier(n_neighbors=4, weights='distance')
clf5 = make_pipeline(StandardScaler(),LogisticRegression(C=100))
clf6 = Stack

eclf = VotingClassifier(estimators=[('svc', clf1), ('rf', clf2), 
                                    ('xgb', clf3), ('kkn', clf4), 
                                    ('reg', clf5), ('stack', clf6)]
                        , voting='soft')

In [6]:
for clf, name in zip([clf1, clf2, clf3, clf4, clf5, clf6, eclf], 
                      ['SVC', 'Random Forest', 'XGBoost', 'KNeighbors', 
                       'Logistic regression', 'stack', 'Ensemble']):
  scores = cross_val_score(clf, features, trainset.label, 
                           scoring='roc_auc', cv=10)
  print("AUC score: %0.5f (+/- %0.5f) [%s]" % (scores.mean(), scores.std(), name))

AUC score: 0.96253 (+/- 0.01182) [SVC]
AUC score: 0.93205 (+/- 0.01971) [Random Forest]
AUC score: 0.94602 (+/- 0.02002) [XGBoost]
AUC score: 0.95938 (+/- 0.01119) [KNeighbors]
AUC score: 0.65963 (+/- 0.03643) [Logistic regression]
AUC score: 0.97165 (+/- 0.01089) [stack]
AUC score: 0.96852 (+/- 0.01331) [Ensemble]


Oh, this doesn't seem to perform better than the stacked model.

I'll assign some weights to the different submodels, just to see what difference it makes:

In [7]:
votingClass = VotingClassifier(estimators=[('svc', clf1), ('rf', clf2), 
                                    ('xgb', clf3), ('kkn', clf4), 
                                    ('reg', clf5), ('stack', clf6)]
                        , voting='soft', weights=[2,1,1,2,1,2])

for clf, name in zip([clf1, clf2, clf3, clf4, clf5, clf6, votingClass], 
                      ['SVC', 'Random Forest', 'XGBoost', 'KNeighbors', 
                       'Logistic regression', 'stack', 'Ensemble']):
  scores = cross_val_score(clf, features, trainset.label, 
                           scoring='roc_auc', cv=10)
  print("AUC score: %0.5f (+/- %0.5f) [%s]" % (scores.mean(), scores.std(), name))

AUC score: 0.96253 (+/- 0.01182) [SVC]
AUC score: 0.93205 (+/- 0.01971) [Random Forest]
AUC score: 0.94602 (+/- 0.02002) [XGBoost]
AUC score: 0.95938 (+/- 0.01119) [KNeighbors]
AUC score: 0.65963 (+/- 0.03643) [Logistic regression]
AUC score: 0.97163 (+/- 0.01101) [stack]
AUC score: 0.97092 (+/- 0.01219) [Ensemble]


This doesn't work as well as I thought.
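For reference, soft voting with weights amounts to a weighted average of the predicted probabilities of the submodels. The sketch below uses made-up probabilities from three hypothetical models (not the models in this notebook) to show how a weight vector shifts the combined prediction:

```python
import numpy as np

# Hypothetical positive-class probabilities: one row per model,
# one column per sample.
probas = np.array([
    [0.9, 0.2, 0.6, 0.4],  # model 1
    [0.7, 0.4, 0.5, 0.1],  # model 2
    [0.8, 0.1, 0.3, 0.7],  # model 3
])
weights = [2, 1, 1]  # model 1 counts twice as much as the others

# Soft voting: average the probabilities across models.
unweighted = probas.mean(axis=0)
weighted = np.average(probas, axis=0, weights=weights)

print(unweighted)
print(weighted)
```

Because AUC depends only on the ranking of the predictions, weights matter only when they change the order of the averaged probabilities.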

Now I repeat exactly the same experiment without the logistic regression model, once without assigned weights and once with them.

In [0]:
from sklearn.tree import DecisionTreeClassifier

estimators = [('SVC',make_pipeline(StandardScaler(),
                                   SVC(C=10,gamma=1, probability=True))),
              ('rf', RandomForestClassifier(criterion='entropy',
                                            n_estimators=250, max_depth=30, 
                                            bootstrap=True)),
               ('boost', XGBClassifier(n_estimators= 900, gamma=0,
                                       max_depth=7, min_child_weight=0,
                                       learning_rate=0.1, subsample=0.85,
                                       colsample_bytree=0.9)),
              ('neighbors', KNeighborsClassifier(n_neighbors=4,
                                                 weights='distance')),
              ('tree', DecisionTreeClassifier(max_depth=8,
                                              criterion = 'entropy'))]

stack = StackingClassifier(estimators=estimators, 
                         final_estimator=LogisticRegression(),cv=10)

In [10]:
model = stack
score = cross_val_score(model, features, trainset.label,scoring='roc_auc', cv= 10).mean()
print('AUC score for trainset: '+ str(score))

AUC score for trainset: 0.9713378222185562


In [0]:
clf1 = make_pipeline(StandardScaler(), SVC(C=10,gamma=1, probability=True))
clf2 = RandomForestClassifier(criterion='entropy', n_estimators=250, 
                              max_depth=30, random_state=0)
clf3 = XGBClassifier(n_estimators= 900, gamma=0, max_depth=7, 
                     min_child_weight=0, learning_rate=0.1, 
                     subsample=0.85, colsample_bytree=0.9)
clf4 = KNeighborsClassifier(n_neighbors=4, weights='distance')
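# Note: `Stack` (capital S) is the stacking model defined earlier, which still
# has logistic regression as a base model; the lowercase `stack` above does not.
clf5 = Stack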

vote = VotingClassifier(estimators=[('svc', clf1), ('rf', clf2), 
                                    ('xgb', clf3), ('kkn', clf4), 
                                    ('stack', clf5)]
                        , voting='soft')

In [12]:
for clf, name in zip([clf1, clf2, clf3, clf4, clf5, vote], 
                      ['SVC', 'Random Forest', 'XGBoost', 'KNeighbors',
                       'stack', 'Ensemble']):
  scores = cross_val_score(clf, features, trainset.label, 
                           scoring='roc_auc', cv=10)
  print("AUC score: %0.5f (+/- %0.5f) [%s]" % (scores.mean(), scores.std(), name))

AUC score: 0.96253 (+/- 0.01182) [SVC]
AUC score: 0.93205 (+/- 0.01971) [Random Forest]
AUC score: 0.94602 (+/- 0.02002) [XGBoost]
AUC score: 0.95938 (+/- 0.01119) [KNeighbors]
AUC score: 0.97149 (+/- 0.01084) [stack]
AUC score: 0.97042 (+/- 0.01146) [Ensemble]


In [13]:
votingClass = VotingClassifier(estimators=[('svc', clf1), ('rf', clf2), 
                                    ('xgb', clf3), ('kkn', clf4), 
                                    ('stack', clf5)], 
                               voting='soft', weights=[2,1,1,2,2])

for clf, name in zip([clf1, clf2, clf3, clf4, clf5, votingClass], 
                      ['SVC', 'Random Forest', 'XGBoost', 'KNeighbors'
                      , 'stack', 'Ensemble']):
  scores = cross_val_score(clf, features, trainset.label, 
                           scoring='roc_auc', cv=10)
  print("AUC score: %0.5f (+/- %0.5f) [%s]" % (scores.mean(), scores.std(), name))

AUC score: 0.96253 (+/- 0.01182) [SVC]
AUC score: 0.93205 (+/- 0.01971) [Random Forest]
AUC score: 0.94602 (+/- 0.02002) [XGBoost]
AUC score: 0.95938 (+/- 0.01119) [KNeighbors]
AUC score: 0.97152 (+/- 0.01082) [stack]
AUC score: 0.97160 (+/- 0.01114) [Ensemble]


Now I will search over a few candidate weight combinations:

In [16]:
from sklearn.model_selection import GridSearchCV

params = {'weights':[[2,1,1,2,2],[3,1,2,3,3],[3,1,2,2,3],[2,1,1,2,3]]}
grid_Search = GridSearchCV(param_grid = params, estimator=vote, scoring='roc_auc', cv=10)
grid_Search.fit(features, trainset.label)
grid_Search.best_params_
grid_Search.best_score_

0.9717210607944555
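The cell above only echoes its last expression, so the best weights themselves are not shown. As a hedged alternative, the sketch below runs the same kind of search as an explicit loop, which reports the score of every candidate. The synthetic data, the light models and the candidate weights are assumptions chosen to keep it fast:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and small models (assumed for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ],
    voting="soft",
)

candidate_weights = [[1, 1, 1], [2, 1, 1], [1, 2, 1], [1, 1, 2]]

rows = []
for w in candidate_weights:
    voter.set_params(weights=w)
    scores = cross_val_score(voter, X, y, scoring="roc_auc", cv=5)
    rows.append({"weights": w, "mean_auc": scores.mean(), "std_auc": scores.std()})

# Sort so the best weight combination comes first.
results = pd.DataFrame(rows).sort_values("mean_auc", ascending=False)
print(results)
best_weights = results.iloc[0]["weights"]
print(best_weights)
```

Comparing the standard deviation with the gaps between the mean scores helps judge whether a difference between weight settings is meaningful.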

In [0]:
model = grid_Search.best_estimator_
score = cross_val_score(model, features, trainset.label,scoring='roc_auc', cv= 10).mean()
print('AUC score for trainset: '+ str(score))

Well, this notebook took a lot of patience. Still, these results were not as good as the one I eventually obtained in the 9th notebook. Even so, stacking seems to be a very good technique.