# Exercise 9

Run the individual classifiers from the previous exercise to make predictions on
the validation set, and create a new training set with the resulting predictions:
each training instance is a vector containing the set of predictions from all your
classifiers for an image, and the target is the image’s class. Train a classifier on
this new training set. Congratulations, you have just trained a blender, and
together with the classifiers they form a stacking ensemble! Now let’s evaluate the
ensemble on the test set. For each image in the test set, make predictions with all
your classifiers, then feed the predictions to the blender to get the ensemble’s predictions.
How does it compare to the voting classifier you trained earlier?
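
The procedure described above can be sketched end to end on a small dataset. This is only an illustration under stated assumptions: it uses scikit-learn's bundled `load_digits` images instead of MNIST, three arbitrarily chosen base classifiers instead of the saved models from the previous exercise, and hypothetical names (`stack_predictions`, `toy_blender`, `X_tr`, `X_va`, `X_te`) that do not appear in the rest of this notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Small stand-in for MNIST: 1,797 8x8 digit images bundled with scikit-learn
X, y = load_digits(return_X_y=True)
idx = np.random.RandomState(0).permutation(len(X))
X, y = X[idx], y[idx]
X_tr, X_va, X_te = X[:1000], X[1000:1400], X[1400:]
y_tr, y_va, y_te = y[:1000], y[1000:1400], y[1400:]

# 1. Fit the base classifiers on the training set only
base_clfs = [
    LogisticRegression(max_iter=5000),
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(),
]
for clf in base_clfs:
    clf.fit(X_tr, y_tr)


def stack_predictions(clfs, X):
    """Return one row per image and one column per base classifier."""
    return np.column_stack([clf.predict(X) for clf in clfs])


# 2. Train the blender on the base classifiers' validation-set predictions
toy_blender = RandomForestClassifier(n_estimators=50, random_state=0)
toy_blender.fit(stack_predictions(base_clfs, X_va), y_va)

# 3. Evaluate the whole ensemble on the untouched test set
ensemble_pred = toy_blender.predict(stack_predictions(base_clfs, X_te))
ensemble_acc = accuracy_score(y_te, ensemble_pred)
print("Toy stacking accuracy: {:.4f}".format(ensemble_acc))
```

The blender never sees the training set: it is fitted on predictions for held-out images, so it learns how much to trust each base classifier on data they were not trained on.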

In [7]:
import time
import warnings
import numpy as np
warnings.filterwarnings('ignore')
import winsound  # Windows-only standard library module; used to beep when a long cell finishes
duration = 1000  # milliseconds
freq = 440  # Hz
#winsound.Beep(freq, duration)

In [2]:
from sklearn.datasets import fetch_openml
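# as_frame=False returns NumPy arrays; recent scikit-learn versions return a pandas DataFrame by default
mnist = fetch_openml("mnist_784", version=1, as_frame=False)  # load dataset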

In [3]:
tr, vl, ts = 50000, 60000, 70000  # end indices of the train / validation / test splits
X, y = mnist["data"], mnist["target"]
X_train, X_val, X_test = X[:tr], X[tr:vl], X[vl:ts]
y_train, y_val, y_test = y[:tr], y[tr:vl], y[vl:ts]

In [4]:
# load models
import joblib  # sklearn.externals.joblib was deprecated and later removed from scikit-learn
logreg_clf = joblib.load("logregMNIST.pkl")
rf_clf = joblib.load("rfMNIST.pkl")
xgb_clf = joblib.load("xgbMNIST.pkl")

Run the models on the validation set and store their predictions. These predictions are the input features for our blender model.
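
As a quick illustration of what "predictions as features" means, the sketch below stacks three made-up prediction vectors column-wise. The arrays `toy_lr`, `toy_rf`, and `toy_xgb` are hypothetical stand-ins for real classifier outputs, not values from this notebook.

```python
import numpy as np

# Made-up predictions from three classifiers for the same four images
toy_lr = np.array([7, 2, 1, 0])
toy_rf = np.array([7, 2, 7, 0])
toy_xgb = np.array([7, 3, 1, 0])

# One row per image, one column per classifier
toy_stack = np.column_stack([toy_lr, toy_rf, toy_xgb])
print(toy_stack.shape)  # (4, 3)
print(toy_stack[2])     # [1 7 1]: the three predictions for the third image
```

For 1-D inputs, `np.c_[toy_lr, toy_rf, toy_xgb]` builds the same matrix.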

In [8]:
time0 = time.time()
X_lr = logreg_clf.predict(X_val)
X_rf = rf_clf.predict(X_val)
X_xgb = xgb_clf.predict(X_val)
time1 = time.time()
winsound.Beep(freq, duration)
print("Pred time: {}".format(time1 - time0))


Pred time: 0.5414106845855713


In [14]:
X_stack = np.c_[X_lr, X_rf, X_xgb]
# let's use a random forest as a blender
from sklearn.ensemble import RandomForestClassifier
blender = RandomForestClassifier()
blender.fit(X_stack, y_val)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [41]:
class StackingClassifier():
    """
    class for stacking classifier:
    arguments: classifiers (list of already fitted scikit-learn models), blender (unfitted scikit-learn model)
    methods: process, fit, predict
    """
    def __init__(self, classifiers, blender):
        self.clf = classifiers
        self.bl = blender
    def process(self, X):
        """
        This method turns the features X into X_stack, the predictions of the stacked classifiers
        """
        # gather predictions from the stacked models: one row per sample, one column per model
        return np.column_stack([clf.predict(X) for clf in self.clf])
    def fit(self, X, y):
        """
        trains the blender model (the stacked classifiers are not refitted)
        """
        # gather predictions from stacked models
        X_stack = self.process(X)
        # train the blender
        self.bl.fit(X_stack, y)
    def predict(self, X):
        """
        returns prediction from the blender model
        """
        # process new features X into predictions of the stacked models
        X_stack = self.process(X)
        return self.bl.predict(X_stack)

In [42]:
stack_clf = StackingClassifier([logreg_clf, rf_clf, xgb_clf], RandomForestClassifier())
stack_clf.fit(X_val, y_val)

In [43]:
preds = stack_clf.predict(X_test)

In [48]:
from sklearn.metrics import accuracy_score
print("Stacking accuracy: {}".format(accuracy_score(y_test, preds)))
print("Log Reg accuracy: {}".format(accuracy_score(y_test, logreg_clf.predict(X_test))))
print("RF accuracy: {}".format(accuracy_score(y_test, rf_clf.predict(X_test))))
print("xgboost accuracy: {}".format(accuracy_score(y_test, xgb_clf.predict(X_test))))

Stacking accuracy: 0.9288
Log Reg accuracy: 0.9122
RF accuracy: 0.8703
xgboost accuracy: 0.9372


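XGBoost is still king: on its own it reaches 0.9372 accuracy on the test set, ahead of the stacking ensemble (0.9288), logistic regression (0.9122), and the random forest (0.8703).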
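
The exercise also asks how the ensemble compares to a voting classifier, which is not evaluated above. One possible baseline is a plain hard vote over the three prediction arrays. The helper `hard_vote` below is a hypothetical sketch, not the voting classifier from the previous exercise, and it is shown on made-up labels only.

```python
from collections import Counter


def hard_vote(*prediction_lists):
    """Majority vote per sample; on a tie, the tied label seen first wins."""
    voted = []
    for votes in zip(*prediction_lists):
        # most_common(1) returns the label with the highest vote count
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Made-up predictions from three classifiers for three images
toy_votes = hard_vote(["7", "2", "1"], ["7", "3", "4"], ["1", "3", "9"])
print(toy_votes)  # ['7', '3', '1']
```

Assuming the models loaded earlier, passing `logreg_clf.predict(X_test)`, `rf_clf.predict(X_test)`, and `xgb_clf.predict(X_test)` to `hard_vote` and scoring the result with `accuracy_score(y_test, ...)` would give a number to set beside the accuracies printed above.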