# Exercise 8
Load the MNIST data (introduced in Chapter 3), and split it into a training set, a
validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation,
and 10,000 for testing). Then train various classifiers, such as a Random
Forest classifier, an Extra-Trees classifier, and an SVM. Next, try to combine
them into an ensemble that outperforms them all on the validation set, using a
soft or hard voting classifier. Once you have found one, try it on the test set. How
much better does it perform compared to the individual classifiers?

https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python

In [1]:
from sklearn.datasets import fetch_openml
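# as_frame = False returns NumPy arrays, which the positional indexing below relies on
mnist = fetch_openml("mnist_784", version = 1, as_frame = False) # divided into ["data"] and ["target"]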

Split the MNIST dataset into training, validation, and test sets:

In [2]:
tr, vl, ts = 50000, 60000, 70000  # end indices of the training, validation, and test sets
# contiguous slices: first 50,000 for training, next 10,000 for validation, last 10,000 for testing
X_train, X_val, X_test = mnist["data"][:tr], mnist["data"][tr:vl], mnist["data"][vl:ts]
y_train, y_val, y_test = mnist["target"][:tr], mnist["target"][tr:vl], mnist["target"][vl:ts]

In [21]:
import winsound  # Windows-only; used to beep when a long training run finishes
duration = 1000  # milliseconds
freq = 440  # Hz
#winsound.Beep(freq, duration)

Train classifiers: Logistic Regression, Random Forest, Boosted Trees, Extra-Trees

In [59]:
# imports
import time
import warnings
warnings.filterwarnings('ignore')
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import os.path
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

## Logistic Regression

We fit a logistic regression model, using stochastic gradient descent to learn the weights. The features are standardized first.

In [67]:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
if not os.path.isfile("logregMNIST.pkl"):
    # the logistic loss is called "log_loss" since scikit-learn 1.1 (it was "log" before)
    logreg_clf = Pipeline([("scale", StandardScaler()),
                        ("logregd_clf", SGDClassifier(loss = "log_loss"))])

    time0 = time.time()
    logreg_clf.fit(X_train, y_train)
    time1 = time.time()
    print("Training time: {}".format(time1 - time0))
    winsound.Beep(freq, duration)
else:
    logreg_clf = joblib.load("logregMNIST.pkl")

In [61]:
# training metrics
time0 = time.time()
y_tr_pred = logreg_clf.predict(X_train)
time1 = time.time()
print("Pred time: {}".format(time1 - time0))

Pred time: 0.40926337242126465


In [62]:
def print_metrics(y_true, y_pred):
    # With one label per instance, micro-averaged f1, recall, and precision
    # all equal accuracy, so the four printed values coincide.
    # accuracy
    print("Accuracy: {}".format(accuracy_score(y_true, y_pred)))
    # f1
    print("f1 score: {}".format(f1_score(y_true, y_pred, average = "micro")))
    # recall
    print("Recall {}".format(recall_score(y_true, y_pred, average = "micro")))
    # precision
    print("Precision: {}".format(precision_score(y_true, y_pred, average = "micro")))
    
print_metrics(y_train, y_tr_pred)

Accuracy: 0.91536
f1 score: 0.91536
Recall 0.91536
Precision: 0.91536
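
The four values above are identical, which is expected with micro-averaging: when every instance has exactly one label, each misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled precision and recall both reduce to accuracy. Below is a minimal pure-Python sketch of that bookkeeping. The helper name `micro_scores` and the toy labels are hypothetical and are not part of the notebook or of scikit-learn.

```python
from collections import Counter

def micro_scores(y_true, y_pred):
    """Return (accuracy, micro-precision, micro-recall) for single-label data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction: a true positive for class t
        else:
            fp[p] += 1      # the predicted class gains a false positive
            fn[t] += 1      # the true class gains a false negative
    tp_sum = sum(tp.values())
    fp_sum = sum(fp.values())
    fn_sum = sum(fn.values())
    accuracy = tp_sum / len(y_true)
    precision = tp_sum / (tp_sum + fp_sum)   # pooled over all classes
    recall = tp_sum / (tp_sum + fn_sum)      # pooled over all classes
    return accuracy, precision, recall

# Toy labels (hypothetical, not MNIST): 4 of 6 predictions are correct.
toy_true = ["0", "1", "2", "2", "1", "0"]
toy_pred = ["0", "2", "2", "2", "1", "1"]
acc, prec, rec = micro_scores(toy_true, toy_pred)
print(acc, prec, rec)
```

To see per-class differences, a macro average (the unweighted mean of per-class scores) would be the usual alternative.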


## Random Forest

In [69]:
from sklearn.ensemble import RandomForestClassifier
if not os.path.isfile("rfMNIST.pkl"):
    rf_clf = RandomForestClassifier(n_estimators = 200, max_depth = 5, n_jobs = 6, random_state = 1, oob_score = True)
    time0 = time.time()
    rf_clf.fit(X_train, y_train)
    time1 = time.time()
    print("Train time: {}".format(time1 - time0))
else:
    rf_clf = joblib.load("rfMNIST.pkl")

In [70]:
time0 = time.time()
y_tr_pred = rf_clf.predict(X_train)
time1 = time.time()
print("Pred time: {}".format(time1 - time0))

Pred time: 1.1285202503204346


In [71]:
print_metrics(y_train, y_tr_pred)

Accuracy: 0.86314
f1 score: 0.86314
Recall 0.86314
Precision: 0.86314


In [72]:
rf_clf.oob_score_

0.85326

## Boosted Trees


In [73]:
import xgboost
if not os.path.isfile("xgbMNIST.pkl"):
    xgb_clf = xgboost.XGBClassifier(objective='multi:softmax', n_jobs = 6)
    time0 = time.time()
    xgb_clf.fit(X_train, y_train)
    time1 = time.time()
    winsound.Beep(freq, duration)
    print("Train time: {}".format(time1 - time0))
else:
    xgb_clf = joblib.load("xgbMNIST.pkl")

In [74]:
time0 = time.time()
y_tr_pred = xgb_clf.predict(X_train)
time1 = time.time()
print("Pred time: {}".format(time1 - time0))

Pred time: 1.07338547706604


In [75]:
print_metrics(y_train, y_tr_pred)

Accuracy: 0.94448
f1 score: 0.94448
Recall 0.94448
Precision: 0.94448


## Extra-Trees classifier

In [93]:
# Note: sklearn.tree.ExtraTreeClassifier is a single extremely randomized tree;
# the Extra-Trees ensemble is sklearn.ensemble.ExtraTreesClassifier.
from sklearn.tree import ExtraTreeClassifier
if not os.path.isfile("etMNIST.pkl"):
    extra_clf = ExtraTreeClassifier()
    time0 = time.time()
    extra_clf.fit(X_train, y_train)
    time1 = time.time()
    winsound.Beep(freq, duration)
    print("Train time: {}".format(time1 - time0))
else:
    extra_clf = joblib.load("etMNIST.pkl")

Train time: 0.6375987529754639


Save the four models

In [94]:
joblib.dump(logreg_clf, "logregMNIST.pkl")
joblib.dump(rf_clf, "rfMNIST.pkl")
joblib.dump(xgb_clf, "xgbMNIST.pkl")
joblib.dump(extra_clf, "etMNIST.pkl")

['etMNIST.pkl']
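
The load-if-cached, otherwise-train pattern repeated in the cells above could be factored into one helper. The sketch below is only an illustration under stated assumptions: the name `load_or_train` is hypothetical, and it uses the standard-library `pickle` module so that it runs without extra dependencies, whereas the notebook itself uses `joblib.dump` and `joblib.load`, which could be swapped in.

```python
import os
import pickle
import tempfile
import time

def load_or_train(path, train_fn):
    """Return the object cached at `path`, or call `train_fn()`, cache its result, and return it."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    t0 = time.time()
    model = train_fn()                      # train_fn must return the fitted model
    print("Train time: {}".format(time.time() - t0))
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model

# Toy demonstration with a plain dict standing in for a fitted model.
calls = []

def toy_train():
    calls.append(1)                         # record that "training" actually ran
    return {"majority_label": "b"}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.pkl")
    first = load_or_train(toy_path, toy_train)    # trains and writes the cache
    second = load_or_train(toy_path, toy_train)   # reads the cache, no retraining

print(first == second, len(calls))
```

In the notebook this might be used as, for example, `rf_clf = load_or_train("rfMNIST.pkl", lambda: RandomForestClassifier(...).fit(X_train, y_train))`, since `fit` returns the fitted estimator.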

## Voting Classifier

I left the XGBoost model out of the ensemble because its performance is a bit too good.

In [97]:
if not os.path.isfile("voting.pkl"):
    voting_clf = VotingClassifier(estimators = [("extra", extra_clf), ("rf", rf_clf)],
                                  voting = "hard")
    time0 = time.time()
    voting_clf.fit(X_train, y_train)  # fit() returns the estimator, not predictions
    time1 = time.time()
    winsound.Beep(freq, duration)
    print("Train time: {}".format(time1 - time0))
else:
    voting_clf = joblib.load("voting.pkl")

Train time: 25.1208975315094


In [88]:
time0 = time.time()
y_tr_pred = voting_clf.predict(X_train)
time1 = time.time()
print("Pred time: {}".format(time1 - time0))

Pred time: 1.9706382751464844


In [98]:
for clf in (extra_clf, rf_clf, voting_clf):
    y_pred = clf.predict(X_val)
    print(clf.__class__.__name__, accuracy_score(y_val, y_pred))

ExtraTreeClassifier 0.8227
RandomForestClassifier 0.877
VotingClassifier 0.8435


Voting is worse than the best single model: the ensemble reaches 0.8435 on the validation set, below the random forest's 0.877 ¯\\_(ツ)_/¯.
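
One likely contributor to this result is that the ensemble has only two voters. With hard voting, every disagreement between two models is a tie, and scikit-learn documents that ties are resolved by ascending sort order of the class labels, so the outcome does not favor the stronger model. The pure-Python sketch below illustrates that rule and contrasts it with soft voting, which averages class probabilities. The helper names `hard_vote` and `soft_vote` are hypothetical simplifications, not scikit-learn's implementation, and the probabilities are made up for illustration.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over predicted labels; ties go to the smallest label."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)

def soft_vote(probabilities):
    """Average each voter's class probabilities and return the best class index."""
    n_voters = len(probabilities)
    n_classes = len(probabilities[0])
    avg = [sum(p[c] for p in probabilities) / n_voters for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])

# Two voters that disagree: the tie is broken by label order, whoever is right.
print(hard_vote([7, 1]))      # 1
print(hard_vote([1, 7]))      # 1

# A third voter turns ties into real majorities.
print(hard_vote([7, 1, 7]))   # 7

# Soft voting lets a confident voter outweigh a hesitant one.
print(soft_vote([[0.1, 0.9], [0.6, 0.4]]))   # 1, since the averages are [0.35, 0.65]
```

Adding a third model to the ensemble, or switching to soft voting with estimators that provide `predict_proba`, are the two options this suggests trying on the validation set.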