### 7: Random Forests!


First, let's get the _MNIST_ dataset.

In [1]:
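import numpy as np
from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays rather than a pandas DataFrame
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
# The labels arrive as strings, so convert them to integers
mnist.target = mnist.target.astype(np.int64)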


In [10]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hold out 10,000 images for the final test set...
X_tmp, X_test, y_tmp, y_test = train_test_split(
    mnist.data,
    mnist.target,
    test_size=10000
)
# ...and another 10,000 for validation, leaving 50,000 for training
X_train, X_validate, y_train, y_validate = train_test_split(
    X_tmp,
    y_tmp,
    test_size=10000
)

models = [
    ('Random Forest', RandomForestClassifier()),
    ('Support Vector', LinearSVC()),
    ('Stochastic Gradient', SGDClassifier()),
]

# Train each model on its own (the SVM can take a while here...)
for name, model in models:
    print("Training {}...".format(name))
    model.fit(X_train, y_train)

# Score each model on the validation set
for name, model in models:
    print("{}:\t{}".format(name, model.score(X_validate, y_validate)))

# VotingClassifier fits fresh clones of the models, so all three are trained again
ensemble = VotingClassifier(models)
ensemble.fit(X_train, y_train)

ensemble.score(X_validate, y_validate)


Training Random Forest...
Training Support Vector...

KeyboardInterrupt: 
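With its default settings, `VotingClassifier` uses hard voting: each model predicts a class and the most common prediction wins, with ties going to the class that sorts first. The sketch below illustrates that idea in plain Python. The `hard_vote` helper and the three prediction lists are hypothetical, made up for illustration; this is not how scikit-learn implements it internally.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote across models; ties go to the smallest label."""
    voted = []
    # zip(*...) regroups the lists so each tuple holds one sample's votes
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(min(label for label, n in counts.items() if n == top))
    return voted


# Hypothetical predictions from three models on four samples
forest_preds = [3, 7, 1, 0]
svm_preds = [3, 2, 1, 9]
sgd_preds = [5, 7, 1, 4]

print(hard_vote([forest_preds, svm_preds, sgd_preds]))  # [3, 7, 1, 0]
```

On the last sample all three models disagree, so the tie-break picks the smallest label.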

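Great! Once training completes, compare the ensemble's validation score with the individual scores printed above; a voting ensemble typically does at least a little better than its members. Now let's see if we can do some stacking!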

First we build a matrix of predictions, one column per estimator.

In [9]:
# We can get at the fitted models again with the estimators_ attribute
model_predictions = np.column_stack([m.predict(X_validate)
    for m in ensemble.estimators_])

model_predictions


NameError: name 'ensemble' is not defined
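To make the layout of that matrix concrete, here is a tiny sketch using made-up predictions in place of the real models. The three arrays and the name `model_predictions_demo` are hypothetical; the point is only that each row is a sample and each column is one model's prediction.

```python
import numpy as np

# Hypothetical predictions from three models on four samples
forest_preds = np.array([3, 7, 1, 0])
svm_preds = np.array([3, 2, 1, 9])
sgd_preds = np.array([5, 7, 1, 4])

# Stack the 1-D prediction arrays side by side: one column per model
model_predictions_demo = np.column_stack([forest_preds, svm_preds, sgd_preds])

print(model_predictions_demo.shape)  # (4, 3)
print(model_predictions_demo[1])     # the three models' votes for sample 1
```

With the real data the matrix would have 10,000 rows (the validation set) and three columns.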

Now we can combine them with a random forest classifier, acting as a _blender_ trained on the predictions of the other models!

In [5]:
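# Train the blender on the validation-set predictions, not on the training set,
# so it learns from predictions the base models made on unseen data
random_blender = RandomForestClassifier(n_estimators=100, oob_score=True)
random_blender.fit(model_predictions, y_validate)

# Accuracy estimated from the samples each tree did not see during fitting
random_blender.oob_score_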


NameError: name 'model_predictions' is not defined
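The out-of-bag score is handy here because the validation set has already been used to train the blender, so there is no further held-out split to score it on short of the test set. Each tree in the forest is fit on a bootstrap sample, and the samples it never saw serve as a built-in check. Below is a minimal sketch of reading `oob_score_`, using scikit-learn's small built-in digits dataset as a stand-in so it runs quickly. The dataset choice, `random_state`, and the `*_demo` names are assumptions for illustration only.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# Small 8x8 digits dataset, used here only as a quick stand-in for MNIST
X_demo, y_demo = load_digits(return_X_y=True)

# oob_score=True asks the forest to score each sample using only the trees
# whose bootstrap sample did not include it
forest_demo = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest_demo.fit(X_demo, y_demo)

oob_accuracy = forest_demo.oob_score_
print(oob_accuracy)
```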

Okay, if the out-of-bag score looks good, how well does the blender perform on our test set?

In [7]:
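# Build the same one-column-per-model matrix, this time from the test set
test_predictions = np.column_stack([m.predict(X_test)
    for m in ensemble.estimators_])

# Let the blender make the final call from the base models' predictions
predictions = random_blender.predict(test_predictions)

accuracy_score(y_test, predictions)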

NameError: name 'ensemble' is not defined
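As a point of comparison, scikit-learn also provides `StackingClassifier`, which wires base models and a blender together in one estimator. It is not identical to the hand-rolled version above: it trains the final estimator on cross-validated predictions rather than on a separate hold-out set, and by default it feeds the blender probabilities or decision scores where available instead of hard class labels. The sketch below is a rough equivalent under those caveats, run on the small digits dataset so it finishes quickly. The dataset, the estimator settings, and all variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Small stand-in dataset; pixel values range from 0 to 16, so scale to [0, 1]
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ('forest', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('svc', LinearSVC(max_iter=5000, random_state=0)),
        ('sgd', SGDClassifier(random_state=0)),
    ],
    # The blender: a random forest trained on the base models' outputs
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    cv=3,  # cross-validation folds used to build the blender's training data
)
stack.fit(X_tr, y_tr)

stack_accuracy = stack.score(X_te, y_te)
print(stack_accuracy)
```

The manual approach keeps every step visible, which suits a walkthrough; the built-in estimator is mostly a convenience once the idea is clear.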

In [8]:
print(u'\U0001F64C')

🙌
