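Load the MNIST dataset (introduced in **8_sklearn做分类**) and split it into a training set, a validation set, and a test set (for example, 40,000 instances for training, 10,000 for validation, and the remaining 20,000 for testing). Then train several classifiers, such as a random forest classifier, an extra-trees classifier, and an SVM. Next, try to combine them into an ensemble, using soft or hard voting, that outperforms each individual classifier on the validation set. Once you have found such an ensemble, evaluate it on the test set. How much better does it perform than the individual classifiers?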
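To make the difference between the two voting schemes concrete, here is a minimal sketch in plain Python. The helper names `hard_vote` and `soft_vote` and the example predictions are hypothetical and only illustrate the idea; they are not part of scikit-learn. Note that soft voting needs class probabilities, which `LinearSVC` does not provide (it has no `predict_proba`).

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label predicted by the most classifiers (majority vote)."""
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average the per-class probabilities and return the index of the largest mean."""
    n_classes = len(probabilities[0])
    n_clfs = len(probabilities)
    avg = [sum(p[k] for p in probabilities) / n_clfs for k in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Hypothetical outputs of three classifiers for a single image (3 classes).
labels = [1, 1, 2]                      # predicted class per classifier
probas = [[0.10, 0.50, 0.40],           # classifier 1 leans slightly towards class 1
          [0.20, 0.45, 0.35],           # classifier 2 leans slightly towards class 1
          [0.00, 0.10, 0.90]]           # classifier 3 is very confident in class 2

print("hard vote:", hard_vote(labels))  # two of three classifiers say class 1
print("soft vote:", soft_vote(probas))  # the confident classifier tips the average to class 2
```

The two schemes can disagree, as they do here: hard voting counts heads, while soft voting gives more weight to confident classifiers.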

In [1]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False, parser="auto")
mnist.data[:5]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [2]:
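X = mnist["data"]
y = mnist["target"]
# Note: this run uses a much smaller split than the exercise suggests:
# 4,000 training, 1,000 validation, and the remaining 65,000 test instances.
X_train, X_val, X_test = X[:4000], X[4000:5000], X[5000:]
y_train, y_val, y_test = y[:4000], y[4000:5000], y[5000:]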
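
For reference, the 40,000 / 10,000 / 20,000 split mentioned in the exercise could be written as below. This is only a sketch: `split_train_val_test` is a hypothetical helper, and the arrays are placeholders with MNIST's number of rows rather than the real data.

```python
import numpy as np

def split_train_val_test(X, y, n_train, n_val):
    """Slice X and y into consecutive train / validation / test parts."""
    n_end_val = n_train + n_val
    return (X[:n_train], X[n_train:n_end_val], X[n_end_val:],
            y[:n_train], y[n_train:n_end_val], y[n_end_val:])

# Placeholder arrays with 70,000 rows (the size of MNIST); not real data.
X_demo = np.arange(70_000).reshape(-1, 1)
y_demo = np.zeros(70_000, dtype=np.uint8)

parts = split_train_val_test(X_demo, y_demo, n_train=40_000, n_val=10_000)
sizes = [len(p) for p in parts]
print(sizes)
```

Training on 40,000 instances instead of 4,000 takes noticeably longer, which may be why the run above uses the smaller split.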

In [14]:
from sklearn.svm import LinearSVC
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, ExtraTreesClassifier

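# voting defaults to "hard" (majority vote). Soft voting is not available
# with this estimator list because LinearSVC has no predict_proba.
voting_clf = VotingClassifier(estimators=[
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('ef', ExtraTreesClassifier(n_estimators=100, random_state=42)),
    ('svc', LinearSVC(C=0.01, max_iter=1000000, random_state=42))
])
voting_clf.fit(X_train, y_train)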

In [15]:
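# The fitted sub-estimators were trained on label-encoded targets (0-9),
# so they are scored against integer labels. The ensemble itself accepts
# the original string labels.
for name, clf in voting_clf.named_estimators_.items():
    print(name,
          " val accuracy:", clf.score(X_val, y_val.astype(int)),
          "train accuracy:", clf.score(X_train, y_train.astype(int)),
          "test accuracy =", clf.score(X_test, y_test.astype(int)))

print("voting_clf test accuracy =", voting_clf.score(X_test,y_test))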

rf  val accuracy: 0.928 train accuracy: 1.0 test accuracy = 0.9277846153846154
ef  val accuracy: 0.938 train accuracy: 1.0 test accuracy = 0.9362769230769231
svc  val accuracy: 0.809 train accuracy: 1.0 test accuracy = 0.8043538461538462
voting_clf test accuracy = 0.9316615384615384
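
The exercise asks for an ensemble that beats its members on the validation set, so it can help to score the ensemble on validation data and to try it without its weakest member. The sketch below shows one way to do that with `set_params(<name>="drop")` followed by a refit. It uses scikit-learn's small built-in `load_digits` dataset as a stand-in for MNIST (an assumption made so the example runs quickly), so its numbers are not comparable to the results above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in data: 8x8 digit images, not MNIST.
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_digits, y_digits, test_size=0.3, random_state=42)

demo_clf = VotingClassifier(estimators=[
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ("svc", LinearSVC(C=0.01, max_iter=10000, dual=True, random_state=42)),
])
demo_clf.fit(X_tr, y_tr)
full_val_acc = demo_clf.score(X_va, y_va)

# Drop the SVC and refit so that estimators_ reflects the change.
demo_clf.set_params(svc="drop")
demo_clf.fit(X_tr, y_tr)
reduced_val_acc = demo_clf.score(X_va, y_va)

print("val accuracy with svc:   ", full_val_acc)
print("val accuracy without svc:", reduced_val_acc)
```

Whether dropping an estimator helps depends on the data; the decision should be made on the validation set, and the test set used only once at the end.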


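Run the individual classifiers from the previous exercise to make predictions on the validation set, then create a new training set from the resulting predictions: each training instance is a vector containing the predictions of all the classifiers for one image, and the target is the image's class. Train a blender on this new training set. Together with the first-layer classifiers, it forms a stacking ensemble. Now evaluate this ensemble on the test set: for each image in the test set, make predictions with all the classifiers, then feed those predictions to the blender to get the ensemble's prediction. How does it compare to the voting classifier trained earlier? Now try again using a StackingClassifier. Do you get better performance? If so, why?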

In [5]:
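import numpy as np

# One column per first-layer classifier, one row per validation image.
# Each entry is the predicted class index (0-9).
X_val_predict = np.empty((len(X_val), len(voting_clf.estimators_)))
for i, clf in enumerate(voting_clf.estimators_):
    X_val_predict[:, i] = clf.predict(X_val)

# The blender is trained on the predictions, with the true labels as targets.
final_estimator = RandomForestClassifier(random_state=43)
final_estimator.fit(X_val_predict, y_val)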

In [6]:
# Build the same kind of prediction matrix for the test set, then score the blender on it.
X_test_predict = np.empty((len(X_test), len(voting_clf.estimators_)))
for i, clf in enumerate(voting_clf.estimators_):
    X_test_predict[:, i] = clf.predict(X_test)
final_estimator.score(X_test_predict, y_test)

0.9295384615384615

In [11]:
from sklearn.ensemble import StackingClassifier

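# Note: the base estimators use different hyperparameters from the voting
# ensemble above (300 trees limited to 32 leaf nodes, C=0.05), so the results
# are not directly comparable.
# With cv=5, the final estimator is trained on out-of-fold predictions of the
# base estimators, so no separate validation set is needed for the blender.
stacking_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, max_leaf_nodes=32, random_state=42)),
        ("et", ExtraTreesClassifier(n_estimators=300, max_leaf_nodes=32, random_state=42)),
        ("svc", LinearSVC(random_state=42, max_iter=1000000, C=0.05, dual=True)),
    ],
    final_estimator=RandomForestClassifier(random_state=43),
    cv=5
)
stacking_clf.fit(X_train, y_train)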

In [17]:
for name, clf in stacking_clf.named_estimators_.items():
    print(name,
          " val accuracy:", clf.score(X_val, y_val.astype(int)),
          "train accuracy:", clf.score(X_train, y_train.astype(int)),
          "test accuracy =", clf.score(X_test, y_test.astype(int)))

print("stacking_clf test accuracy =", stacking_clf.score(X_test,y_test))

rf  val accuracy: 0.871 train accuracy: 0.90125 test accuracy = 0.8633846153846154
et  val accuracy: 0.856 train accuracy: 0.88 test accuracy = 0.8492615384615385
svc  val accuracy: 0.805 train accuracy: 1.0 test accuracy = 0.7995384615384615
stacking_clf test accuracy = 0.9048923076923077
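
To illustrate what the `cv` argument of `StackingClassifier` does, the sketch below builds the blender's training set by hand from out-of-fold predictions with `cross_val_predict`. It is an approximation of the idea, not a reproduction of scikit-learn's internals. It assumes the small built-in `load_digits` dataset as a stand-in for MNIST and uses class probabilities as blender features (two tree ensembles only, since `LinearSVC` has no `predict_proba`), so its accuracy is not comparable to the numbers above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

# Stand-in data: 8x8 digit images, not MNIST.
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.3, random_state=42)

base_clfs = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    ExtraTreesClassifier(n_estimators=50, random_state=42),
]

# Out-of-fold probabilities: every training row is predicted by a model
# that did not see that row during fitting.
oof_features = np.hstack([
    cross_val_predict(clf, X_tr, y_tr, cv=5, method="predict_proba")
    for clf in base_clfs
])
blender = RandomForestClassifier(n_estimators=50, random_state=43)
blender.fit(oof_features, y_tr)

# For inference, the base classifiers are refit on the full training set.
for clf in base_clfs:
    clf.fit(X_tr, y_tr)
test_features = np.hstack([clf.predict_proba(X_te) for clf in base_clfs])
stack_test_acc = blender.score(test_features, y_te)
print("hand-built stacking test accuracy:", stack_test_acc)
```

Compared with the blender trained on a held-out validation set, this approach lets the blender learn from every training instance and gives it class probabilities rather than hard labels.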
