<a href="https://colab.research.google.com/github/PenLoo98/Python-Practice/blob/main/%EA%B8%B0%EA%B3%84%ED%95%99%EC%8A%B5%ED%94%84%EB%A1%9C%EA%B7%B8%EB%9E%98%EB%B0%8D_9%EC%A3%BC%EC%B0%A8_%EC%8B%A4%EC%8A%B5%EA%B3%BC%EC%A0%9C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [24]:
import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
mnist.target = mnist.target.astype(np.uint8)

Load the MNIST dataset.

In [25]:
from sklearn.model_selection import train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)

Split the loaded dataset into training, validation, and test sets (50,000 / 10,000 / 10,000 samples).
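Because `test_size` is passed as an integer, each call holds out exactly that many samples. As a rough illustration, the sketch below repeats the same two-stage split on a small placeholder array; the 700-sample array and the hold-out size of 100 are assumptions chosen to keep the example fast, not the MNIST data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for MNIST (assumption: 700 samples, 4 features).
X = np.arange(700 * 4).reshape(700, 4)
y = np.arange(700) % 10

# First hold out the test set, then carve the validation set out of the rest.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=100, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=100, random_state=42)

split_sizes = (len(X_train), len(X_val), len(X_test))
print(split_sizes)
```

Splitting off the test set first means it is never touched while models are compared on the validation set.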

In [54]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, random_state=42)
mlp_clf = MLPClassifier(random_state=42)

estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(random_state=42)
Training the ExtraTreesClassifier(random_state=42)
Training the LinearSVC(max_iter=100, random_state=42, tol=20)
Training the MLPClassifier(random_state=42)


In [55]:
[estimator.score(X_val, y_val) for estimator in estimators]

[0.9692, 0.9715, 0.859, 0.9639]

[Training results]  
Random forest: 0.9692  
Extra-trees: 0.9715  
Support vector machine: 0.859  
MLP: 0.9639



In [56]:
from sklearn.ensemble import VotingClassifier

named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
]
voting_clf = VotingClassifier(named_estimators)
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(random_state=42)),
                             ('extra_trees_clf',
                              ExtraTreesClassifier(random_state=42)),
                             ('svm_clf',
                              LinearSVC(max_iter=100, random_state=42, tol=20)),
                             ('mlp_clf', MLPClassifier(random_state=42))])

In [57]:
voting_clf.score(X_val, y_val)

0.9708

Ensemble result: 0.9708

In [58]:
[estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]

[0.9692, 0.9715, 0.859, 0.9639]

[Training results]  
Random forest: 0.9692  
Extra-trees: 0.9715  
Support vector machine: 0.859  
MLP: 0.9639

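[Analysis]  
Extra-trees > ensemble > random forest > MLP > support vector machine,  
in order of validation accuracy.  
Compared with every individual classifier except extra-trees, the ensemble improved accuracy
by about 0.16 to 11.2 percentage points (roughly 0.2% to 13% in relative terms).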
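The difference between percentage points and relative improvement is easy to blur, so here is a small sketch that derives both from the validation accuracies reported above. The dictionary keys are illustrative labels, not identifiers from the notebook.

```python
# Validation accuracies reported in the notebook output.
scores = {
    "random_forest": 0.9692,
    "extra_trees": 0.9715,
    "svm": 0.859,
    "mlp": 0.9639,
}
ensemble = 0.9708

gains = {}
for name, score in scores.items():
    absolute = (ensemble - score) * 100          # gain in percentage points
    relative = (ensemble - score) / score * 100  # gain as a percent of the base score
    gains[name] = (round(absolute, 2), round(relative, 2))
    print(f"{name}: {absolute:+.2f} points, {relative:+.2f}% relative")
```

A negative value means the individual classifier beat the ensemble, which is the case for extra-trees here.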



In [59]:
# Remove the fitted SVM (index 2) from the trained estimators without refitting.
del voting_clf.estimators_[2]
voting_clf.score(X_val, y_val)

0.9736

After removing the SVM, the weakest classifier,  
the ensemble's validation accuracy rose from 0.9708 to 0.9736, a gain of about 0.3 percentage points.
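Deleting from `estimators_` edits the fitted list in place and avoids retraining. As an alternative sketch, scikit-learn also lets you mark a named estimator as `"drop"` through `set_params` and then refit. The example below assumes the small bundled `load_digits` dataset and reduced tree counts purely to keep it quick; it is not the notebook's MNIST setup.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Small bundled dataset used only to keep the sketch fast (assumption, not MNIST).
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier([
    ("random_forest_clf", RandomForestClassifier(n_estimators=20, random_state=42)),
    ("extra_trees_clf", ExtraTreesClassifier(n_estimators=20, random_state=42)),
    ("svm_clf", LinearSVC(random_state=42)),
])
voting_clf.fit(X_train, y_train)
n_before = len(voting_clf.estimators_)

# Mark the SVM as dropped, then refit so estimators_ is rebuilt without it.
voting_clf.set_params(svm_clf="drop")
voting_clf.fit(X_train, y_train)
n_after = len(voting_clf.estimators_)

print(n_before, n_after)
print(voting_clf.score(X_val, y_val))
```

The trade-off is that refitting retrains the remaining estimators, which is slower than the in-place deletion but keeps the classifier's parameters and fitted state consistent.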

In [60]:
voting_clf.voting = "soft"
voting_clf.score(X_val, y_val)

0.97

In [61]:
voting_clf.voting = "hard"
voting_clf.score(X_test, y_test)

0.9704

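On the validation set, hard voting (0.9736) scored slightly higher than soft voting (0.97),  
so hard voting was kept and evaluated on the test set, where it reached 0.9704.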
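To make the two voting rules concrete, here is a minimal NumPy sketch for a single sample. The probability values are made up for illustration and do not come from the classifiers trained above.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one sample
# (rows: classifiers, columns: classes 0, 1, 2). Values are made up.
probas = np.array([
    [0.45, 0.55, 0.00],
    [0.40, 0.60, 0.00],
    [0.95, 0.05, 0.00],
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = int(np.bincount(votes, minlength=probas.shape[1]).argmax())

# Soft voting: average the probabilities, then pick the class with the largest mean.
soft_winner = int(probas.mean(axis=0).argmax())

print(hard_winner, soft_winner)
```

Two classifiers narrowly prefer class 1, so hard voting picks it, while one very confident classifier pulls the averaged probabilities toward class 0 under soft voting. Which rule does better depends on how well calibrated the individual probabilities are.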

In [62]:
[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

[0.9645, 0.9691, 0.9604]

In [63]:
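# 9. Stacking ensemble
# Build the blender's training set: one column of validation predictions per base classifier.
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)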

In [64]:
X_val_predictions

array([[5., 5., 5., 5.],
       [8., 8., 8., 8.],
       [2., 2., 3., 2.],
       ...,
       [7., 7., 7., 7.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]], dtype=float32)

In [65]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_predictions, y_val)

RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)

In [66]:
rnd_forest_blender.oob_score_

0.9684

In [67]:
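from sklearn.metrics import accuracy_score

# Collect each base classifier's test-set predictions, one column per estimator.
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

# The blender makes the final prediction from the base predictions.
y_pred = rnd_forest_blender.predict(X_test_predictions)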

In [68]:
accuracy_score(y_test, y_pred)

0.9671

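On the test set, the stacking ensemble (0.9671) scored about 0.3 percentage points  
below the voting classifier built earlier (0.9704).  
The blender's out-of-bag score (0.9684) was likewise about 0.5 points below the voting classifier's validation accuracy (0.9736).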
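The blender above is trained on hard class predictions from a single hold-out set. For comparison, here is a minimal sketch using scikit-learn's built-in `StackingClassifier`, which trains the final estimator on cross-validated predictions instead. The dataset (`load_digits`), the tree counts, and `cv=3` are illustrative assumptions, not the notebook's configuration, so the score is not comparable to the MNIST results.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.model_selection import train_test_split

# Small bundled dataset used only to keep the sketch fast (assumption, not MNIST).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[
        ("random_forest_clf", RandomForestClassifier(n_estimators=20, random_state=42)),
        ("extra_trees_clf", ExtraTreesClassifier(n_estimators=20, random_state=42)),
    ],
    # The final estimator plays the role of the blender.
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    cv=3,
)
stacking_clf.fit(X_train, y_train)
stacking_accuracy = stacking_clf.score(X_test, y_test)
print(stacking_accuracy)
```

Because the final estimator sees out-of-fold predictions, no separate validation set has to be reserved for the blender, at the cost of fitting each base estimator several times.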
