## Week 6: Individual Assignment

In [25]:
import numpy as np
import pandas as pd

8. Load the MNIST data (introduced in Chapter 3), and split it into a training set, a
validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random
Forest classifier, an Extra-Trees classifier, and an SVM. Next, try to combine
them into an ensemble that outperforms them all on the validation set, using a
soft or hard voting classifier. Once you have found one, try it on the test set. How
much better does it perform compared to the individual classifiers?
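Before running the exercise, it may help to see how hard and soft voting can disagree. The following is a minimal, dependency-free sketch; the probabilities and the helper names `hard_vote` and `soft_vote` are made up for illustration and are not part of scikit-learn.

```python
from collections import Counter

def hard_vote(labels):
    """Return the most common predicted label (majority vote)."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probas):
    """Average per-class probabilities across classifiers and return the top class."""
    classes = probas[0].keys()
    mean = {c: sum(p[c] for p in probas) / len(probas) for c in classes}
    return max(mean, key=mean.get)

# Hypothetical class probabilities from three classifiers for one image.
probas = [
    {"3": 0.45, "8": 0.55},
    {"3": 0.45, "8": 0.55},
    {"3": 0.90, "8": 0.10},
]
labels = [max(p, key=p.get) for p in probas]  # each classifier's own top class

hard_result = hard_vote(labels)   # two of three classifiers say "8"
soft_result = soft_vote(probas)   # the averaged probability favours "3"
print(hard_result, soft_result)
```

Two lukewarm votes outweigh one confident vote under hard voting, while soft voting lets the confident classifier dominate.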

In [1]:
# Load the MNIST data

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [2]:
mnist.data.shape

(70000, 784)

In [3]:
# Split into training, validation, and test sets (50,000 / 10,000 / 10,000 instances)

from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(mnist.data, mnist.target, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=10000, random_state=42)
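Calling `train_test_split` twice with `test_size=10000` leaves 70,000 - 10,000 - 10,000 = 50,000 instances for training. The same bookkeeping can be sketched on indices alone. This is an illustrative stand-in, not how scikit-learn shuffles internally, and `three_way_split` is a hypothetical helper.

```python
import random

def three_way_split(n_samples, n_val, n_test, seed=42):
    """Shuffle indices once and cut them into train / validation / test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(70_000, n_val=10_000, n_test=10_000)
print(len(train_idx), len(val_idx), len(test_idx))
```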

In [10]:
# Train various classifiers

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC

random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, random_state=42)

In [5]:
estimators = [random_forest_clf, extra_trees_clf, svm_clf]

for estimator in estimators:
    estimator.fit(X_train, y_train)

In [6]:
[estimator.score(X_val, y_val) for estimator in estimators]

[0.9692, 0.9715, 0.859]

In [7]:
from sklearn.ensemble import VotingClassifier

named_estimators = [
    ('random_forest_clf', random_forest_clf),
    ('extra_trees_clf', extra_trees_clf),
    ('svm_clf', svm_clf)
]

voting_clf_hard = VotingClassifier(named_estimators)
voting_clf_soft = VotingClassifier(named_estimators, voting='soft')

voting_clf_hard.fit(X_train,y_train)
voting_clf_soft.fit(X_train,y_train)

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(random_state=42)),
                             ('extra_trees_clf',
                              ExtraTreesClassifier(random_state=42)),
                             ('svm_clf',
                              LinearSVC(max_iter=100, random_state=42,
                                        tol=20))],
                 voting='soft')

In [12]:
voting_clf_hard.score(X_val,y_val)

[0.9693]

In [14]:
voting_clf_soft.score(X_val,y_val)

AttributeError: 'LinearSVC' object has no attribute 'predict_proba'
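The error above arises because soft voting averages class probabilities, and `LinearSVC` does not expose `predict_proba`. One way to guard against this is to filter the estimators before building a soft voter. The sketch below uses hypothetical stub classes in place of the real classifiers, and `soft_voting_candidates` is an illustrative helper, not a scikit-learn function.

```python
class ProbabilisticStub:
    """Hypothetical stand-in for a classifier that exposes predict_proba."""
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]

class MarginOnlyStub:
    """Hypothetical stand-in for a classifier that only returns decision scores."""
    def decision_function(self, X):
        return [0.0 for _ in X]

def soft_voting_candidates(named_estimators):
    """Keep only the (name, estimator) pairs that can feed a soft voter."""
    return [
        (name, est)
        for name, est in named_estimators
        if callable(getattr(est, "predict_proba", None))
    ]

named = [
    ("random_forest_clf", ProbabilisticStub()),
    ("extra_trees_clf", ProbabilisticStub()),
    ("svm_clf", MarginOnlyStub()),
]
kept_names = [name for name, _ in soft_voting_candidates(named)]
print(kept_names)
```

An alternative to dropping the SVM would be a probabilistic variant such as scikit-learn's `SVC(probability=True)`, at a noticeably higher training cost.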

In [15]:
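# Drop the SVM and vote with the remaining two classifiers
# (newer scikit-learn releases expect 'drop' here instead of None)

voting_clf_hard.set_params(svm_clf=None)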

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(random_state=42)),
                             ('extra_trees_clf',
                              ExtraTreesClassifier(random_state=42)),
                             ('svm_clf', None)])

In [18]:
voting_clf_hard.estimators_  # the SVM is not removed from the trained estimators

[('random_forest_clf', RandomForestClassifier(random_state=42)),
 ('extra_trees_clf', ExtraTreesClassifier(random_state=42)),
 ('svm_clf', None)]

In [19]:
# Remove it from the trained estimators

del voting_clf_hard.estimators_[2]

In [20]:
voting_clf_hard.score(X_val,y_val)

0.9713

In [22]:
# Same for the soft voting classifier

del voting_clf_soft.estimators_[2]

In [23]:
voting_clf_soft.score(X_val,y_val)

0.9719
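To put the validation scores in perspective, the gaps can be converted into image counts. The sketch below only reuses the accuracies printed in the cell outputs above; the variable names are illustrative.

```python
# Validation accuracies copied from the cell outputs above.
individual = {
    "random_forest_clf": 0.9692,
    "extra_trees_clf": 0.9715,
    "svm_clf": 0.859,
}
ensembles = {"hard voting (no SVM)": 0.9713, "soft voting (no SVM)": 0.9719}
n_val = 10_000  # size of the validation set

best_name, best_acc = max(individual.items(), key=lambda kv: kv[1])

# Extra (or fewer) correctly classified validation images vs. the best single model.
gains = {name: round((acc - best_acc) * n_val) for name, acc in ensembles.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+d} images vs {best_name}")
```

Gaps of a few images out of 10,000 are small, so they are probably best read as "roughly on par" rather than as a clear win for either approach.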

9. Run the individual classifiers from the previous exercise to make predictions on
the validation set, and create a new training set with the resulting predictions:
each training instance is a vector containing the set of predictions from all your
classifiers for an image, and the target is the image’s class. Train a classifier on
this new training set. Congratulations, you have just trained a blender, and
together with the classifiers they form a stacking ensemble! Now let’s evaluate the
ensemble on the test set. For each image in the test set, make predictions with all
your classifiers, then feed the predictions to the blender to get the ensemble’s
predictions. How does it compare to the voting classifier you trained earlier?
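The stacking procedure described above can be sketched on a toy problem before applying it to MNIST. Everything here is hypothetical: the three base "classifiers" are plain functions, and the blender is a simple lookup table rather than the random forest used in this notebook.

```python
from collections import Counter, defaultdict

# Hypothetical base classifiers on a toy task: predict x % 3 from an integer x.
def clf_exact(x):
    return x % 3                                  # always correct

def clf_noisy(x):
    return (x + 1) % 3 if x % 5 == 0 else x % 3   # wrong on multiples of 5

def clf_constant(x):
    return 0                                      # uninformative

base_classifiers = [clf_exact, clf_noisy, clf_constant]

def to_meta_features(xs):
    """One row per instance, one column per base classifier's prediction."""
    return [tuple(clf(x) for clf in base_classifiers) for x in xs]

def fit_blender(meta_features, targets):
    """Toy blender: remember the most common target for each prediction pattern."""
    by_pattern = defaultdict(Counter)
    for pattern, target in zip(meta_features, targets):
        by_pattern[pattern][target] += 1
    table = {p: counts.most_common(1)[0][0] for p, counts in by_pattern.items()}
    fallback = Counter(targets).most_common(1)[0][0]  # used for unseen patterns
    return table, fallback

def blend(blender, pattern):
    table, fallback = blender
    return table.get(pattern, fallback)

# Fit the blender on held-out "validation" data, then evaluate on "test" data.
toy_val = list(range(30))
toy_test = list(range(30, 60))
blender = fit_blender(to_meta_features(toy_val), [x % 3 for x in toy_val])

toy_pred = [blend(blender, p) for p in to_meta_features(toy_test)]
accuracy = sum(p == x % 3 for p, x in zip(toy_pred, toy_test)) / len(toy_test)
print(accuracy)
```

The blender is fitted on predictions for data the base classifiers were not trained on, which is why the exercise uses the validation set for this step.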

In [36]:
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:,index] = estimator.predict(X_val)

In [37]:
X_val_predictions

array([[5., 5., 5.],
       [8., 8., 8.],
       [2., 2., 3.],
       ...,
       [7., 7., 7.],
       [6., 6., 6.],
       [7., 7., 7.]], dtype=float32)

In [38]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_predictions, y_val)

# oob_score: whether to use out-of-bag samples to estimate the generalization accuracy

RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)

In [39]:
rnd_forest_blender.oob_score_

0.9703
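The out-of-bag score relies on the fact that each tree is trained on a bootstrap sample, so some training instances are never seen by that tree and can serve as a small held-out set for it. A minimal sketch of that sampling effect, using only the standard library (`out_of_bag_indices` is a hypothetical helper):

```python
import math
import random

def out_of_bag_indices(n_samples, rng):
    """Draw one bootstrap sample and return the indices that were never picked."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return [i for i in range(n_samples) if i not in drawn]

rng = random.Random(42)
n_samples = 10_000
oob = out_of_bag_indices(n_samples, rng)
oob_fraction = len(oob) / n_samples

# For large n, the expected out-of-bag fraction approaches 1/e (about 37%).
print(f"out-of-bag fraction: {oob_fraction:.3f} (1/e = {math.exp(-1):.3f})")
```

With 200 trees, nearly every instance is out-of-bag for many trees, so the forest can score all of them without a separate validation split.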

In [41]:
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:,index] = estimator.predict(X_test)

In [42]:
y_pred = rnd_forest_blender.predict(X_test_predictions)

In [43]:
from sklearn.metrics import accuracy_score

In [44]:
accuracy_score(y_test,y_pred)

0.9661