<a href="https://colab.research.google.com/github/ErickaJaneAlegre/CPEN-70/blob/main/CHAPTER_7_LAB_5_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## CHAPTER 7

---

### Ensemble Learning and Random Forests


### Laboratory Exercise 5

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

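# Scikit-Learn ≥0.20 is required
import sklearn
# Compare numeric (major, minor) parts; a plain string comparison would rank "0.3" above "0.20"
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)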

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "ensembles"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [3]:
# Load the MNIST data (70,000 images, 784 pixel features each) as NumPy arrays

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)

In [4]:
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = mpl.cm.hot,
               interpolation="nearest")
    plt.axis("off")

8. Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM. Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [5]:
from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=15000, random_state=30)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=15000, random_state=30)
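
The exercise suggests a 50,000 / 10,000 / 10,000 split only as an example; the two calls above take 15,000 instances each time instead. The following is a minimal sketch, using a dummy array as a stand-in for the 70,000 MNIST instances (the `*_dummy` names are illustrative, not part of the lab), that checks the resulting set sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for MNIST: 70,000 rows with one dummy feature (illustration only).
X_dummy = np.arange(70000).reshape(-1, 1)
y_dummy = np.arange(70000) % 10

# Same two-step split as above: an integer test_size is an absolute row count.
X_tv_dummy, X_test_dummy, y_tv_dummy, y_test_dummy = train_test_split(
    X_dummy, y_dummy, test_size=15000, random_state=30)
X_train_dummy, X_val_dummy, y_train_dummy, y_val_dummy = train_test_split(
    X_tv_dummy, y_tv_dummy, test_size=15000, random_state=30)

split_sizes = (len(X_train_dummy), len(X_val_dummy), len(X_test_dummy))
print(split_sizes)  # (40000, 15000, 15000)
```

So every score reported in this lab comes from 40,000 training, 15,000 validation, and 15,000 test instances.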

In [9]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=30)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=30)
svm_clf = LinearSVC(max_iter=100, tol=20, random_state=30)
mlp_clf = MLPClassifier(random_state=30)

In [10]:
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(random_state=30)
Training the ExtraTreesClassifier(random_state=30)
Training the LinearSVC(max_iter=100, random_state=30, tol=20)
Training the MLPClassifier(random_state=30)


In [11]:
print("Estimator Score:")
[estimator.score(X_val, y_val) for estimator in estimators]

Estimator Score:


[0.9667333333333333, 0.9694, 0.819, 0.9612]

Based on these validation scores, the Linear SVM (0.819) is far outperformed by the other three classifiers, which all score above 0.96.

In [12]:
#Combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier.


from sklearn.ensemble import VotingClassifier

named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
]

voting_clf = VotingClassifier(named_estimators)
voting_clf.fit(X_train, y_train)

In [13]:
voting_clf.score(X_val, y_val)

0.9686

In [15]:
[estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]

[0.9667333333333333, 0.9694, 0.819, 0.9612]

Next, remove the SVM from the ensemble to check whether the performance improves.

In [16]:
# Note: recent scikit-learn releases expect the string "drop" here instead of None
voting_clf.set_params(svm_clf=None)

In [17]:
# Check the updated list of estimators

voting_clf.estimators

[('random_forest_clf', RandomForestClassifier(random_state=30)),
 ('extra_trees_clf', ExtraTreesClassifier(random_state=30)),
 ('svm_clf', None),
 ('mlp_clf', MLPClassifier(random_state=30))]

In [18]:
voting_clf.estimators_

[RandomForestClassifier(random_state=30),
 ExtraTreesClassifier(random_state=30),
 LinearSVC(max_iter=100, random_state=30, tol=20),
 MLPClassifier(random_state=30)]

In [19]:
# The fitted list still holds the trained SVM at index 2; delete it by hand to avoid refitting
del voting_clf.estimators_[2]

voting_clf.score(X_val, y_val)

0.9712666666666666
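
The manual `del` is needed because `set_params` only edits the configuration list `estimators`; the fitted list `estimators_` is rebuilt only by `fit`. Below is a minimal sketch of that behaviour on the small `load_digits` dataset, assuming a scikit-learn version that accepts `"drop"`. The `demo_clf` name, the tiny forests, and the decision tree are illustrative choices, not part of the lab:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.tree import DecisionTreeClassifier

# Small stand-in dataset so the sketch runs in a few seconds.
X_small, y_small = load_digits(return_X_y=True)

demo_clf = VotingClassifier([
    ("rf", RandomForestClassifier(n_estimators=10, random_state=30)),
    ("et", ExtraTreesClassifier(n_estimators=10, random_state=30)),
    ("tree", DecisionTreeClassifier(random_state=30)),
])
demo_clf.fit(X_small, y_small)

# Dropping an estimator only edits the configuration list `estimators`...
demo_clf.set_params(tree="drop")
n_fitted_before_refit = len(demo_clf.estimators_)

# ...the fitted list `estimators_` is rebuilt only when fit() runs again.
demo_clf.fit(X_small, y_small)
n_fitted_after_refit = len(demo_clf.estimators_)

print(n_fitted_before_refit, n_fitted_after_refit)  # 3 2
```

Refitting is the cleaner route on a small dataset; deleting the fitted estimator by hand, as done above, avoids retraining the remaining classifiers on MNIST.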

In [21]:
#Setting the voting to soft

voting_clf.voting = "soft"
voting_clf.score(X_val, y_val)

0.9673333333333334

In [22]:
#Setting the voting to hard

voting_clf.voting = "hard"
voting_clf.score(X_val, y_val)

0.9712666666666666

On the validation set, hard voting (0.9713) performs better than soft voting (0.9673), so hard voting is the better choice for combining these classifiers into an ensemble.
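
Hard and soft voting can disagree on the same inputs. The sketch below uses made-up class probabilities for a single sample (the numbers are hypothetical, chosen only to show a disagreement) and applies both rules by hand with NumPy. Soft voting needs `predict_proba`, which `LinearSVC` does not provide, so it only became possible here once the SVM was removed.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for ONE sample
# (rows: classifiers, columns: classes 0, 1, 2).
probas = np.array([
    [0.90, 0.10, 0.00],   # classifier A is very confident in class 0
    [0.40, 0.60, 0.00],   # classifier B leans toward class 1
    [0.45, 0.55, 0.00],   # classifier C leans toward class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)                      # array([0, 1, 1])
hard_prediction = np.bincount(votes, minlength=3).argmax()

# Soft voting: average the probabilities, then pick the largest mean.
soft_prediction = probas.mean(axis=0).argmax()

print(hard_prediction, soft_prediction)  # 1 0
```

One very confident classifier can outweigh two lukewarm ones under soft voting, which is why the two modes can score differently on the same data.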

In [24]:
[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

[0.9656666666666667, 0.9681333333333333, 0.9631333333333333]

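On the validation set, the hard voting classifier (0.9713) only very slightly reduced the error rate of the best individual model, Extra-Trees (0.9694). The cell above reports test-set scores for the individual classifiers only; the ensemble's own test-set score would come from `voting_clf.score(X_test, y_test)`.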
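
To put "very slightly" into numbers, here is a short worked calculation using the validation-set accuracies reported earlier (0.9694 for Extra-Trees, 0.9712666666666666 for hard voting without the SVM). The variable names are illustrative.

```python
# Validation accuracies copied from the outputs above.
best_single_acc = 0.9694               # Extra-Trees
voting_acc = 0.9712666666666666        # hard voting without the SVM

best_single_err = 1 - best_single_acc
voting_err = 1 - voting_acc

absolute_drop = best_single_err - voting_err
relative_drop = absolute_drop / best_single_err

print("error rate: {:.2%} -> {:.2%}".format(best_single_err, voting_err))
# error rate: 3.06% -> 2.87%
print("relative reduction: {:.1%}".format(relative_drop))
# relative reduction: 6.1%
```

In other words, the ensemble removes roughly 6% of the errors that the best single classifier makes on the validation set.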

9. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class. Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the classifiers they form a stacking ensemble! Now let’s evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier?

In [26]:
# One row per validation image, one column per classifier's predicted digit
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)

X_val_predictions

array([[5., 5., 5., 5.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       ...,
       [1., 1., 1., 1.],
       [5., 5., 5., 5.],
       [9., 9., 9., 9.]], dtype=float32)

In [30]:
rnd_forest_blender = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=50)
rnd_forest_blender.fit(X_val_predictions, y_val)

In [31]:
rnd_forest_blender.oob_score_

0.9684

In [32]:
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

y_pred = rnd_forest_blender.predict(X_test_predictions)

In [33]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.9685333333333334

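The stacking ensemble reaches 0.9685 on the test set, which is only marginally above the best individual classifier on that set (Extra-Trees, 0.9681). It does not match the 0.9713 that the hard voting classifier achieved on the validation set, although that comparison is indirect because the voting classifier's test-set score is not shown above.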
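
As a point of comparison, scikit-learn (0.22 and later) also ships a ready-made `StackingClassifier`. The sketch below is an alternative to the hand-built blender above, not the method used in this lab: it runs on the small `load_digits` dataset so it finishes quickly, it builds the blender's training set from cross-validated predictions rather than a held-out validation set, and the estimator sizes and names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.model_selection import train_test_split

# Small stand-in dataset (8x8 digits) so the sketch runs quickly.
X_digits, y_digits = load_digits(return_X_y=True)
X_dig_train, X_dig_test, y_dig_train, y_dig_test = train_test_split(
    X_digits, y_digits, test_size=0.25, random_state=30)

stacking_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=30)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=30)),
    ],
    # Blender choice is an assumption that mirrors the Random Forest blender above.
    final_estimator=RandomForestClassifier(n_estimators=100, random_state=50),
    stack_method="predict",   # feed hard class predictions to the blender
    cv=5,                     # out-of-fold predictions form the blender's training set
)
stacking_clf.fit(X_dig_train, y_dig_train)

stacking_acc = stacking_clf.score(X_dig_test, y_dig_test)
print(stacking_acc)
```

Setting `stack_method="predict"` keeps the blender's inputs comparable to `X_val_predictions` above; by default the class would pass probabilities or decision scores instead.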