# Chapter 7: Random Forest and Ensemble Learning

This chapter introduces several ensemble methods: voting, bagging, pasting, random forests, boosting, and stacking. The book covers only how to use and fine-tune these methods. For the underlying principles or more advanced topics, please refer to other machine learning tutorials.

> This notebook contains my solutions to the programming exercises of chapter 7. For answers to the other questions, see the markdown file in the same folder. **Note that my code may not be fully tested or evaluated; for example, grid search and cross-validation may not have been performed. I only chose hyperparameters that give an acceptable result.**

## Exercise 8: Voting Classifier on MNIST

Requirement: Create a voting classifier on the MNIST dataset. Compare the performance of the individual classifiers with that of the ensemble.

First, let's prepare the dataset by splitting MNIST into training, validation, and test sets.

In [None]:
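import numpy as np
from sklearn.datasets import fetch_openml

# fetch_mldata was removed from scikit-learn; fetch_openml provides the same 70,000 images.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist["data"], mnist["target"].astype(np.uint8)  # OpenML returns the labels as strings

# 50,000 images for training, 10,000 for validation, and the last 10,000 for testing.
X_train, y_train = X[:50000], y[:50000]
X_val, y_val = X[50000:60000], y[50000:60000]
X_test, y_test = X[60000:], y[60000:]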
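
A slice-based split like the one above only works well if the rows are already shuffled, so that every split contains every digit. As a hedged alternative, the sketch below builds a three-way split with `train_test_split` and `stratify`, then checks the class counts with `np.bincount`. The arrays `X_demo` and `y_demo` are small synthetic stand-ins, not MNIST, and all `*_demo` names are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 samples, 10 perfectly balanced classes.
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 4))
y_demo = np.repeat(np.arange(10), 100)

# First carve out the test set, then split the remainder into train and validation.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=200, stratify=y_demo, random_state=42
)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=200, stratify=y_rest, random_state=42
)

# Every split should contain every class in the same proportion.
print("train:", np.bincount(y_train_demo))
print("val:  ", np.bincount(y_val_demo))
print("test: ", np.bincount(y_test_demo))
```

The same `np.bincount` check can be run on `y_train`, `y_val`, and `y_test` to confirm that no digit is missing from a split.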

Next, create several base classifiers: a random forest, an extremely randomized trees (extra-trees) classifier, and an SVM. Train them on the same training set and evaluate them on the validation set.

In [None]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import precision_score

# Fit the scaler on the training set only, then apply it to every split.
# fit_transform() and transform() return new arrays, so the results must be assigned.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

random_forest_clf = RandomForestClassifier(n_estimators=1000)
random_forest_clf.fit(X_train, y_train)

extra_tree_clf = ExtraTreesClassifier(n_estimators=1000)
extra_tree_clf.fit(X_train, y_train)

svm_clf = SVC(kernel="rbf")
svm_clf.fit(X_train, y_train)

# MNIST has 10 classes, so precision_score needs an explicit averaging mode;
# "macro" is the unweighted mean of the per-class precisions.
random_forest_predictions_on_val = random_forest_clf.predict(X_val)
print("Random Forest Precision:", precision_score(y_val, random_forest_predictions_on_val, average="macro"))
extra_tree_predictions_on_val = extra_tree_clf.predict(X_val)
print("Extra Trees Precision:", precision_score(y_val, extra_tree_predictions_on_val, average="macro"))
svm_predictions_on_val = svm_clf.predict(X_val)
print("SVM Precision:", precision_score(y_val, svm_predictions_on_val, average="macro"))
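
With ten digit classes, a single precision number has to be averaged over the classes in some way. The sketch below spells out one common choice, macro averaging, on a toy example: compute the precision of each class separately, then take the unweighted mean. The helper `macro_precision` and the toy labels are made up for this illustration; treating a class that is never predicted as precision 0 is an assumption of the sketch.

```python
import numpy as np

def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision (illustrative helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for label in np.unique(np.concatenate([y_true, y_pred])):
        predicted_as_label = y_pred == label
        if predicted_as_label.sum() == 0:
            # Assumption of this sketch: a class that is never predicted scores 0.
            per_class.append(0.0)
        else:
            # Fraction of the samples predicted as `label` that really are `label`.
            per_class.append(float((y_true[predicted_as_label] == label).mean()))
    return float(np.mean(per_class))

y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

# Per-class precision here is 1/2, 2/3 and 1, so the macro average is 13/18.
print(macro_precision(y_true_toy, y_pred_toy))
```

Macro averaging gives every digit the same weight regardless of how often it appears, which is reasonable for MNIST because the classes are roughly balanced.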

Since `SVC` does not produce class probabilities by default (that would require `probability=True`), we combine the estimators with a hard voting classifier, which takes a majority vote over the predicted classes. Then check the performance of the voting classifier on the validation set.

In [None]:
from sklearn.ensemble import VotingClassifier

# VotingClassifier fits its own copies of the estimators, so fresh instances are passed in.
voting_clf = VotingClassifier(
    estimators=[
        ("random_forest", RandomForestClassifier(n_estimators=1000)),
        ("extra_tree", ExtraTreesClassifier(n_estimators=1000)),
        ("svm", SVC()),
    ],
    voting="hard",
)
voting_clf.fit(X_train, y_train)

voting_predictions_on_val = voting_clf.predict(X_val)
print("Voting Classifier Precision:", precision_score(y_val, voting_predictions_on_val, average="macro"))
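
To make the idea of hard voting concrete, here is a minimal sketch of a majority vote written by hand with NumPy. The helper `hard_vote` and the toy predictions are made up for this illustration; it assumes non-negative integer labels and breaks ties in favour of the smallest label, which is how I understand scikit-learn's hard voting to behave.

```python
import numpy as np

def hard_vote(predictions):
    """Majority vote over an array of shape (n_estimators, n_samples)."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # For each sample (column), count the votes per class and keep the most frequent one.
    # argmax returns the first maximum, so ties go to the smallest label.
    return np.array(
        [np.bincount(votes, minlength=n_classes).argmax() for votes in predictions.T]
    )

# Three hypothetical classifiers predicting four samples.
toy_predictions = [
    [3, 5, 8, 1],
    [3, 5, 9, 2],
    [7, 5, 8, 0],
]
print(hard_vote(toy_predictions))
```

In the last column all three classifiers disagree, so the vote is a three-way tie and the smallest label wins. With only three estimators such ties are not rare, which is one reason to prefer soft voting when every estimator can output probabilities.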

Finally, check the performance of all models on the test set.

In [None]:
# The models expect X_test to be preprocessed exactly like X_train. Note that
# scaler.transform() returns a new array; calling it without assigning the result has no effect.

random_forest_predictions_on_test = random_forest_clf.predict(X_test)
print("Random Forest Precision:", precision_score(y_test, random_forest_predictions_on_test, average="macro"))

extra_tree_predictions_on_test = extra_tree_clf.predict(X_test)
print("Extra Trees Precision:", precision_score(y_test, extra_tree_predictions_on_test, average="macro"))

svm_predictions_on_test = svm_clf.predict(X_test)
print("SVM Precision:", precision_score(y_test, svm_predictions_on_test, average="macro"))

voting_predictions_on_test = voting_clf.predict(X_test)
print("Voting Precision:", precision_score(y_test, voting_predictions_on_test, average="macro"))

## Exercise 9: Stacking on MNIST

Requirement: Create a stacking model based on the base classifiers above. Compare the performance of the stacking ensemble with that of the voting classifier.

Since we already have the trained base estimators, we use their predictions directly as input features for the stacking model. The blender is trained on the validation set, which the base estimators never saw during training.

In [None]:
import numpy as np

# One row per validation sample, one column per base estimator's predicted label.
new_features = np.column_stack(
    [random_forest_predictions_on_val, extra_tree_predictions_on_val, svm_predictions_on_val]
)

stacking_model = SVC(kernel="rbf", C=1, gamma="scale")
stacking_model.fit(new_features, y_val)

Let's check the performance of the stacking ensemble model.

In [None]:
import numpy as np

# Build the blender's input for the test set in the same column order as during training.
new_test_features = np.column_stack(
    [random_forest_predictions_on_test, extra_tree_predictions_on_test, svm_predictions_on_test]
)

stacking_predictions_on_test = stacking_model.predict(new_test_features)
print("Stacking Model Precision:", precision_score(y_test, stacking_predictions_on_test, average="macro"))
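
The manual procedure above can also be expressed with scikit-learn's `StackingClassifier` (available since version 0.22). The sketch below is not a replacement for the exercise solution: it uses the small built-in `load_digits` dataset as a stand-in for MNIST so that it runs quickly, and the estimator sizes and all `*_demo` names are arbitrary choices for this illustration. Unlike the hold-out approach above, `StackingClassifier` trains the final estimator on cross-validated outputs of the base estimators.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Small 8x8 digits dataset as a fast stand-in for MNIST.
X_demo, y_demo = load_digits(return_X_y=True)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler_demo = MinMaxScaler()
X_train_demo = scaler_demo.fit_transform(X_train_demo)
X_test_demo = scaler_demo.transform(X_test_demo)

stacking_clf_demo = StackingClassifier(
    estimators=[
        ("random_forest", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("extra_tree", ExtraTreesClassifier(n_estimators=50, random_state=42)),
        ("svm", SVC(kernel="rbf")),
    ],
    final_estimator=SVC(kernel="rbf", C=1, gamma="scale"),  # the blender
    cv=5,  # the blender is trained on out-of-fold outputs of the base estimators
)
stacking_clf_demo.fit(X_train_demo, y_train_demo)

demo_score = stacking_clf_demo.score(X_test_demo, y_test_demo)
print("Stacking accuracy on the demo test split:", demo_score)
```

By default the base estimators feed class probabilities or decision scores to the blender rather than hard labels, which usually gives the blender more information to work with than the three predicted digits used in the manual version.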