# Ensemble Methods
## Experimenting on the MNIST dataset

Let's use sklearn's implementations of the Bagging, Random Forest, Weighted Voting and Stacking models to run experiments on the MNIST dataset. First, we will use only a subset of the dataset (the first 2,000 training samples and the first 1,000 test samples), and then we will use the full dataset.

In [1]:
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
plt.style.use("seaborn-v0_8-whitegrid") # Plot style

from sklearn.metrics import accuracy_score
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

%load_ext autoreload
%autoreload 2

In [2]:
mnist_train = pd.read_csv("data/raw/mnist_train.csv", header=None)
mnist_test = pd.read_csv("data/raw/mnist_test.csv", header=None)

In [3]:
X_train, y_train = mnist_train.iloc[:2000, 1:], mnist_train.iloc[:2000, 0]
X_test, y_test = mnist_test.iloc[:1000, 1:], mnist_test.iloc[:1000, 0]

## 1. Bagging

In [4]:
X_train.shape

(2000, 784)

In [5]:
start = time.perf_counter()
classifier = BaggingClassifier(n_estimators=10, random_state=0)
classifier.fit(X_train, y_train)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train error rate: ", 1 - accuracy_score(classifier.predict(X_train), y_train))
print("Test error rate: ", 1 - accuracy_score(classifier.predict(X_test), y_test))

TypeError: '<' not supported between instances of 'int' and 'str'

In [None]:
# ensemble_sizes = range(1, 51, 4)
ensemble_sizes = range(50, 65, 4)
train_error = []
test_error = []

for ensemble_size in ensemble_sizes:
    model = BaggingClassifier(n_estimators=ensemble_size, random_state=0)
    model.fit(X_train, y_train)
    train_error.append(1 - accuracy_score(model.predict(X_train), y_train))
    test_error.append(1 - accuracy_score(model.predict(X_test), y_test))

plt.plot(ensemble_sizes, train_error)
plt.plot(ensemble_sizes, test_error)
plt.xticks(ensemble_sizes)
plt.xlabel("Number of estimators")
plt.ylabel("Error Rate")
plt.legend(["Train", "Test"])
plt.show()

## 2. Random Forest (RF)

In [None]:
start = time.perf_counter()
classifier = RandomForestClassifier(n_estimators=50, max_features=50, random_state=0)
classifier.fit(X_train, y_train)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train error rate: ", 1 - accuracy_score(classifier.predict(X_train), y_train))
print("Test error rate: ", 1 - accuracy_score(classifier.predict(X_test), y_test))

In [None]:
ensemble_sizes = range(1, 51, 4)
for nr_features in [10, 50, 300, "sqrt"]:  # "sqrt" replaces "auto", which was removed in scikit-learn 1.3
    error_rates = {"train": [], "test": []}

    for ensemble_size in ensemble_sizes:
        model = RandomForestClassifier(n_estimators=ensemble_size, max_features=nr_features, random_state=0)
        model.fit(X_train, y_train)

        error_rates["train"].append(1 - accuracy_score(model.predict(X_train), y_train))
        error_rates["test"].append(1 - accuracy_score(model.predict(X_test), y_test))

    # Dashed lines show train error, solid lines show test error
    plt.plot(ensemble_sizes, error_rates["train"], label=f"train, max_features={nr_features}", linestyle="--", linewidth=2)
    plt.plot(ensemble_sizes, error_rates["test"], label=f"test, max_features={nr_features}", linewidth=2)

plt.xlabel("Number of trees")
plt.ylabel("Error Rate")
plt.xticks(ensemble_sizes)
plt.legend(ncol=2)
plt.show()

In [None]:
# Importance of each pixel according to the last model fitted in the loop above, most important first
pd.DataFrame({"pixel": np.arange(28**2), "importance": model.feature_importances_}).sort_values("importance", ascending=False)

Now, let's use the full dataset. The ideal workflow is:
1. find the optimal number of estimators on the training dataset (e.g. using cross-validation),
2. train an ensemble model with the optimal number of estimators (i.e. individual models) on the training dataset,
3. test the trained ensemble model on the testing dataset.
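
The three steps above can be sketched as follows. This is only an illustration under a few assumptions: it uses sklearn's small built-in `load_digits` dataset as a stand-in for MNIST so that it runs quickly, it tunes a Random Forest, and the candidate values in `param_grid` are arbitrary choices rather than values taken from this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: the 8x8 digits dataset bundled with sklearn stands in for MNIST.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 1: pick n_estimators by 3-fold cross-validation on the training split only.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 15, 45]},  # illustrative candidates
    cv=3,
    scoring="accuracy",
)

# Step 2: with refit=True (the default), the best setting is retrained
# on the whole training split once the search is done.
search.fit(X_tr, y_tr)
best_n = search.best_params_["n_estimators"]

# Step 3: evaluate the refitted model once on the held-out test split.
test_acc = accuracy_score(y_te, search.predict(X_te))
print(f"Best n_estimators: {best_n}, test accuracy: {test_acc:.3f}")
```

The test split is touched only in the last step, so the reported accuracy is not biased by the hyperparameter search.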

In [None]:
X_train, y_train = mnist_train.iloc[:,1:], mnist_train.iloc[:,0]
X_test, y_test = mnist_test.iloc[:,1:], mnist_test.iloc[:,0]

In [None]:
start = time.perf_counter()
classifier = BaggingClassifier(n_estimators=17, random_state=0) # Suppose, the optimal number of estimators is 17 for Bagging
classifier.fit(X_train, y_train)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_train), y_train))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

In [None]:
start = time.perf_counter()
classifier = RandomForestClassifier(n_estimators=45, random_state=0) # Suppose, the optimal number of estimators is 45 for RF
classifier.fit(X_train, y_train)
print(f"Time taken: {time.perf_counter() - start} sec")
print("Train accuracy: ", accuracy_score(classifier.predict(X_train), y_train))
print("Test accuracy: ", accuracy_score(classifier.predict(X_test), y_test))

Here, both classifiers overfit the training data, and the test accuracy for:
1. Bagging should be around 92%, because every tree in Bagging considers all features at each split (784 features in total),
2. Random Forest should be around 95%, because each split in a Random Forest tree considers only a random subset of the features, which makes the trees less correlated.
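
The difference in how many features each tree considers per split can be checked directly. The sketch below assumes the default settings of both classifiers and again uses the small `load_digits` dataset (64 features) instead of MNIST; it reads the `max_features_` attribute of the first fitted tree in each ensemble.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Assumption: 8x8 digits (64 features) stand in for MNIST (784 features).
X, y = load_digits(return_X_y=True)

bagging = BaggingClassifier(n_estimators=5, random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Each fitted decision tree records how many features it examines per split.
bagging_per_split = bagging.estimators_[0].max_features_
forest_per_split = forest.estimators_[0].max_features_

print("Bagging, features per split:", bagging_per_split)       # 64 (all features)
print("Random Forest, features per split:", forest_per_split)  # 8 (sqrt of 64)
```

With the default `max_features="sqrt"`, the same check on MNIST would give 28 features per split for Random Forest against 784 for Bagging.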

The other two methods, Weighted Voting and Stacking, are used in much the same way as Bagging and Random Forest.

## 3. Weighted Voting
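
A minimal sketch of weighted voting with sklearn's `VotingClassifier` is given here. The choice of base models, the weights `[2, 1, 2]` and the use of the small `load_digits` dataset are all assumptions made for illustration; none of them come from the experiments above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: the 8x8 digits dataset bundled with sklearn stands in for MNIST.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Soft voting averages the predicted class probabilities of the base models;
# the weights decide how much each model contributes to that average.
voter = VotingClassifier(
    estimators=[
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
    ],
    voting="soft",
    weights=[2, 1, 2],  # illustrative weights, not tuned
)
voter.fit(X_tr, y_tr)

voting_test_acc = accuracy_score(y_te, voter.predict(X_te))
print(f"Weighted voting test accuracy: {voting_test_acc:.3f}")
```

In practice the weights would be chosen on validation data, for example in proportion to each model's cross-validated accuracy.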

## 4. Stacking
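
A minimal sketch of stacking with sklearn's `StackingClassifier` is given here. As an assumption for illustration, it stacks a Random Forest and a k-nearest-neighbours model under a logistic regression meta-model, and it uses the small `load_digits` dataset rather than MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the 8x8 digits dataset bundled with sklearn stands in for MNIST.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The base models produce out-of-fold predictions (cv=3) on the training split;
# the final estimator is trained on those predictions instead of the raw pixels.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_tr, y_tr)

stacking_test_acc = accuracy_score(y_te, stack.predict(X_te))
print(f"Stacking test accuracy: {stacking_test_acc:.3f}")
```

Using out-of-fold predictions keeps the meta-model from simply trusting base models that have memorised the training data.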