# Exercises

1. If you have trained 5 different models on the exact same training data, & they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?
2. What is the difference between hard & soft voting classifiers?
3. Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, random forests, or stacking ensembles?
4. What is the benefit of out-of-bag evaluation?
5. What makes extra-trees more random than regular random forests? How can this extra randomness help? Are extra-trees slower or faster than regular random forests?
6. If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak & how?
7. If your gradient boosting ensemble overfits the training set, should you increase or decrease the learning rate?
8. Load the MNIST data, & split it into a training set, a validation set, & a test set (e.g., use 50,000 instances for training, 10,000 for validation, & 10,000 for testing). Then train various classifiers, such as a random forest classifier, an extra-trees classifier, & an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?
9. Run the individual classifiers from the previous exercise to make predictions on the validation set, & create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, & the target is the image's class. Train a classifier on this new training set. Congratulations, you have just trained a blender, & together with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. How does it compare to the voting classifier you trained earlier?

---

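1. Yes, there is a chance. You can combine the models into a voting ensemble, which often performs better than any individual model. It works best when the models are very different from one another (e.g., different algorithms), so that their errors are largely uncorrelated. Because all five models were trained on the same data, they are likely to make some of the same errors, which limits the gain; training them on different training instances (as bagging & pasting do) helps further.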
2. A hard voting classifier counts the class predictions of its predictors & predicts the class that gets the most votes for any given instance. A soft voting classifier averages the class probabilities estimated by its predictors & predicts the class with the highest average probability for any given instance. Soft voting gives more weight to confident votes, but it requires every predictor to estimate class probabilities.
3. Yes for bagging, pasting, & random forests, since each predictor in the ensemble is trained independently of the others. Boosting ensembles won't see much of a difference in training time, because each predictor is built on the results of the previous one, so training is sequential. For stacking, all the predictors within a layer are independent of each other & can be trained in parallel on multiple servers, but, like boosting, the predictors in one layer must wait for the predictors in the previous layer to finish training.
4. With bagging, a portion of the training instances (about 37% on average) is never sampled for any given predictor, so each predictor can be evaluated on the instances it did not see. This gives an estimate of how well the ensemble will generalise without setting aside a separate validation set, which leaves more instances for training.
5. A random forest is generally trained with bagging, so each tree sees a random subset of the training set. When splitting a node, it also searches for the best feature & threshold among a random subset of features (greedy CART algorithm, which minimises the weighted sum of Gini impurities). Extra-trees add more randomness by using a random threshold for each candidate feature instead of searching for the best possible threshold. This acts as a form of regularisation: it increases the bias but lowers the variance, which can help when a random forest overfits. It also makes extra-trees faster to train, because finding the best threshold for each feature at every node is one of the most time-consuming tasks of growing a tree. Prediction speed is about the same for both.
6. You can increase the number of estimators in your AdaBoost ensemble, or reduce the regularisation of the base estimator so that it can fit more complex patterns. You can also try slightly increasing the learning rate.
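7. If your gradient boosting ensemble overfits the training set, you should decrease the learning rate. You can also use early stopping to find the right number of predictors (you probably have too many).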
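
Answers 2 & 4 can be checked with a little arithmetic. The sketch below is a plain-Python illustration, not how scikit-learn implements voting: the helper names (`hard_vote`, `soft_vote`, `oob_fraction`) and the three probability vectors are made up for the example. It shows a case where hard & soft voting disagree, & the fraction of instances a single bagged predictor is expected to leave out-of-bag.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class predicted by the most classifiers (ties go to the first seen)."""
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average the per-class probabilities and return the index of the best class."""
    n_classes = len(probabilities[0])
    averages = [sum(p[c] for p in probabilities) / len(probabilities)
                for c in range(n_classes)]
    return max(range(n_classes), key=averages.__getitem__)

def oob_fraction(m):
    """Probability that one instance is never drawn in m draws with replacement from m instances."""
    return (1 - 1 / m) ** m

# Hypothetical class probabilities from three classifiers for one instance (classes 0 & 1).
probas = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]
# Each classifier's own prediction is its most probable class.
labels = [max(range(len(p)), key=p.__getitem__) for p in probas]

print(hard_vote(labels))   # 0: two of the three classifiers vote for class 0
print(soft_vote(probas))   # 1: the one very confident classifier outweighs the other two
print(round(oob_fraction(50_000), 3))  # 0.368
```

Two lukewarm votes for class 0 win the hard vote, while the averaged probabilities (0.4 vs 0.6) favour class 1, which is why soft voting can do better when the predictors' probabilities are well calibrated.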

# 8.

In [2]:
from sklearn.datasets import fetch_openml
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

mnist = fetch_openml("mnist_784", version = 1, as_frame = False)
mnist.keys()
X, y = mnist["data"].astype(np.intc), mnist["target"].astype(np.intc)

strat_split = StratifiedShuffleSplit(n_splits = 1, test_size = 10000, random_state = 32)
for train_index, test_index in strat_split.split(X, y):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
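# The indices from this second pass are relative to X_train (not X), so index
# X_train/y_train, and build the validation set before overwriting X_train.
for train_index, val_index in strat_split.split(X_train, y_train):
    X_val, y_val = X_train[val_index], y_train[val_index]
    X_train, y_train = X_train[train_index], y_train[train_index]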
    
scaler = StandardScaler()
# Fit the scaler on the training set only, then reuse its statistics
# for the validation and test sets.
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
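
When a dataset is split in two passes, the indices returned by the second pass refer to positions in the subset that was passed in, not in the full dataset. The sketch below illustrates this with plain Python lists & a hypothetical helper, `two_stage_split`; it shuffles without stratification, so it is a simplification of `StratifiedShuffleSplit` rather than a replacement for it. It then checks that the three index sets do not overlap, which is a cheap sanity check before trusting any test score.

```python
import random

def two_stage_split(n_samples, n_test, n_val, seed=32):
    """Split range(n_samples) into disjoint train/val/test index lists."""
    rng = random.Random(seed)

    # First pass: hold out the test indices from the full dataset.
    indices = list(range(n_samples))
    rng.shuffle(indices)
    test_idx = indices[:n_test]
    remaining = indices[n_test:]

    # Second pass: these positions are relative to `remaining`,
    # so they must be mapped back through `remaining`, not used directly.
    positions = list(range(len(remaining)))
    rng.shuffle(positions)
    val_idx = [remaining[p] for p in positions[:n_val]]
    train_idx = [remaining[p] for p in positions[n_val:]]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = two_stage_split(n_samples=70, n_test=10, n_val=10)

# No instance may appear in more than one set.
assert not set(train_idx) & set(val_idx)
assert not set(train_idx) & set(test_idx)
assert not set(val_idx) & set(test_idx)
print(len(train_idx), len(val_idx), len(test_idx))  # 50 10 10
```

The same three `set` intersections can be run on the real index arrays to confirm that no test or validation instance leaks into the training set.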

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

forest_classifier = RandomForestClassifier()
extratrees_classifier = ExtraTreesClassifier()
svm_classifier = SVC()
hardvoting_classifier = VotingClassifier(estimators = [("forest", forest_classifier), 
                                                       ("extratrees", extratrees_classifier), 
                                                       ("svc", svm_classifier)],
                                         voting = "hard")

for count, classifier in enumerate([forest_classifier, extratrees_classifier, svm_classifier, hardvoting_classifier]):
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_val)
    if count != 3:
        print(classifier.__class__.__name__, accuracy_score(y_val, y_pred))
    else:
        print(f"Hard{classifier.__class__.__name__}", accuracy_score(y_val, y_pred))

RandomForestClassifier 0.9646
ExtraTreesClassifier 0.9687
SVC 0.9764
HardVotingClassifier 0.97


In [4]:
forest_classifier = RandomForestClassifier()
extratrees_classifier = ExtraTreesClassifier()
svm_classifier = SVC(probability = True)
softvoting_classifier = VotingClassifier(estimators = [("forest", forest_classifier), 
                                                       ("extratrees", extratrees_classifier), 
                                                       ("svc", svm_classifier)],
                                         voting = "soft")
for count, classifier in enumerate([forest_classifier, extratrees_classifier, svm_classifier, softvoting_classifier]):
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_val)
    if count != 3:
        print(classifier.__class__.__name__, accuracy_score(y_val, y_pred))
    else:
        print(f"Soft{classifier.__class__.__name__}", accuracy_score(y_val, y_pred))

RandomForestClassifier 0.9648
ExtraTreesClassifier 0.9677
SVC 0.9764
SoftVotingClassifier 0.9751


The soft voting classifier performs slightly better than the hard voting classifier. Let's see if this is the case for the test set.

In [5]:
forest_classifier = RandomForestClassifier()
extratrees_classifier = ExtraTreesClassifier()
svm_classifier = SVC()
hardvoting_classifier = VotingClassifier(estimators = [("forest", forest_classifier), 
                                                       ("extratrees", extratrees_classifier), 
                                                       ("svc", svm_classifier)],
                                         voting = "hard")

for count, classifier in enumerate([forest_classifier, extratrees_classifier, svm_classifier, hardvoting_classifier]):
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    if count != 3:
        print(classifier.__class__.__name__, accuracy_score(y_test, y_pred))
    else:
        print(f"Hard{classifier.__class__.__name__}", accuracy_score(y_test, y_pred))

RandomForestClassifier 0.9899
ExtraTreesClassifier 0.9908
SVC 0.9846
HardVotingClassifier 0.9915


In [6]:
forest_classifier = RandomForestClassifier()
extratrees_classifier = ExtraTreesClassifier()
svm_classifier = SVC(probability = True)
softvoting_classifier = VotingClassifier(estimators = [("forest", forest_classifier), 
                                                       ("extratrees", extratrees_classifier), 
                                                       ("svc", svm_classifier)],
                                         voting = "soft")
for count, classifier in enumerate([forest_classifier, extratrees_classifier, svm_classifier, softvoting_classifier]):
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    if count != 3:
        print(classifier.__class__.__name__, accuracy_score(y_test, y_pred))
    else:
        print(f"Soft{classifier.__class__.__name__}", accuracy_score(y_test, y_pred))

RandomForestClassifier 0.9898
ExtraTreesClassifier 0.99
SVC 0.9846
SoftVotingClassifier 0.9929


---

# 9.