### 1) If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?

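Yes, there is a chance. Even though the models were trained on the same data and are therefore not independent of one another, their predictions will probably differ somewhat. Combining them in an ensemble (for example, a hard or soft voting classifier) could therefore improve precision. The gain tends to be larger when the models are very different from each other (e.g., an SVM, a decision tree and a logistic regression classifier), because they are then more likely to make different kinds of errors.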

### 2) What is the difference between hard and soft voting classifiers?

A hard voting classifier predicts the class that receives the most votes from the individual predictors (the mode of their predictions). A soft voting classifier averages the estimated probability of each class over all classifiers and predicts the class with the highest average probability. Soft voting therefore gives more weight to confident votes, but it only works if every classifier can estimate class probabilities (i.e., they all have a `predict_proba()` method).
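The difference can be seen on a single instance. The sketch below uses made-up probabilities from three hypothetical classifiers (none of these numbers come from the models trained later) and computes both kinds of vote by hand with NumPy.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for ONE instance
# (rows: classifiers, columns: classes 0 and 1). The values are made up.
probas = np.array([
    [0.55, 0.45],  # classifier A leans weakly toward class 0
    [0.60, 0.40],  # classifier B leans weakly toward class 0
    [0.05, 0.95],  # classifier C is very confident in class 1
])

# Hard voting: each classifier casts one vote for its most likely class,
# and the class with the most votes wins.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes).argmax()

# Soft voting: average the probabilities per class, then pick the class
# with the highest average.
mean_probas = probas.mean(axis=0)
soft_prediction = mean_probas.argmax()

print("hard vote:", hard_prediction)
print("soft vote:", soft_prediction)
```

Here two weak votes for class 0 win the hard vote, while the single confident vote for class 1 wins the soft vote.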

### 3) Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, random forests, or stacking ensembles?

It is possible for bagging, pasting and random forest ensembles, since each predictor in the ensemble is trained independently of the others. This is not the case for boosting ensembles: each predictor is built from the results of the previous one, so training is necessarily sequential and gains nothing from being distributed across servers. Stacking is a mixed case. The predictors in a given layer are independent of one another, so they can be trained in parallel, but the final step (the blender that combines all the estimates) can only be trained once the predictors it depends on have been trained.
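A minimal sketch of why bagging parallelizes well, assuming a small synthetic dataset (`make_moons`) and a hypothetical helper `fit_one_tree`. It uses local worker threads through joblib rather than multiple servers, but it relies on the same property: each predictor needs nothing from the others.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

def fit_one_tree(seed):
    """Train one tree on its own bootstrap sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    return DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])

# Every call is independent of the others, so the calls can run concurrently.
trees = Parallel(n_jobs=2, prefer="threads")(
    delayed(fit_one_tree)(seed) for seed in range(10)
)

# Aggregate by majority vote (the labels are 0/1 in this dataset).
all_preds = np.array([tree.predict(X) for tree in trees])
majority = (all_preds.mean(axis=0) >= 0.5).astype(int)
train_accuracy = (majority == y).mean()
print(train_accuracy)
```

A boosting ensemble could not be written this way, because the training of tree number k would need the fitted tree number k-1 as input.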

### 4) What is the benefit of out-of-bag evaluation?

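When a bagging ensemble is trained with bootstrap sampling, each predictor sees only part of the training set (about 63% of the instances on average); the remaining instances are out-of-bag for that predictor. Since almost every instance is out-of-bag for some predictors, those predictors' predictions can be used to make approximately out-of-sample estimates for it. The ensemble can therefore be evaluated without setting aside a separate validation set, which leaves more instances available for training.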
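In scikit-learn this is requested with `oob_score=True`. The sketch below assumes a small synthetic dataset (`make_moons`) and arbitrary hyperparameters, purely for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,   # sample with replacement, so some instances are left out
    oob_score=True,   # score each instance using only predictors that never saw it
    random_state=42,
)
bag_clf.fit(X, y)

# Accuracy estimated from out-of-bag predictions; no validation set was held out.
print(bag_clf.oob_score_)
```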

### 5) What makes extra-trees ensembles more random than regular random forests? How can this extra randomness help? Are extra-trees classifiers slower or faster than regular random forests?

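Extra-trees ensembles work like random forests, except that each node uses a random threshold for every candidate feature instead of searching for the best possible threshold. This extra randomness acts as a form of regularization: it trades a bit more bias for lower variance, which can help when a random forest overfits. Because the training process does not have to search for the best cutoff point at each node of each tree, extra-trees are much faster to train than regular random forests. At prediction time they are neither faster nor slower.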

### 6) If your AdaBoost ensemble underfits the training data, which hyperparameter should you tweak, and how?

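You should try increasing the number of estimators (`n_estimators`). You can also reduce the regularization of the base estimator, or try slightly increasing the learning rate.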

### 7) If your gradient boosting ensemble overfits the training set, should you increase or decrease the learning rate?

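You should decrease the learning rate, so that each tree contributes less to the ensemble. You could also use early stopping to find the right number of predictors, since an overfitting ensemble probably has too many.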
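A minimal sketch of these settings with scikit-learn's `GradientBoostingClassifier`, assuming a small synthetic dataset and illustrative values (the learning rate of 0.05 and the patience of 10 iterations are arbitrary choices, not tuned recommendations).

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

gbrt = GradientBoostingClassifier(
    learning_rate=0.05,       # smaller steps: each tree corrects less
    n_estimators=500,         # upper bound on the number of trees
    n_iter_no_change=10,      # early stopping: stop after 10 rounds without improvement
    validation_fraction=0.1,  # share of the training data held out for early stopping
    random_state=42,
)
gbrt.fit(X, y)

# Number of trees actually trained before early stopping kicked in.
print(gbrt.n_estimators_)
```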

### 8) Load the MNIST dataset, and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation and 10,000 for testing). Then train various classifiers, such as a random forest classifier, an extra-trees classifier, and an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [None]:
#Open the MNIST dataset and split it into training, validation and test sets
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist.data, mnist.target
X_train, y_train = X[:50000], y[:50000]
X_validation, y_validation = X[50000:60000], y[50000:60000]
X_test, y_test = X[60000:], y[60000:]

In [None]:
#Train Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

rf_clf = RandomForestClassifier(n_jobs = -3, random_state = 42)

#Choose some hyperparameters by Grid Search
param_grid = [{'n_estimators':[50, 100, 200], 'max_features':[5, 10, 50, 100, 'sqrt']}]
grid_search_rf = GridSearchCV(rf_clf, param_grid, cv = 3, scoring = "accuracy")
grid_search_rf.fit(X_train, y_train)
best_rf = grid_search_rf.best_estimator_
print(grid_search_rf.best_score_)
print(grid_search_rf.best_params_)

0.9636800003709514
{'max_features': 50, 'n_estimators': 200}


In [None]:
#Train extra-trees classifier
from sklearn.ensemble import ExtraTreesClassifier
et_clf = ExtraTreesClassifier(random_state = 42, n_jobs = -3)

#Choose some hyperparameters by Grid Search
param_grid = [{
    'n_estimators':[50, 100, 200], 
    'max_features':[5, 10, 50, 100, 'sqrt']
    }]
grid_search_et = GridSearchCV(et_clf, param_grid, cv = 3, scoring = "accuracy")
grid_search_et.fit(X_train, y_train)
best_et = grid_search_et.best_estimator_
print(grid_search_et.best_score_)

0.9683399911744955


In [35]:
#Train SVM Classifier
from sklearn.svm import SVC
import numpy as np
svc_clf = SVC(random_state = 42)  #random_state belongs to the constructor; fit() does not accept it
svc_clf.fit(X_train, y_train)
np.mean(cross_val_score(svc_clf, X_train, y_train, n_jobs = -3))

np.float64(0.9747)

In [44]:
#Train KNN classifier
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
#Choose some hyperparameters by Grid Search
param_grid = [{
    'n_neighbors':[1, 5, 10, 50], 
    }]
grid_search_knn = GridSearchCV(knn_clf, param_grid, cv = 3, scoring = "accuracy", n_jobs = -3)
grid_search_knn.fit(X_train, y_train)
best_knn = grid_search_knn.best_estimator_
print(grid_search_knn.best_score_)


0.9650599703714553


In [45]:
#Combine into a Voting Classifier ensemble
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(
    estimators = [
     ('rf', best_rf),
     ('et', best_et),
     ('svc', svc_clf),
     ('knn', best_knn)
    ],
    voting = "hard"  #hard voting; soft voting would require SVC(probability = True)
)

voting_clf.fit(X_train, y_train)

In [None]:
classifiers = [best_rf, best_et, svc_clf, best_knn, voting_clf]
for classifier in classifiers:
    print(classifier.score(X_validation, y_validation))

0.9745
0.9764
0.9802
0.9712
0.9803


As we can see, on the validation set the voting classifier (0.9803) performs essentially on par with the SVC (0.9802) and better than the other three classifiers. Let's keep it and try it on the test set.

In [47]:
classifiers = [best_rf, best_et, svc_clf, best_knn, voting_clf]
for classifier in classifiers:
    print(classifier.score(X_test, y_test))

0.9704
0.9748
0.9785
0.9666
0.978

On the test set, the voting classifier (0.978) scores marginally below the SVC (0.9785) and above the random forest, extra-trees and KNN classifiers, so the ensemble brings no real gain over the best individual classifier here.


### 9) Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set. Congratulations - you have just trained a blender, and together with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. How does it compare to the voting classifier you trained earlier? Now try again using a *StackingClassifier* instead. Do you get better performance? If so, why?

In [55]:
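#Build the blender's training set: one column of validation-set predictions per classifier
#Note: `classifiers` still includes voting_clf, so its predictions become a feature too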
X_blender = np.zeros((X_validation.shape[0], len(classifiers)))
for i, classifier in enumerate(classifiers):
    X_blender[:,i] = classifier.predict(X_validation)


In [75]:
#Train a blender (a random forest) on the base classifiers' predictions
blend_rf = RandomForestClassifier(random_state = 42)

#Choose some hyperparameters by Grid Search; GridSearchCV refits the best model on all of X_blender
#Note: X_blender has only len(classifiers) columns, so the larger max_features values in this grid add nothing useful
param_grid = [{'n_estimators':[50, 100, 200], 'max_features':[5, 10, 50, 100, 'sqrt']}]
blend_rf_cv = GridSearchCV(blend_rf, param_grid, cv = 3, n_jobs = -3)
blend_rf_cv.fit(X_blender, y_validation)
best_blend_rf = blend_rf_cv.best_estimator_

In [70]:
###Evaluate on the test set
#Create new X test set as the individual classifiers predictions for each predictor
X_blender_test = np.zeros((X_test.shape[0], len(classifiers)))
for i, classifier in enumerate(classifiers):
    X_blender_test[:, i] = classifier.predict(X_test)

In [76]:
#Predict y values from blender
print(best_blend_rf.score(X_blender_test, y_test))

0.9782


The blender (0.9782) performed slightly better than the voting classifier (0.978), but still slightly worse than the SVC (0.9785). Now let's use a *StackingClassifier*.

In [80]:
from sklearn.ensemble import StackingClassifier
stacking_clf = StackingClassifier(
    estimators = [
        ('rf', best_rf),
        ('et', best_et),
        ('knn', best_knn),
        ('svc_clf', svc_clf),
        ('voting', voting_clf)
    ]
)

stacking_clf.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [81]:
stacking_clf.score(X_test, y_test)

0.9816

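The result (0.9816) was noticeably better than both the hand-made blender (0.9782) and the SVC (0.9785), even though the final estimator did not converge. A likely reason is that *StackingClassifier* trains its final estimator on out-of-fold predictions obtained by cross-validation over the whole training set, rather than on predictions for the 10,000 validation instances only. By default it also feeds the final estimator class probabilities or decision scores where available, instead of hard class labels, which gives the blender more information to work with.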
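The convergence warning shown above points to logistic regression, which is the default final estimator of *StackingClassifier*. One way to address it is to pass a final estimator with a higher `max_iter`. The sketch below shows this on the small `load_digits` dataset rather than MNIST so that it runs quickly; the base estimators and the value `max_iter=1000` are illustrative assumptions, not the configuration used above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small 8x8 digits dataset, used here only to keep the example fast.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ],
    # Explicit blender with more iterations than the default of 100.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # the blender is trained on out-of-fold predictions from 5-fold CV
)
stack.fit(X_tr, y_tr)

stack_score = stack.score(X_te, y_te)
print(stack_score)
```

Whether this changes the MNIST score is not something this sketch can tell; it only shows where the setting goes.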