## Ensemble Learning and Random Forests Exercises

1. **Combining Models with 95% Precision**
   Yes. Combining the models into an ensemble, for example with voting or stacking, can give better results than any individual model. Stacking feeds the predictions of each model to a final predictor (the blender), which makes the final decision. The gain depends on the models making different, ideally uncorrelated, errors: if they all fail on the same instances, combining them helps little.
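
To make the role of uncorrelated errors concrete, here is a minimal sketch that swaps stacking for plain majority voting, because that case has a closed form. It assumes a binary task, five models whose errors are fully independent, and it uses "probability of being correct" as a stand-in for the 95% figure. Real models trained on the same data are rarely that independent, so this is an upper bound, and the function name is illustrative.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )

# Five independent models, each correct 95% of the time (assumed values).
ensemble_accuracy = majority_vote_accuracy(n_models=5, p_correct=0.95)
print(f"{ensemble_accuracy:.4f}")  # prints 0.9988
```

With correlated errors the majority is wrong more often than this calculation suggests, which is why diverse models help.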

2. **Hard vs. Soft Voting Classifiers**
   The difference between hard and soft voting classifiers is in how they aggregate the predictions of the individual learners:
   - *Hard voting* predicts the final class based on the majority vote of the classifiers.
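   - *Soft voting* averages the class probabilities estimated by the classifiers and predicts the class with the highest average probability. It requires every classifier to estimate class probabilities (in scikit-learn, to provide `predict_proba()`, which is why the `SVC` in the code that follows is created with `probability=True`). Soft voting often performs better than hard voting because it gives more weight to highly confident votes.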
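
The two aggregation rules can disagree. The sketch below is a plain-Python illustration, not scikit-learn's implementation; the probabilities are made up so that two weakly confident classifiers are outvoted by one highly confident classifier under soft voting.

```python
from collections import Counter

# Hypothetical class probabilities [P(class 0), P(class 1)] from three
# classifiers for a single instance; the numbers are made up.
probas = [
    [0.55, 0.45],  # classifier A: weakly prefers class 0
    [0.60, 0.40],  # classifier B: weakly prefers class 0
    [0.05, 0.95],  # classifier C: strongly prefers class 1
]

def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the probabilities per class and pick the highest average."""
    n_classes = len(probas[0])
    mean = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

print(hard_vote(probas))  # prints 0 (two votes against one)
print(soft_vote(probas))  # prints 1 (average P(class 1) is 0.6)
```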

3. **Speeding Up Ensemble Training**
   A bagging ensemble can be trained faster by distributing it across multiple servers, since each predictor in the ensemble is independent of the others. The same holds for pasting ensembles and Random Forests, which are built the same way. Boosting ensembles cannot be parallelized like this: each predictor is trained to correct the errors of its predecessors, so training is inherently sequential. In a stacking ensemble, the predictors within one layer are independent and can be trained in parallel, but each layer can only be trained after the previous layer is done.
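
As a rough sketch of why independent predictors parallelize, the toy example below trains each "predictor" (just the mean of a bootstrap sample) in its own task. The data, the `train_predictor` helper, and the use of threads in place of separate servers are all assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Toy 1-D regression targets; the values are made up for illustration.
y_train = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

def train_predictor(seed):
    """Fit one toy predictor (the mean of a bootstrap sample) independently."""
    rng = random.Random(seed)
    sample = rng.choices(y_train, k=len(y_train))  # sampling with replacement
    return sum(sample) / len(sample)

# No predictor needs another predictor's output, so the tasks can run
# concurrently. Threads stand in for the servers or cores of a real setup.
with ThreadPoolExecutor(max_workers=4) as pool:
    predictors = list(pool.map(train_predictor, range(10)))

# Aggregate by averaging, as a bagging regressor would.
bagged_prediction = sum(predictors) / len(predictors)
print(len(predictors), round(bagged_prediction, 2))
```

In scikit-learn, `BaggingClassifier` exposes an `n_jobs` parameter that spreads this kind of work over CPU cores on a single machine.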

4. **Benefits of Out-of-Bag Evaluation**
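   Out-of-bag (OOB) evaluation gives an unbiased estimate of the ensemble's performance without a separate validation set. In bagging, each predictor is trained on a bootstrap sample drawn with replacement from the training set, so on average about 37% of the training instances are never seen by that predictor. These out-of-bag instances act as a held-out set: each instance is evaluated using only the predictors that did not train on it. Since no validation set has to be set aside, more instances remain available for training.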
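
The size of the out-of-bag set can be checked with a short simulation. This sketch assumes standard bagging, where each predictor draws as many instances as the training set contains, with replacement; the helper name and the sample sizes are illustrative.

```python
import random

def oob_fraction(n_instances, seed):
    """Fraction of instances never drawn in one bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_instances) for _ in range(n_instances)}
    return 1 - len(drawn) / n_instances

# Average over 20 simulated predictors, each with its own bootstrap sample.
fractions = [oob_fraction(n_instances=10_000, seed=s) for s in range(20)]
mean_oob = sum(fractions) / len(fractions)

# Theoretical value: the chance that one instance is missed n times in a row.
expected = (1 - 1 / 10_000) ** 10_000

print(f"simulated OOB fraction: {mean_oob:.3f}")
print(f"theoretical (1 - 1/n)^n: {expected:.3f}")
```

Both numbers land close to 1/e, roughly 0.37, so each predictor leaves out a bit more than a third of the training set.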

5. **What Makes Extra-Trees More Random than Random Forests**
   Extra-Trees (Extremely Randomized Trees) add randomness to the way splits are chosen. As in a regular Random Forest, each node considers only a random subset of the features. A Random Forest then searches for the best possible threshold for each candidate feature, whereas Extra-Trees draw a random threshold for each candidate feature and keep the best of those random splits. This extra randomness acts as a form of regularization: it trades slightly higher bias for lower variance. Extra-Trees are also generally faster to train, because finding the best threshold for each feature at every split is one of the most time-consuming tasks in training a Random Forest.
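
The difference in how a threshold is picked for one feature can be sketched as follows. This is a simplified single-feature, binary-label illustration with made-up data and hypothetical helper names, not the actual tree-building code of any library.

```python
import random

def gini(labels):
    """Gini impurity for binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2 * p1 * (1 - p1)

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two sides of a split."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Random Forest style: scan every midpoint between distinct values."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))

def random_threshold(values, rng):
    """Extra-Trees style: draw one threshold uniformly in the feature range."""
    return rng.uniform(min(values), max(values))

values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]

t_best = best_threshold(values, labels)
t_rand = random_threshold(values, random.Random(0))
print(t_best, split_impurity(values, labels, t_best))  # prints 5.0 0.0
```

The exhaustive search evaluates every candidate split, while the random draw evaluates a single one, which is where the training speed-up comes from.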

6. **Hyperparameter Tweaking for AdaBoost Underfitting**
   If an AdaBoost ensemble is underfitting the training data, you might want to:
   - Increase the number of estimators, allowing the model to fit the training data more closely.
   - Reduce the regularization hyperparameters of the base estimator, if applicable, to allow more complex models.
   - Increase the learning rate to put more focus on correcting the errors of the preceding predictors.
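
To see why the learning rate matters here, the sketch below evaluates the standard AdaBoost predictor weight for a binary task, `eta * log((1 - r) / r)`, where `r` is the predictor's weighted error rate. It assumes that textbook formula; the 20% error rate and the function name are illustrative.

```python
from math import exp, log

def predictor_weight(error_rate, learning_rate=1.0):
    """AdaBoost predictor weight (binary case): eta * log((1 - r) / r)."""
    return learning_rate * log((1 - error_rate) / error_rate)

# A predictor with a 20% weighted error rate (assumed value).
for eta in (0.5, 1.0):
    alpha = predictor_weight(error_rate=0.2, learning_rate=eta)
    # Misclassified instances have their weight multiplied by exp(alpha)
    # before the weights are normalized.
    print(f"eta={eta} alpha={alpha:.3f} boost={exp(alpha):.2f}")
# prints:
# eta=0.5 alpha=0.693 boost=2.00
# eta=1.0 alpha=1.386 boost=4.00
```

A larger learning rate makes each round boost the misclassified instances more strongly, so the ensemble corrects its errors in fewer rounds.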

7. **Learning Rate Adjustment to Combat Overfitting**
   When a Gradient Boosting ensemble overfits the training data, it is advisable to **decrease the learning rate**. Each tree then contributes less to the ensemble, so more weak learners are needed to fit the training data, and the ensemble is less likely to fit the noise in it. Early stopping can also help find the right number of predictors, since an overfitting ensemble probably has too many.
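
The effect of the learning rate can be shown with a tiny hand-rolled gradient booster. This is a sketch under simplifying assumptions (one feature, squared error, single-split stumps, made-up data), not scikit-learn's `GradientBoostingRegressor`.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (squared error)."""
    best = None
    order = sorted(set(xs))
    for a, b in zip(order, order[1:]):
        t = (a + b) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boosted_train_mse(xs, ys, n_trees, learning_rate):
    """Training MSE after fitting n_trees stumps on successive residuals."""
    preds = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # Shrinkage: only a fraction of each stump's prediction is added.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.3, 5.8, 7.2]  # made-up noisy linear data

mse_fast = boosted_train_mse(xs, ys, n_trees=20, learning_rate=1.0)
mse_slow = boosted_train_mse(xs, ys, n_trees=20, learning_rate=0.1)
print(mse_fast < mse_slow)  # prints True
```

With the same number of trees, the lower learning rate fits the training set less closely, which is the regularizing effect described above.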



```python
from sklearn.datasets import fetch_openml
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1. Load the MNIST dataset as NumPy arrays (the labels are the strings '0' to '9')
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"]

# 2. Split into 50,000 training, 10,000 validation, and 10,000 test instances
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=10000, random_state=42)

# 3. Train various classifiers
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
# probability=True enables predict_proba(), which soft voting needs (it slows down training)
svm_clf = SVC(gamma='scale', probability=True, random_state=42)

estimators = [random_forest_clf, extra_trees_clf, svm_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

# 4. Evaluate individual classifiers on the validation set
for estimator in estimators:
    y_pred = estimator.predict(X_val)
    print(estimator.__class__.__name__, accuracy_score(y_val, y_pred))

# 5. Combine them into an ensemble
# Note: VotingClassifier.fit() clones and retrains every estimator,
# so this repeats the training work done in step 3.
voting_clf = VotingClassifier(
    estimators=[('rf', random_forest_clf), ('et', extra_trees_clf), ('svc', svm_clf)],
    voting='soft'  # or 'hard' for hard voting
)
voting_clf.fit(X_train, y_train)

# 6. Evaluate the ensemble on the validation set
y_pred = voting_clf.predict(X_val)
print("Ensemble accuracy on validation set:", accuracy_score(y_val, y_pred))

# 7. Evaluate the ensemble on the test set
y_pred = voting_clf.predict(X_test)
print("Ensemble accuracy on test set:", accuracy_score(y_test, y_pred))

# Comparison with individual classifiers
for estimator in estimators:
    y_pred = estimator.predict(X_test)
    print(estimator.__class__.__name__, "accuracy on test set:", accuracy_score(y_test, y_pred))
```


```
Training the RandomForestClassifier(random_state=42)
Training the ExtraTreesClassifier(random_state=42)
Training the SVC(probability=True, random_state=42)
RandomForestClassifier 0.9692
ExtraTreesClassifier 0.9715
SVC 0.9788
Ensemble accuracy on validation set: 0.9791
Ensemble accuracy on test set: 0.9767
RandomForestClassifier accuracy on test set: 0.9645
ExtraTreesClassifier accuracy on test set: 0.9691
SVC accuracy on test set: 0.976
```


```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 1. Create a new training set for the blender: one column per classifier,
#    holding the digit it predicts for each validation instance.
#    Storing labels as floats treats the digits as ordinal values, which is a simplification.
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)

# 2. Train the blender on the validation-set predictions
#    (with default settings, the solver may hit its iteration limit and emit a warning)
blender = LogisticRegression()
blender.fit(X_val_predictions, y_val)

# 3. Evaluate the ensemble on the test set
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

y_pred = blender.predict(X_test_predictions)
stacking_accuracy = accuracy_score(y_test, y_pred)
print("Stacking ensemble accuracy on test set:", stacking_accuracy)

# Compare to the voting classifier's accuracy
voting_clf_accuracy = accuracy_score(y_test, voting_clf.predict(X_test))
print("Voting classifier accuracy on test set:", voting_clf_accuracy)

# Determine if the stacking ensemble outperforms the voting classifier
if voting_clf_accuracy < stacking_accuracy:
    print("The stacking ensemble outperforms the voting classifier.")
else:
    print("The voting classifier outperforms the stacking ensemble.")
```


```
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Stacking ensemble accuracy on test set: 0.9648
Voting classifier accuracy on test set: 0.9767
The voting classifier outperforms the stacking ensemble.
```
