# Chapter 7: Ensemble Learning

---

## Introduction
Ensemble learning is a machine learning technique that combines the predictions of multiple models to improve overall performance, an idea often called ***the wisdom of the crowd***. By leveraging the strengths of diverse models, an ensemble can reduce variance, reduce bias, or both. Common ensemble methods include bagging, boosting, and stacking.

---

## 7.1 Voting Classifiers

- 🧠 Training Multiple Classifiers  
Suppose you train several classifiers like Logistic Regression, SVM, Random Forest, and KNN, each with ~80% accuracy.  
These diverse models form the foundation for ensemble learning.


- 📉 The Role of Independence  
Ensembles work best when the classifiers make largely uncorrelated errors.  
Classifiers trained on the same data tend to make the same kinds of errors, which reduces independence and limits the ensemble's gain.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Applying the method on moons dataset
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.metrics import classification_report, accuracy_score


- 🔁 Weak Learners Become Strong  
Even weak learners (slightly better than random) can form a strong learner if they are diverse and numerous.  
This is the power of collective intelligence in ensembles.
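
The effect can be checked with a small simulation. This is a minimal sketch, assuming 1,000 hypothetical classifiers that are each correct only 51% of the time; the names `n_classifiers`, `n_instances`, and `p_correct` are illustrative and not part of the notebook. It compares a majority vote over independent classifiers with a majority vote over classifiers whose errors are fully correlated.

```python
import numpy as np

rng = np.random.default_rng(42)
n_classifiers, n_instances, p_correct = 1000, 2000, 0.51  # hypothetical setup

# Independent errors: each classifier is right on each instance with probability 0.51
independent = rng.random((n_instances, n_classifiers)) < p_correct
majority_independent = (independent.sum(axis=1) > n_classifiers / 2).mean()

# Fully correlated errors: every classifier copies the same single answer
single = rng.random(n_instances) < p_correct
correlated = np.repeat(single[:, None], n_classifiers, axis=1)
majority_correlated = (correlated.sum(axis=1) > n_classifiers / 2).mean()

print("Majority vote, independent errors:", majority_independent)
print("Majority vote, correlated errors: ", majority_correlated)
```

With independent errors the majority vote is right far more often than any single classifier, while with fully correlated errors it is no better than one classifier.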

In [3]:
# Loading data
X_moons, y_moons = make_moons(n_samples=500, noise=0.3, random_state=42)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X_moons, y_moons, test_size=0.2, random_state=42)

# Defining the model
voting_clf = VotingClassifier(
    estimators=[
        ('log_reg', LogisticRegression(random_state=42)),
        ('svm_clf', SVC(random_state=42)),
        ('rf_clf', RandomForestClassifier(random_state=42))
    ]
)

voting_clf.fit(X_train, y_train)

In [4]:
# Predictions
y_train_preds = voting_clf.predict(X_train)
y_test_preds = voting_clf.predict(X_test)

In [5]:
# Overall Performance
print("Train Accuracy: ", accuracy_score(y_train, y_train_preds))
print("Test Accuracy: ", accuracy_score(y_test, y_test_preds))
print(classification_report(y_test, y_test_preds))

Train Accuracy:  0.93
Test Accuracy:  0.87
              precision    recall  f1-score   support

           0       0.77      1.00      0.87        43
           1       1.00      0.77      0.87        57

    accuracy                           0.87       100
   macro avg       0.88      0.89      0.87       100
weighted avg       0.90      0.87      0.87       100



In [6]:
# Performance of each individual classifier (the fitted clones live in named_estimators_)
for name, clf in voting_clf.named_estimators_.items():
    print(name, " = ", clf.score(X_test, y_test))

log_reg  =  0.85
svm_clf  =  0.87
rf_clf  =  0.88


**One thing worth mentioning is that `VotingClassifier` applies hard voting by default.**

- 🗳️ Hard Voting Classifier  
By aggregating their predictions and selecting the majority vote, you create a *hard voting classifier*.  
This simple ensemble often outperforms individual classifiers.
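
Majority voting itself is simple to write by hand. The sketch below assumes a small, made-up matrix of class predictions (rows are classifiers, columns are instances); `hard_vote` is a hypothetical helper, not a Scikit-Learn function.

```python
import numpy as np

# Hypothetical class predictions: one row per classifier, one column per instance
predictions = np.array([
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
])

def hard_vote(preds, n_classes):
    """Return the most voted class for each instance (column)."""
    # Count the votes per class for every column, giving shape (n_classes, n_instances)
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return counts.argmax(axis=0)

majority = hard_vote(predictions, n_classes=2)
print(majority)  # [0 1 1 0 1]
```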

In [7]:
# Now try soft voting (the SVC must expose class probabilities for this)
voting_clf.voting = 'soft'
voting_clf.named_estimators['svm_clf'].probability = True

In [8]:
# fitting
voting_clf.fit(X_train, y_train)

# Eval
voting_clf.score(X_test, y_test)

0.89

**Soft voting often achieves higher accuracy than hard voting because it gives more weight to highly confident votes.**
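
To see how the two schemes can disagree, here is a minimal sketch with made-up probabilities for a single instance. The array `proba_class1` is an assumption chosen for illustration: two classifiers lean weakly toward class 0, and one is very confident in class 1.

```python
import numpy as np

# Hypothetical P(class 1) reported by three classifiers for one instance
proba_class1 = np.array([0.45, 0.45, 0.90])

# Hard voting: each classifier casts one vote, the majority wins
hard_votes = (proba_class1 >= 0.5).astype(int)                 # [0, 0, 1]
hard_prediction = int(hard_votes.sum() > len(hard_votes) / 2)  # 1 vote out of 3, so class 0

# Soft voting: average the probabilities, then threshold
soft_prediction = int(proba_class1.mean() >= 0.5)              # mean is 0.6, so class 1

print("Hard voting predicts:", hard_prediction)
print("Soft voting predicts:", soft_prediction)
```

The single confident classifier is outvoted under hard voting but tips the average under soft voting.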

---

## 7.2 Bagging & Pasting

- 🔄 Bagging & Pasting  
Both use random subsets to train multiple models.  
Bagging = with replacement; Pasting = without replacement.

- 📊 Prediction Aggregation  
Combine model outputs via majority vote (classification) or averaging (regression).  
Aggregation reduces variance while leaving bias roughly unchanged compared with a single predictor trained on the full training set.
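
The difference between the two sampling schemes can be sketched with NumPy alone. The sizes below (`n_train = 400`, `max_samples = 100`) are assumptions chosen to mirror an 80% split of 500 instances with 100 samples per predictor.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, max_samples = 400, 100  # assumed sizes for illustration

# Bagging: sample indices WITH replacement, so duplicates are possible
bagging_idx = rng.choice(n_train, size=max_samples, replace=True)

# Pasting: sample indices WITHOUT replacement, so every index is unique
pasting_idx = rng.choice(n_train, size=max_samples, replace=False)

n_unique_bagging = len(np.unique(bagging_idx))
n_unique_pasting = len(np.unique(pasting_idx))

print("Unique instances in the bagging sample:", n_unique_bagging)
print("Unique instances in the pasting sample:", n_unique_pasting)
```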


In [9]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    n_jobs=-1,
    random_state=42
)

pasting_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=False,
    n_jobs=-1,
    random_state=42
)

- 🚀 Fast & Parallel  
Training and prediction can run in parallel *(n_jobs=-1)*, making these methods highly scalable.

In [10]:
bagging_clf.fit(X_train, y_train)
pasting_clf.fit(X_train, y_train)

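**The difference is small here, most likely because the dataset is small, but in general:**
- 🎯 Bias vs. Variance  
Bagging ends up with slightly higher bias than pasting, but the extra diversity of the bootstrapped subsets makes the predictors less correlated, which lowers the ensemble's variance.  
For this reason bagging is usually preferred over pasting.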

In [11]:
print("Bagging Performance: ", bagging_clf.score(X_test, y_test))
print("Pasting Performance: ", pasting_clf.score(X_test, y_test))

Bagging Performance:  0.9
Pasting Performance:  0.91


---

### Out-of-Bag Evaluation

**🌱 Bagging & Out-of-Bag (OOB) Evaluation**
- Bagging randomly samples training data with replacement—about 63% of data is used per predictor, leaving ~37% as out-of-bag (OOB) instances.  
- These OOB instances act like a built-in validation set, allowing performance to be estimated without extra data.  
In Scikit-Learn, setting `oob_score=True` enables this automatic evaluation, with results available via `oob_score_`.
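
The 63% / 37% split follows from sampling m instances with replacement from a training set of size m: the chance that a given instance is picked at least once is 1 - (1 - 1/m)^m, which approaches 1 - 1/e as m grows. A quick numerical check, with `m = 10_000` as an arbitrary illustrative size:

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000  # arbitrary training-set size for illustration

# Draw one bootstrap sample of size m and measure how many distinct instances it contains
sample = rng.integers(0, m, size=m)
in_bag_fraction = len(np.unique(sample)) / m

# Closed-form probability that a given instance appears at least once
theoretical = 1 - (1 - 1 / m) ** m

print("Simulated in-bag fraction: ", in_bag_fraction)
print("Theoretical in-bag fraction:", theoretical)
print("Limit 1 - 1/e:              ", 1 - np.exp(-1))
```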

In [12]:
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    oob_score=True,
    n_jobs=-1,
    random_state=42
)

bagging_clf.fit(X_train, y_train)

In [13]:
print("OOB Score is: ", bagging_clf.oob_score_)

OOB Score is:  0.91


In [14]:
# Compare the OOB estimate with the accuracy on the held-out test set
print("Accuracy is: ", accuracy_score(y_test, bagging_clf.predict(X_test)))

Accuracy is:  0.88


**📊 Accuracy & Decision Function**  
- OOB accuracy serves as an estimate of test accuracy: in the run above, the OOB score is 0.91 and the test accuracy is 0.88.  
- The `oob_decision_function_` gives class probabilities for each instance, offering insight into prediction confidence.  
- This makes OOB evaluation both efficient and informative in ensemble learning.


In [15]:
bagging_clf.oob_decision_function_[:3]

array([[1.        , 0.        ],
       [0.        , 1.        ],
       [0.03157895, 0.96842105]])

---

## 7.3 Random Forests

In [16]:
# scikit-learn RF classifier
rf_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)

# A roughly equivalent Random Forest built with bagging (note: these trees are capped at 16 leaf nodes, unlike rf_clf above)
bagging_rf_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_features='sqrt', max_leaf_nodes=16),
    n_estimators=500,
    n_jobs=-1,
    random_state=42
)

rf_clf.fit(X_train, y_train)
bagging_rf_clf.fit(X_train, y_train)

In [17]:
print("Bagging Performance: ", accuracy_score(y_test, bagging_rf_clf.predict(X_test)))
print("RF Performance: ", accuracy_score(y_test, rf_clf.predict(X_test)))

Bagging Performance:  0.89
RF Performance:  0.88


**Almost the same**

---

### More Randomness

In [18]:
# For more randomness (as in Extra-Trees), set the tree's splitter to 'random'
bagging_rf_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_features='sqrt', max_leaf_nodes=16, splitter='random'),
    n_estimators=500,
    n_jobs=-1,
    random_state=42
)

bagging_rf_clf.fit(X_train, y_train)

In [19]:
print("Performance: ", accuracy_score(y_test, bagging_rf_clf.predict(X_test)))

Performance:  0.91


**A 2-percentage-point accuracy gain (0.89 to 0.91) on this test set.**

---

### 🌟 Feature Importance  
- 🌲 Random Forests provide a convenient way to measure feature importance by evaluating how much each feature reduces impurity across all trees.  
- 🧠 Scikit-Learn computes these importance scores automatically and scales them to sum to 1. This helps in identifying which features are most relevant for prediction and supports effective feature selection.


In [20]:
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)

# train
rf_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rf_clf.fit(iris.data, iris.target)

for name, score in zip(iris.feature_names, rf_clf.feature_importances_):
  print(name, ': ', round(score, 3))

sepal length (cm) :  0.112
sepal width (cm) :  0.023
petal length (cm) :  0.441
petal width (cm) :  0.423


---

## 7.4 🚀 Boosting  
- 🔁 Boosting is an ensemble method that combines several weak learners into a stronger one by training them sequentially. Each new model tries to correct the errors of the previous one. The two most common boosting methods are **AdaBoost** and **Gradient Boosting**.


### ⚡ AdaBoost  
- 🎯 AdaBoost improves model performance by giving more weight to misclassified instances during training. It sequentially trains predictors, with each focusing more on difficult cases.  
- ⚠️ However, this process cannot be parallelized easily and may not scale as well as other methods.


In [21]:
from sklearn.ensemble import AdaBoostClassifier

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=30,
    random_state=42
)

ada.fit(X_train, y_train)

In [22]:
# Eval
print("Evaluation is: ", ada.score(X_test, y_test))

Evaluation is:  0.9666666666666667


**🛠️ AdaBoost Variants**  
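- 🧪 Scikit-Learn implements a multiclass version of AdaBoost called **SAMME**; when the predictors can estimate class probabilities, the **SAMME.R** variant relies on those probabilities rather than on hard predictions.  
- 📊 Both variants adapt AdaBoost to multiple classes, and **SAMME.R** has typically performed better thanks to its use of probability estimates. Recent Scikit-Learn releases have deprecated **SAMME.R**, so check the documentation for the version you are using.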

---

### 🌱 Gradient Boosting  
- 📉 Gradient Boosting builds predictors sequentially, each trying to correct the residual errors of the previous ones. Unlike AdaBoost, it does not adjust instance weights but fits new predictors to the residuals.  
- 🏆 It is widely used for both regression and classification tasks.

In [23]:
# Quadratic data with Gaussian noise
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = X[:, 0]**2 + 0.05*np.random.randn(100)

In [24]:
# Hand-made gradient-boosted regression trees (GBRT)
from sklearn.tree import DecisionTreeRegressor

tree1_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree2_reg = DecisionTreeRegressor(max_depth=2, random_state=43)
tree3_reg = DecisionTreeRegressor(max_depth=2, random_state=44)

tree1_reg.fit(X, y)
y1 = y - tree1_reg.predict(X)

tree2_reg.fit(X, y1)
y2 = y1 - tree2_reg.predict(X)

tree3_reg.fit(X, y2)

In [25]:
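# Final review of the boosted regressor: the ensemble prediction is the sum of the three trees' predictions.
# Score it against the true targets y (scoring a tree against its own predictions would trivially give 1.0).
from sklearn.metrics import r2_score

y_pred = sum(tree.predict(X) for tree in (tree1_reg, tree2_reg, tree3_reg))
print("R^2 on training data: ", r2_score(y, y_pred))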
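
Scikit-Learn's `GradientBoostingRegressor` automates the same sequential fitting of trees to residuals. The sketch below regenerates the same kind of noisy quadratic data under new names (`X_quad`, `y_quad`, chosen here to avoid clobbering the notebook's variables) and trains three depth-2 trees with `learning_rate=1.0`, so that each tree's contribution is added at full strength as in the hand-made version.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Same kind of noisy quadratic data as in the hand-made example
np.random.seed(42)
X_quad = np.random.rand(100, 1) - 0.5
y_quad = X_quad[:, 0] ** 2 + 0.05 * np.random.randn(100)

# Three shallow trees, each fitted to the residuals left by the previous ones
gbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42
)
gbrt.fit(X_quad, y_quad)

train_r2 = gbrt.score(X_quad, y_quad)  # R^2 on the training data
print("Number of trees:", len(gbrt.estimators_))
print("R^2 on training data:", train_r2)
```

Lowering `learning_rate` shrinks each tree's contribution, which usually calls for more trees but tends to generalize better.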


### 🌄 Histogram-based Gradient Boosting (HistGradientBoosting)
- Histogram-based Gradient Boosting is a highly efficient variant of traditional Gradient Boosting, introduced in Scikit-learn to speed up training on large datasets.
- It works by binning continuous features into discrete bins (histograms), which reduces the number of split points to evaluate and improves training speed without significantly compromising performance.


**✅ Advantages**
- Faster training by feature binning.
- Low memory usage.
- Supports missing values natively.
- Scales better on large datasets compared to regular Gradient Boosting.
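
The binning idea can be illustrated with plain NumPy. This is a simplified sketch of quantile-based binning, not the library's actual implementation; `n_bins = 255` is an assumption mirroring the estimator's default `max_bins`, and the feature values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(42)
feature = rng.normal(size=10_000)  # synthetic continuous feature
n_bins = 255                       # assumed bin count for illustration

# Interior bin edges placed at evenly spaced quantiles of the feature
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])

# Replace every value with the integer index of the bin it falls into
binned = np.digitize(feature, edges).astype(np.uint8)

print("Distinct raw values:   ", len(np.unique(feature)))
print("Distinct binned values:", len(np.unique(binned)))
```

A tree split search now only has to consider at most 254 thresholds per feature instead of one per distinct value, and each value fits in a single byte.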

In [26]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Initialize and train model
model = HistGradientBoostingClassifier()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")


Accuracy: 1.0


---

## 7.5 Stacking

### 🧠 Concept
- Stacking combines multiple models (first layer) and uses a meta-learner (blender) to learn how to best combine their predictions.

In [27]:
# ! pip install deslib

### ⚙️ Training Process
- Train the base models on one subset, then use their predictions on a hold-out set as the input features for training the meta-learner.
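
A minimal hold-out blending sketch of this process is shown below. The dataset, the choice of base models, and the helper `base_features` are assumptions made for illustration; the point is only that the blender is trained on predictions for data the base models never saw.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_rest, X_te, y_rest, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)
# Split the remaining data: one half for the base models, one half held out for the blender
X_base, X_hold, y_base, y_hold = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# 1. Train the first-layer models on the first subset only
base_models = [
    LogisticRegression(random_state=42),
    DecisionTreeClassifier(max_depth=4, random_state=42),
]
for model in base_models:
    model.fit(X_base, y_base)

def base_features(models, X):
    """One column per base model: its predicted probability of class 1."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# 2. Train the blender on the base models' predictions for the hold-out set
holdout_blender = LogisticRegression(random_state=42)
holdout_blender.fit(base_features(base_models, X_hold), y_hold)

# 3. Evaluate the whole pipeline on the test set
blend_accuracy = holdout_blender.score(base_features(base_models, X_te), y_te)
print("Blender accuracy:", blend_accuracy)
```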

In [28]:
from sklearn.datasets import make_classification
from deslib.des.knora_e import KNORAE

# 1. Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=15, n_redundant=5,
                           n_classes=2, random_state=42)

# 2. Split into train, DSEL (Dynamic Selection Set), and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.5, random_state=42)
X_dsel, X_test, y_dsel, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# 3. Train a pool of classifiers
pool_classifiers = RandomForestClassifier(n_estimators=10, random_state=42)
pool_classifiers.fit(X_train, y_train)

# 4. Initialize and fit the KNORAE dynamic ensemble model
knorae = KNORAE(pool_classifiers=pool_classifiers)
knorae.fit(X_dsel, y_dsel)

# 5. Predict and evaluate
y_pred = knorae.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print(f"KNORAE Accuracy: {acc}")

KNORAE Accuracy: 0.784




### 🧱 Multi-Layer Extensions
- You can stack multiple layers of models by splitting data into more subsets, training each layer sequentially on the predictions from the one before.

In [29]:
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

stack_clf = StackingClassifier(
    estimators=[
        ('log_clf', LogisticRegression(random_state=42)),
        ('rf_clf', RandomForestClassifier(random_state=42)),
        ('svm_clf', SVC(probability=True, random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=43),
    cv=5
)

stack_clf.fit(X_train, y_train)

In [30]:
print("Stack model accuracy: ", stack_clf.score(X_test, y_test))

Stack model accuracy:  0.904


---

## Conclusion

This chapter focused on ensemble methods, which improve performance by combining multiple models. It covered voting classifiers, bagging and pasting (including Random Forests), boosting (AdaBoost and Gradient Boosting), and stacking. These techniques reduce errors by averaging or correcting the individual models' mistakes, which typically leads to more accurate and robust predictions than a single model.

---

## ⚙️ Practical Exercises

### Question 8

In [31]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)

In [32]:
mnist.data.shape

(70000, 784)

In [33]:
X_train, X_temp, y_train, y_temp = train_test_split(mnist.data, mnist.target, test_size=20000, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [34]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neural_network import MLPClassifier

rf_clf = RandomForestClassifier(random_state=42)
extra_clf = ExtraTreesClassifier(random_state=42)
mlp_clf = MLPClassifier(random_state=42)

In [35]:
rf_clf.fit(X_train, y_train)
extra_clf.fit(X_train, y_train)
mlp_clf.fit(X_train, y_train)

In [37]:
print("Random Forest Performance: ", rf_clf.score(X_val, y_val))
print("Extra Trees Performance: ", extra_clf.score(X_val, y_val))
print("MLP Performance: ", mlp_clf.score(X_val, y_val))

Random Forest Performance:  0.9677
Extra Trees Performance:  0.9689
MLP Performance:  0.96


In [38]:
# Combining all warriors

voting_clf = VotingClassifier(
    estimators=[
        ('rf_clf', RandomForestClassifier(random_state=42)),
        ('extra_clf', ExtraTreesClassifier(random_state=42)),
        ('mlp_clf', MLPClassifier(random_state=42))
    ],
    n_jobs=-1
)

voting_clf.fit(X_train, y_train)

In [39]:
print("Voting Performance on val data: ", voting_clf.score(X_val, y_val))
print("Voting Performance on test data: ", voting_clf.score(X_test, y_test))

Voting Performance on val data:  0.9714
Voting Performance on test data:  0.9708


**The voting classifier slightly outperforms each individual classifier on the validation set (0.9714 vs. 0.9689 for Extra Trees, the best single model).**

---

### Question 9

In [41]:
rf_preds = rf_clf.predict(X_val)
extra_preds = extra_clf.predict(X_val)
mlp_preds = mlp_clf.predict(X_val)

new_X_val = np.array([rf_preds, extra_preds, mlp_preds]).astype(int).T

In [42]:
blender = RandomForestClassifier(random_state=44)
blender.fit(new_X_val, y_val)

In [43]:
rf_test_preds = rf_clf.predict(X_test)
extra_test_preds = extra_clf.predict(X_test)
mlp_test_preds = mlp_clf.predict(X_test)

new_X_test = np.array([rf_test_preds, extra_test_preds, mlp_test_preds]).astype(int).T

In [44]:
print('Blender Performance: ', blender.score(new_X_test, y_test))

Blender Performance:  0.968


**The blender (0.968) and the voting classifier (0.9708) perform almost the same on the test set.**

In [45]:
# Stacking Classifier

stack_clf = StackingClassifier(
    estimators=[
        ('rf_clf', RandomForestClassifier(random_state=42)),
        ('extra_clf', ExtraTreesClassifier(random_state=42)),
        ('mlp_clf', MLPClassifier(random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=44),
    cv=5
)

stack_clf.fit(X_train, y_train)

In [46]:
print("Stacking Performance on val data: ", stack_clf.score(X_val, y_val))
print("Stacking Performance on test data: ", stack_clf.score(X_test, y_test))

Stacking Performance on val data:  0.9768
Stacking Performance on test data:  0.9764
