# Ensemble learning
- Bagging
- Boosting
- Stacking

## Bagging

In [1]:
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

In [2]:
# generate 1000 samples, each represented by 4 features
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
X.shape

(1000, 4)

In [3]:
# show the first 10 samples
X[:10].round(2)

array([[-1.67, -1.3 ,  0.27, -0.6 ],
       [-2.97, -1.09,  0.71,  0.42],
       [-0.6 , -1.37, -3.12,  0.64],
       [-1.07, -1.18, -1.91,  0.66],
       [-1.31, -0.97, -0.15,  1.19],
       [-2.18, -0.97, -0.1 , -0.89],
       [-1.25, -1.13, -0.15,  1.06],
       [-1.35, -1.07,  0.03, -0.11],
       [-1.13, -1.27,  0.74,  0.21],
       [-0.38, -1.09, -0.01,  1.37]])

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)
X_train.shape, X_test.shape

((670, 4), (330, 4))

- Model performance without bagging

In [5]:
SVC().fit(X_train,y_train).score(X_test, y_test)

0.9363636363636364

- Model performance with bagging

In [6]:
clf = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=0,
                        max_samples=0.6, max_features=0.8, bootstrap=True)
clf.fit(X_train, y_train)

In [7]:
clf.score(X_test, y_test)

0.9424242424242424

In [8]:
# make prediction for a test sample
clf.predict([[0, 0, 0, 0]])

array([1])

In [9]:
# predicted probability
clf.predict_proba([[0, 0, 0, 0]])

array([[0.4, 0.6]])

In [10]:
clf.classes_

array([0, 1])

In [11]:
clf.estimators_

[SVC(random_state=2087557356),
 SVC(random_state=132990059),
 SVC(random_state=1109697837),
 SVC(random_state=123230084),
 SVC(random_state=633163265),
 SVC(random_state=998640145),
 SVC(random_state=1452413565),
 SVC(random_state=2006313316),
 SVC(random_state=45050103),
 SVC(random_state=395371042)]

In [12]:
for e in clf.estimators_:
    print(e.predict([[0, 0, 0, 0]]))

ValueError: X has 4 features, but SVC is expecting 3 features as input.

**Question:** can you tell what is wrong with the above implementation?

In [None]:
# your answer here:

max_features=0.8 is set in the BaggingClassifier, so each base estimator (here an SVC) is trained on a random subset of 80% of the features. The full dataset has 4 features, and 0.8 * 4 = 3.2 is rounded down, so each SVC inside the BaggingClassifier is trained on only 3 features.

Therefore, calling e.predict([[0, 0, 0, 0]]) on each estimator e in clf.estimators_ raises a ValueError: the individual SVC estimators expect 3 features, not 4. The ensemble's own clf.predict works because it selects the right columns for each estimator internally. To query an estimator directly, pass it only the columns it was trained on; their indices are stored in clf.estimators_features_.
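One way to query the individual estimators is sketched below. It assumes the same synthetic data and bagging settings as above, but fits on the full dataset for brevity; the names sample and member_preds are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# The base estimator is passed positionally so the sketch does not depend
# on the keyword name, which differs between scikit-learn versions.
clf = BaggingClassifier(SVC(), n_estimators=10, random_state=0,
                        max_samples=0.6, max_features=0.8, bootstrap=True)
clf.fit(X, y)

sample = np.zeros((1, 4))  # hypothetical test point with all 4 features

member_preds = []
for est, feat_idx in zip(clf.estimators_, clf.estimators_features_):
    # Keep only the columns this estimator saw during training.
    # Base estimators predict indices into clf.classes_; here the
    # classes are 0 and 1, so the indices equal the labels.
    member_preds.append(int(est.predict(sample[:, feat_idx])[0]))

print(member_preds)
```

Each entry of clf.estimators_features_ is an array of column indices, so the same slicing works for a whole test matrix, not just a single row.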

## Boosting

In [13]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

In [14]:
X, y = load_iris(return_X_y=True)

- Model performance without boosting

In [15]:
dt = DecisionTreeClassifier(max_depth=1, random_state=0)
dt_scores = cross_val_score(dt, X, y, cv=5)
dt_scores.mean()

0.6666666666666666

- Model performance using boosting

In [16]:
# create a boosting classifier
# check the documentation to see what "estimator=None" means
clf = AdaBoostClassifier(estimator=None, n_estimators=100, algorithm="SAMME")

In [17]:
scores = cross_val_score(clf, X, y, cv=5)
scores.mean()

0.9533333333333334

**Question:** compare model performance before and after boosting, and refer to our lecture to explain the improvement.

In [18]:
# your answer here:

Boosting improves the predictive strength of a model by training weak learners sequentially on reweighted versions of the data. The reweighting makes each new learner focus on the samples that the previous learners misclassified, and the weak learners are then combined into a single strong learner.

Cross-validated accuracy rises from 0.667 for a single depth-1 tree to 0.953 for the boosted ensemble. With estimator=None, AdaBoostClassifier uses a depth-1 decision tree as its base learner, so the two models differ only in the boosting. A single stump makes one split and therefore cannot separate all three iris classes; it underfits, which is a high-bias error. Boosting mainly reduces that bias by combining many stumps into a more flexible decision function, which is why the gain here is so large.
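To see the effect of adding weak learners one at a time, the sketch below tracks test accuracy after each boosting round with staged_score. It assumes a simple stratified hold-out split instead of the 5-fold cross-validation used above, and it leaves the algorithm argument at its default because that default differs between scikit-learn versions; exact numbers will therefore vary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# The default base learner is a depth-1 decision tree (a stump).
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X_tr, y_tr)

# staged_score yields the ensemble's accuracy after each boosting round.
stage_scores = list(booster.staged_score(X_te, y_te))

# Boosting can stop early, so clamp the rounds we report to what exists.
n_stages = len(stage_scores)
for n_rounds in sorted({1, min(5, n_stages), min(10, n_stages), n_stages}):
    print(n_rounds, round(stage_scores[n_rounds - 1], 3))
```

The first entry is the accuracy of a single stump, so comparing it with later entries shows how much of the improvement comes from the additional rounds.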

## Single-layer stacking

In [19]:
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.neighbors import KNeighborsRegressor

# create multiple individual estimators
base_estimators = [('ridge', RidgeCV()),
              ('lasso', LassoCV(random_state=42)),
              ('knr', KNeighborsRegressor(n_neighbors=20, metric='euclidean'))]

In [20]:
from sklearn.ensemble import GradientBoostingRegressor

# create a final estimator
gb_reg = GradientBoostingRegressor(n_estimators=25, subsample=0.5, min_samples_leaf=25,
                                   max_features=1, random_state=42)

In [21]:
from sklearn.ensemble import StackingRegressor

# stack the individual estimators and the final estimator
stack_reg = StackingRegressor(estimators=base_estimators, final_estimator=gb_reg)

In [22]:
# prepare train/test data
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [23]:
# fit the stacked regressors
stack_reg.fit(X_train, y_train)

In [24]:
stack_reg.score(X_test, y_test)

0.5267013426135393

**Question:** what does the R^2 score mean? What value indicates that the model is performing well?

In [None]:
# your answer here:


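The R^2 score (coefficient of determination) measures the fraction of the variance in the target variable that the model explains: R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the target from its mean. A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of the target, and a negative score means it does worse than that. The closer R^2 is to 1, the better the model is performing.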

In the given example, the StackingRegressor's R^2 score of approximately 0.527 on the test set means that the model explains slightly more than half of the variance in the target, which might be acceptable depending on the specific context and benchmarks of the domain.
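The definition can be checked by hand. The sketch below computes R^2 from its two sums of squares on a small made-up example (the y_true and y_pred values are illustrative, not taken from the diabetes data) and compares the result with sklearn.metrics.r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
r2_manual = 1 - ss_res / ss_tot

print(round(r2_manual, 4))                   # 0.9486
print(round(r2_score(y_true, y_pred), 4))    # same value from scikit-learn

# Baseline: always predicting the mean gives R^2 = 0
mean_pred = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, mean_pred)
print(r2_baseline)
```

For regressors, score(X_test, y_test) returns this same quantity, so the 0.527 reported above can be read on this scale.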

## Multi-layer stacking

In [25]:
from sklearn.ensemble import RandomForestRegressor

final_layer_rfr = RandomForestRegressor(n_estimators=10, max_features=1, max_leaf_nodes=5,random_state=42)
final_layer_gbr = GradientBoostingRegressor(n_estimators=10, max_features=1, max_leaf_nodes=5,random_state=42)
final_layer = StackingRegressor(estimators=[('rf', final_layer_rfr),('gbrt', final_layer_gbr)],
                                final_estimator=RidgeCV())

multi_layer_regressor = StackingRegressor(estimators=[('ridge', RidgeCV()),
                                                      ('lasso', LassoCV(random_state=42)),
                                                      ('knr', KNeighborsRegressor(n_neighbors=20,metric='euclidean'))],
                                          final_estimator=final_layer)

multi_layer_regressor.fit(X_train, y_train)