# Exercise 02 Ensemble Learning II

## Pedagogy

This notebook contains both theoretical explanations and executable code cells.

When you see the <span style="color:red">**[TBC]**</span> (To Be Completed) sign, it means that you need to perform an action besides executing the existing code cells. These actions can be:
- Complete the code with proper comments
- Respond to a question
- Write an analysis
- etc.

### Import libraries

In [None]:
# import all libraries used in this notebook here
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import classification_report, f1_score, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [None]:
# suppress all warnings
warnings.filterwarnings("ignore")

## Part 1. AdaBoost Classifier

In this part, we will use the AdaBoost algorithm to build a classifier on a toy dataset.

We will execute the following steps:

- Load and explore dataset
- Train test split
- Build an AdaBoost classifier with default hyper-parameters
- Evaluate the classifier using the test dataset
- Obtain the byproduct feature importance
- Test the effects of hyper-parameters on performance

### 1.1 Load dataset

We will use a toy dataset, the [wine recognition dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-dataset), provided by `scikit-learn`.

There are 13 feature variables in the dataset, which are the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators.

There is one target variable, with 3 unique categories, representing the 3 different cultivators.

There are 178 instances in the dataset.

We will use this dataset to build a multi-class classifier with AdaBoost.

Use `sklearn.datasets.load_wine()` to get this dataset.

In [None]:
# load dataset
feature_df, target_df = datasets.load_wine(
    return_X_y = True, # If True, returns (data.data, data.target) instead of a Bunch object.
    as_frame = True # If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric).
)

In [None]:
# display the first five rows of the features
feature_df.head()

In [None]:
# display the unique values of the target variable
target_df.unique()

### 1.2 Train test split

We will split the whole dataset into two parts: the training and test dataset.
- 70% for training
- 30% for test

Use `sklearn.model_selection.train_test_split()` to do this.

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(
    feature_df.values, # call `.values` to convert the feature from pd.DataFrame to np.array
    target_df.values, # call `.values` to convert the target from pd.Series to np.array
    train_size = 0.7, # 70% for training, 30% for test
    random_state = 0 # controls the shuffling, set to zero for reproducibility
)

### 1.3 Build a classifier using AdaBoost

Build a multi-class classifier using AdaBoost with default hyper-parameters.

- `estimator = DecisionTreeClassifier(max_depth = 1)`
- `n_estimators = 50`
- `learning_rate = 1.0`

In [None]:
# create the AdaBoost classifier with default hyper-parameters
clf = AdaBoostClassifier(
    random_state = 0 # set random state to 0 for reproducibility
)

In [None]:
# fit the model to the training dataset
clf.fit(X_train, y_train)

### 1.4 Evaluation using the test dataset

Evaluate the performance of the classifier using the test dataset.

Use `sklearn.metrics.classification_report()` to get the evaluation metrics.

In [None]:
# predict categories for test dataset
y_pred = clf.predict(X_test)

In [None]:
# obtain classification metrics using `classification_report`
print(classification_report(y_test, y_pred))

### 1.6 Obtain feature importances

Feature importance is a by-product of the AdaBoost algorithm when the weak learner is tree-based.

The feature importance is the weighted average of feature importance across all trees in the ensemble.

The weights here refer to the weights of different learners, which are the same as the weights to aggregate the base predictions into the final prediction.
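The weighted average described above can be checked by hand. The following is a minimal sketch, assuming a scikit-learn version in which `AdaBoostClassifier` exposes the fitted trees as `estimators_` and their weights as `estimator_weights_`. The names `ada` and `manual_importances` are illustrative, and for brevity the model is refitted on the full wine dataset rather than on the training split.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import AdaBoostClassifier

# illustrative refit on the full toy dataset (assumption: not the train split)
X, y = load_wine(return_X_y=True)
ada = AdaBoostClassifier(random_state=0).fit(X, y)

# weighted sum of the per-tree importances, using the learner weights
weighted_sum = sum(
    weight * tree.feature_importances_
    for weight, tree in zip(ada.estimator_weights_, ada.estimators_)
)

# divide by the total weight to obtain the weighted average
manual_importances = weighted_sum / ada.estimator_weights_.sum()

# compare with the attribute reported by the ensemble
print(np.allclose(manual_importances, ada.feature_importances_))
```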

In [None]:
# obtain feature importances
feature_importances = pd.Series(
    data = clf.feature_importances_,
    index = feature_df.columns
)
feature_importances

Unlike random forests, where most features have an importance greater than 0, the feature importances from AdaBoost contain many zeros.

This is because the default weak learner in AdaBoost is a decision stump (a decision tree with depth = 1).

Each weak learner therefore uses only one feature, and many features are never used.

So AdaBoost can achieve good performance with only a limited number of features. When only a limited number of features is available, consider using AdaBoost.
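To make the link between decision stumps and zero importances concrete, the sketch below collects the feature tested at the root of every stump. It is an illustration under assumptions: it relies on the low-level `tree_.feature` array of scikit-learn trees (negative values mark leaves), refits on the full wine dataset, and uses illustrative names such as `stump_ada` and `root_features`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import AdaBoostClassifier

wine = load_wine()
stump_ada = AdaBoostClassifier(random_state=0).fit(wine.data, wine.target)

# tree_.feature[0] is the feature index tested at the root node;
# a stump has no other split, and a negative value means "no split at all"
root_features = {
    int(tree.tree_.feature[0])
    for tree in stump_ada.estimators_
    if tree.tree_.feature[0] >= 0
}

used_names = [wine.feature_names[i] for i in sorted(root_features)]
unused_idx = [i for i in range(wine.data.shape[1]) if i not in root_features]

print("number of stumps:", len(stump_ada.estimators_))
print("features used by at least one stump:", used_names)
# features never chosen by any stump get an importance of exactly zero
print("importances of unused features:", stump_ada.feature_importances_[unused_idx])
```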

### 1.7 Test with other hyper-parameter values

Three key hyper-parameters (`n_estimators`, `learning_rate`, and the complexity of the base estimator) affect the performance of the ensemble.

Let's test these effects by varying one hyper-parameter and keeping the rest unchanged.

In [None]:
# test the effect of n_estimators on performance
n_estimators = [2, 5, 10, 25, 50, 75, 100]
f1_weighted = []
for item in n_estimators:
    clf = AdaBoostClassifier(
        n_estimators = item,
        random_state = 0
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1_weighted.append(f1_score(y_test, y_pred, average = 'weighted'))

plt.figure()
plt.plot(n_estimators, f1_weighted, '.-')
plt.xlabel('n_estimators')
plt.ylabel('f1_weighted')
plt.show()

- If `n_estimators` is too low
    - There won't be enough weak learners in the ensemble to correct the high bias.
    - The ensemble might be too simple and underfit the data.
- If `n_estimators` is too high
    - Increasing `n_estimators` has a diminishing effect on performance.
    - Too many weak learners might lead to over-fitting, which degrades the performance.
    - More estimators increase the training time and require more computational resources.
- We can select an optimal value of `n_estimators` through hyper-parameter tuning
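As a possible shortcut for exploring `n_estimators`, a single large ensemble can be fitted once and then evaluated after each boosting round with `staged_predict()`. This is a sketch rather than a replacement for the loop above: the names `big_ada` and `staged_f1` are illustrative, and the staged scores should trace a similar curve because boosting adds learners sequentially.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

# fit the largest ensemble of interest only once
big_ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# staged_predict yields the ensemble prediction after each boosting round
staged_f1 = [
    f1_score(y_te, staged_pred, average="weighted")
    for staged_pred in big_ada.staged_predict(X_te)
]

# report the score at a few ensemble sizes (if boosting ran that long)
for size in (2, 5, 10, 25, 50, 75, 100):
    if size <= len(staged_f1):
        print(size, round(staged_f1[size - 1], 3))
```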

In [None]:
# test the effect of learning_rate on performance
learning_rate = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
f1_weighted = []
for item in learning_rate:
    clf = AdaBoostClassifier(
        learning_rate = item,
        random_state = 0
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1_weighted.append(f1_score(y_test, y_pred, average = 'weighted'))

plt.figure()
plt.plot(learning_rate, f1_weighted, '.-')
plt.xlabel('learning_rate')
plt.ylabel('f1_weighted')
plt.show()

- If `learning_rate` is too low
    - A low `learning_rate` makes the ensemble more conservative, requiring more estimators to achieve good performance.
    - In general, a low `learning_rate` with enough weak learners can lead to a robust ensemble with good performance.
- If `learning_rate` is too high
    - A high `learning_rate` allows each weak learner to contribute more to the final ensemble.
    - This can speed up the training process and reduce the number of required weak learners.
    - But it also increases the risk of over-fitting, so the ensemble might not reach the best performance.
- We can make the trade-off between `learning_rate` and `n_estimators` through hyper-parameter tuning

In [None]:
# test the effect of the complexity of base estimators on performance
max_depth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
f1_weighted = []
for item in max_depth:
    clf = AdaBoostClassifier(
        estimator = DecisionTreeClassifier(max_depth = item),
        random_state = 0
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1_weighted.append(f1_score(y_test, y_pred, average = 'weighted'))

plt.figure()
plt.plot(max_depth, f1_weighted, '.-')
plt.xlabel('max_depth')
plt.ylabel('f1_weighted')
plt.show()

- If `max_depth` of the base tree is too low
    - The base tree might be too simple, and the resulting ensemble is also too simple for complex problems
- If `max_depth` of the base tree is too high
    - The base tree is too complex with low bias and high variance
    - Such a tree cannot meet the prerequisite of the Boosting method, which relies on weak learners
    - The ensemble might be over-fitted and the performance can be degraded
- We should select a low `max_depth` to keep the base tree shallow

## Part 2. Gradient Boosting Classifier

In this part, we will use the Gradient Boosting algorithm to solve the same problem.

We will execute the following steps:

- Build a Gradient Boosting classifier with default hyper-parameters
    - Implement early stopping or not
- Obtain the byproduct feature importance
- Test the effects of hyper-parameters on performance
- Hyper-parameter tuning through cross-validation

### 2.1 Build and evaluate a classifier using Gradient Boosting

Build and evaluate a multi-class classifier using Gradient Boosting with default hyper-parameters.

- `learning_rate = 0.1`
- `n_estimators = 100`
- `max_depth = 3`

In [None]:
# create the Gradient Boosting classifier
clf = GradientBoostingClassifier(
    n_iter_no_change = 5, # enable early stopping; set to None to disable early stopping
    random_state = 0, # set random state to 0 for reproducibility
    verbose = 3 # set the verbose level for printing progress and performance
)

In [None]:
# fit the model to the training dataset
clf.fit(X_train, y_train)

In [None]:
# predict categories for test dataset
y_pred = clf.predict(X_test)

In [None]:
# obtain classification metrics using `classification_report`
print(classification_report(y_test, y_pred))

### 2.2 Obtain feature importances

As `GradientBoostingClassifier()` is also tree-based, we can obtain the feature importance as a by-product.

In [None]:
# obtain feature importances
feature_importances = pd.Series(
    data = clf.feature_importances_,
    index = feature_df.columns
)
feature_importances

We can see there is no zero value in the feature importance.

This is because the default depth of the base tree in `GradientBoostingClassifier()` is 3, so more features are involved than with decision stumps.

### 2.3 Test with other hyper-parameter values

Three key hyper-parameters (`n_estimators`, `learning_rate`, and `max_depth`) affect the performance of the ensemble.

Let's test these effects by varying one hyper-parameter and keeping the rest unchanged.

In [None]:
# test the effect of n_estimators on performance
n_estimators = [2, 5, 10, 25, 50, 75, 100]
f1_weighted = []
for item in n_estimators:
    clf = GradientBoostingClassifier(
        n_estimators = item,
        random_state = 0
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1_weighted.append(f1_score(y_test, y_pred, average = 'weighted'))

plt.figure()
plt.plot(n_estimators, f1_weighted, '.-')
plt.xlabel('n_estimators')
plt.ylabel('f1_weighted')
plt.show()

- If `n_estimators` is too low
    - There won't be enough weak learners in the ensemble to correct the remaining residuals.
    - The ensemble might be too simple and underfit the data.
- If `n_estimators` is too high
    - Too many weak learners might lead to over-fitting, which degrades the performance
    - We can use early stopping to prevent the ensemble from over-fitting
- We can select an optimal value of `n_estimators` through hyper-parameter tuning
- Or we can set `n_estimators` to a large value and adopt early stopping at the same time
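The last option, a large `n_estimators` combined with early stopping, can be sketched as follows. The values 500, 5, and 0.1 are illustrative assumptions, not tuned settings. With `n_iter_no_change` set, scikit-learn holds out `validation_fraction` of the training data and stops once the validation score no longer improves; the number of boosting rounds actually used is stored in `n_estimators_`.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

gb_early = GradientBoostingClassifier(
    n_estimators=500,         # generous upper bound on the number of rounds
    n_iter_no_change=5,       # stop after 5 rounds without validation improvement
    validation_fraction=0.1,  # share of the training data held out for validation
    random_state=0,
)
gb_early.fit(X_tr, y_tr)

print("upper bound:", gb_early.n_estimators)
print("rounds actually used:", gb_early.n_estimators_)
print("test accuracy:", gb_early.score(X_te, y_te))
```

Note that the held-out validation samples are not used for fitting the trees, which matters on a dataset as small as this one.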

In [None]:
# test the effect of learning_rate on performance
learning_rate = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
f1_weighted = []
for item in learning_rate:
    clf = GradientBoostingClassifier(
        learning_rate = item,
        random_state = 0
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1_weighted.append(f1_score(y_test, y_pred, average = 'weighted'))

plt.figure()
plt.plot(learning_rate, f1_weighted, '.-')
plt.xlabel('learning_rate')
plt.ylabel('f1_weighted')
plt.show()

- If `learning_rate` is too low
    - A low `learning_rate` makes the ensemble more conservative, requiring more estimators to achieve good performance.
    - In general, a low `learning_rate` with enough weak learners can lead to a robust ensemble with good performance.
- If `learning_rate` is too high
    - A high `learning_rate` allows each weak learner to contribute more to the final ensemble.
    - This can speed up the training process and reduce the number of required weak learners.
    - But it also increases the risk of over-fitting, so the ensemble might not reach the best performance.
- We can find an optimal `learning_rate` through hyper-parameter tuning
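The claim that a low `learning_rate` needs more estimators can be illustrated with the `train_score_` attribute, which records the training loss after each boosting round (when `subsample` is left at 1.0). The sketch below compares two illustrative learning rates, 0.05 and 0.5; it only shows how fast the training loss drops and says nothing about test performance. The names `slow_gb` and `fast_gb` are assumptions made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

# same number of rounds, different step sizes
slow_gb = GradientBoostingClassifier(
    learning_rate=0.05, n_estimators=100, random_state=0
).fit(X_tr, y_tr)
fast_gb = GradientBoostingClassifier(
    learning_rate=0.5, n_estimators=100, random_state=0
).fit(X_tr, y_tr)

# train_score_[i] is the training loss after boosting round i + 1
for stage in (1, 10, 50, 100):
    print(
        f"round {stage:3d}: "
        f"loss(lr=0.05) = {slow_gb.train_score_[stage - 1]:.4f}, "
        f"loss(lr=0.5) = {fast_gb.train_score_[stage - 1]:.4f}"
    )
```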

In [None]:
# test the effect of the complexity of base estimators on performance
max_depth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
f1_weighted = []
for item in max_depth:
    clf = GradientBoostingClassifier(
        max_depth = item,
        random_state = 0
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1_weighted.append(f1_score(y_test, y_pred, average = 'weighted'))

plt.figure()
plt.plot(max_depth, f1_weighted, '.-')
plt.xlabel('max_depth')
plt.ylabel('f1_weighted')
plt.show()

- If `max_depth` of the base tree is too low
    - The base tree might be too simple, and the resulting ensemble is also too simple for complex problems
- If `max_depth` of the base tree is too high
    - The base tree is too complex, and the resulting ensemble is also too complex and over-fitted
- Gradient Boosting doesn't require the base tree to be shallow
- We need to select a proper `max_depth` according to the complexity of the problem

### 2.4 Hyper-parameter tuning through cross-validation

Considering the interaction between `n_estimators`, `learning_rate`, `max_depth`, and whether to adopt early stopping or not, we can use hyper-parameter tuning to find the best combination of the hyper-parameter values.

In [None]:
# define the hyper-parameters to search
param_dict = {
    'learning_rate': [1e-2, 1e-1, 1.0, 1e1, 1e2],
    'n_estimators': [10, 50, 100, 200, 500],
    'max_depth': [1, 3, 5, 7, 9, None],
    'n_iter_no_change': [None, 1, 5, 10]
}

In [None]:
# hyper-parameter tuning through cross-validation
grid_clf = GridSearchCV(
    estimator = GradientBoostingClassifier(random_state = 0),
    param_grid = param_dict,
    scoring = 'f1_weighted',
    refit = True,
    cv = 5,
    verbose = 1,
    n_jobs = -1
)
grid_clf.fit(X_train, y_train)

In [None]:
# obtain the best hyper-parameters and the best score
print('Best hyper-parameters:', grid_clf.best_params_)
print('Best score:', grid_clf.best_score_)
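Beyond the single best combination, the full cross-validation results can be inspected through the `cv_results_` attribute. The sketch below uses a deliberately small, illustrative grid (`small_grid`) so that it runs quickly; the same inspection should work on any fitted `GridSearchCV` object.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

# small illustrative grid (assumption: chosen only to keep the run short)
small_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [25, 50],
    "max_depth": [1, 3],
}
search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid=small_grid,
    scoring="f1_weighted",
    cv=5,
)
search.fit(X_tr, y_tr)

# one row per hyper-parameter combination, best combinations first
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "std_test_score"]].head())
```

Looking at `std_test_score` next to `mean_test_score` helps to judge whether the best combination is clearly better than its neighbours or only marginally so.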

In [None]:
# predict categories for test dataset
y_pred = grid_clf.predict(X_test)

In [None]:
# obtain classification metrics using `classification_report`
print(classification_report(y_test, y_pred))

## Part 3. Stacking Classifier

In this part, we will use the Stacking algorithm to solve the same problem.

We will execute the following steps:

- Declare the base estimators
- Train the stacking classifier
- Evaluate the stacking classifier

### 3.1 Declare base estimators

Note that the base estimators can be built using different algorithms, or the same algorithm with different hyper-parameters.

Some algorithms require specific data pre-processing steps, such as feature scaling for KNN and SVM. For these algorithms, embed the pre-processing steps together with the estimator in a pipeline.
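A short sketch of why the pipeline matters: when the scaler is a step of the pipeline, it is fitted only on the data passed to `fit()`, so the statistics used for scaling come from the training data and are reused unchanged at prediction time. The names `knn_pipe` and `fitted_scaler` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

knn_pipe = Pipeline([
    ("scaler", StandardScaler()),    # fitted on the training data only
    ("knn", KNeighborsClassifier()),
])
knn_pipe.fit(X_tr, y_tr)

# the scaler inside the pipeline stores the training means
fitted_scaler = knn_pipe.named_steps["scaler"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))

# predict() scales the test data with the training statistics automatically
print("test accuracy:", knn_pipe.score(X_te, y_te))
```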

In [None]:
# declare a list of base estimators to be stacked together
estimators = [
    ('decision tree', DecisionTreeClassifier(
        max_depth = 5,
        random_state = 0
    )),
    ('KNN', Pipeline([
        ('standard scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())
    ])),
    ('SVC', Pipeline([
        ('standard scaler', StandardScaler()),
        ('svc', SVC())
    ]))
]

### 3.2 Build the stacked classifier

Here, we use the default algorithm (logistic regression) as the meta learner.

In [None]:
# create the stacked classifier with logistic regression as the final estimator
clf = StackingClassifier(
    estimators = estimators,
    final_estimator = LogisticRegression(),
    n_jobs = -1,
    verbose = 10
)

In [None]:
# fit the model to the training dataset
clf.fit(X_train, y_train)
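To see what the meta learner actually receives, the fitted stacking classifier can be asked for its meta-features with `transform()`. The sketch below rebuilds a similar stack under illustrative names (`stack`, `meta_features`) and with `max_iter=1000` for the logistic regression, which is an assumption added here to reduce convergence warnings. With three classes and three base estimators, each base estimator contributes one column per class.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("knn", Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])),
        ("svc", Pipeline([("scaler", StandardScaler()), ("svc", SVC())])),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # assumption: raised from the default
)
stack.fit(X_tr, y_tr)

# base-estimator outputs that are fed to the meta learner
meta_features = stack.transform(X_te)

print("method used per base estimator:", stack.stack_method_)
print("meta-feature matrix shape:", meta_features.shape)
print("meta learner coefficient shape:", stack.final_estimator_.coef_.shape)
```

During `fit()`, the meta learner is trained on cross-validated predictions of the base estimators, which limits the leakage that would occur if it saw predictions made on the same samples the base estimators were trained on.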

### 3.3 Evaluation using the test dataset

In [None]:
# predict categories for test dataset
y_pred = clf.predict(X_test)

In [None]:
# obtain classification metrics using `classification_report`
print(classification_report(y_test, y_pred))

## Part 4. Hands-on exercise

<span style="color:red">**This exercise is an assignment. The submission deadline on Learn is 25/03/2024 23:59.**</span>

In this exercise, you are required to build a regression model using the three ensemble learning methods we've learned today.

The problem to be solved is predicting the price of flights.

Please download the flight price dataset from Learn.

<span style="color:red">**[TBC]**</span> Please complete the following tasks:

- Load and pre-process the dataset
- Build and evaluate a regression model using:
    - AdaBoost
    - Gradient Boosting
    - Stacking

<span style="color:red">**Warning**</span>: Be aware of the size of the dataset. Make sure that:

- The scripts are executable on your device (whether your computer or Google Colab)
- The submitted Jupyter notebook has already been executed and contains all the outputs.

### Task 1. Load and pre-process the dataset

You need to load the dataset and perform necessary pre-processing steps.

<span style="color:red">**[TBC]**</span> Please complete the following tasks:

- Load the dataset
- Encode categorical features
- Train test split
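As a reminder of how `LabelEncoder` behaves, here is a minimal sketch on a made-up table. The column names (`airline`, `stops`, `duration`, `price`) and all values are hypothetical and are not taken from the flight price dataset. `LabelEncoder` assigns integer codes following the sorted order of the categories, so the codes carry no meaningful ranking.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical toy table, not the assignment data
toy_df = pd.DataFrame({
    "airline": ["A", "B", "A", "C"],
    "stops": ["zero", "one", "zero", "two"],
    "duration": [2.0, 5.5, 2.5, 9.0],
    "price": [100.0, 180.0, 110.0, 250.0],
})

# hypothetical list of categorical columns, written out explicitly
categorical_columns = ["airline", "stops"]

# keep one encoder per column so the mapping can be inspected or inverted later
encoders = {}
for column in categorical_columns:
    encoders[column] = LabelEncoder()
    toy_df[column] = encoders[column].fit_transform(toy_df[column])

print(toy_df)
print({column: list(enc.classes_) for column, enc in encoders.items()})
```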

In [None]:
# [TBC] complete your code here with proper comments
# load the dataset
# hint: pandas.read_csv()


In [None]:
# [TBC] complete your code here with proper comments
# Encode categorical features
# hint: sklearn.preprocessing.LabelEncoder()


In [None]:
# [TBC] complete your code here with proper comments
# Train test split
# hint: sklearn.model_selection.train_test_split()
# hint: first divide the encoded dataset into features and target, then perform train test split


### Task 2. AdaBoost Regressor

You need to build and evaluate a regression model using the AdaBoost algorithm.

<span style="color:red">**[TBC]**</span> Please complete the following tasks:

- Hyper-parameter tuning through cross-validation
- Evaluate the performance on test dataset
    - Calculate RMSE and R2 score
    - Visualize the prediction results of the test dataset
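The two regression metrics requested above can be computed as in the following sketch. It runs on synthetic data from `make_regression`, not on the flight price dataset, and uses an untuned `AdaBoostRegressor`, so it only illustrates the metric calls. RMSE is taken as the square root of `mean_squared_error`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# synthetic regression problem (assumption: stands in for the real dataset)
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, train_size=0.7, random_state=0)

# untuned regressor, used only to produce some predictions
reg = AdaBoostRegressor(random_state=0).fit(X_tr, y_tr)
y_hat = reg.predict(X_te)

mse = mean_squared_error(y_te, y_hat)
rmse = np.sqrt(mse)        # same unit as the target variable
r2 = r2_score(y_te, y_hat) # 1.0 would be a perfect fit

print(f"RMSE: {rmse:.3f}")
print(f"R2:   {r2:.3f}")
```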

In [None]:
# [TBC] complete your code here with proper comments
# AdaBoost Regressor


### Task 3. Gradient Boosting Regressor

You need to build and evaluate a regression model using the Gradient Boosting algorithm.

<span style="color:red">**[TBC]**</span> Please complete the following tasks:

- Hyper-parameter tuning through cross-validation
- Evaluate the performance on test dataset
    - Calculate RMSE and R2 score
    - Visualize the prediction results of the test dataset

In [None]:
# [TBC] complete your code here with proper comments
# Gradient Boosting Regressor


### Task 4. Stacking Regressor

You need to build and evaluate a regression model using the Stacking algorithm.

<span style="color:red">**[TBC]**</span> Please complete the following tasks:

- Train a stacking regressor on the training dataset
    - Select a list of base learners
        - Choose proper regression algorithms
        - If an algorithm needs specific pre-processing steps, embed the steps in a pipeline
    - Select the meta-learner
        - Choose a proper regression algorithm
- Evaluate the performance on test dataset
    - Calculate RMSE and R2 score
    - Visualize the prediction results of the test dataset

In [None]:
# [TBC] complete your code here with proper comments
# Stacking Regressor
