# **Workshop VIII** <br/> *Regression Tree, Random Forest, Bagging and Boosting*

This notebook provides a practical overview of regression trees, random forests, bagging, and boosting. After this workshop, you should know:
* how to apply regression trees, random forests, or a bagging / boosting approach
* when to apply regression trees, random forests, or a bagging / boosting approach

In [None]:
# Imports
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.datasets import load_diabetes

## 1. Data Preprocessing and Exploration

### a. Load your dataset

In [None]:
diabetes = load_diabetes()
df = pd.DataFrame(
    data=np.c_[diabetes['data'], diabetes['target']],
    columns=diabetes['feature_names'] + ['target'],
)

df.head()

### b. [OPTIONAL] Preprocess the data

Depending on your dataset, you can apply different preprocessing techniques, such as dropping NaN values, creating dummy variables, or standardizing your data.
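
As a rough illustration of the three techniques named above, here is a minimal sketch on a small made-up DataFrame. The column names and values are hypothetical and are not part of the diabetes dataset, which is already numeric and has no missing values.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, used only to illustrate the three steps
toy = pd.DataFrame({
    "age": [25.0, 32.0, None, 41.0],
    "city": ["Delft", "Leiden", "Delft", "Leiden"],
    "income": [30.0, 45.0, 38.0, 52.0],
})

# 1. Drop rows that contain NaN values
toy_clean = toy.dropna()

# 2. Create dummy (one-hot) variables for the categorical column
toy_dummies = pd.get_dummies(toy_clean, columns=["city"])

# 3. Standardize the numeric columns to zero mean and unit variance
numeric_cols = ["age", "income"]
toy_dummies[numeric_cols] = StandardScaler().fit_transform(toy_dummies[numeric_cols])

print(toy_dummies)
```

Tree-based models are generally insensitive to feature scaling, so standardization matters mostly for the other estimators you may try.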

In [None]:
# TODO: Preprocess and clean your data (if necessary)

### c. Split the dataset into features (independent variables) and target variable (dependent variable) 

In [None]:
X = ... # TODO: Take features
y = ... # TODO: Take target variable

print(f"X shape: {X.shape}, y shape: {y.shape}")

### d. Split the dataset into training and testing sets

In [None]:
test_size = ... # TODO: Define the percentage of the test size after train-test split
random_state = ... # TODO: Define random seed for reproducibility

X_train, X_test, y_train, y_test = ... # TODO: Use train_test_split method
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
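
If the call signature of `train_test_split` is unfamiliar, the sketch below shows it on synthetic data. The array names, the 20% test size, and the seed are illustrative assumptions, not the values you must use in the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 3 features each
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Hold out 20% of the rows for testing; a fixed random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (80, 3) (20, 3)
```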

## 2. Regression Tree

In this exercise, we will use sklearn.tree.DecisionTreeRegressor.

For more information about the hyperparameters this class accepts, see: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
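
Before working on the diabetes data, it may help to see the fit / predict / evaluate cycle on a toy problem. The sketch below uses a synthetic noisy sine curve; the variable names and the choice `max_depth=3` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical 1D data: a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

# max_depth limits how far the tree can grow; a depth-3 tree has at most 8 leaves
tree_demo = DecisionTreeRegressor(max_depth=3, random_state=0)
tree_demo.fit(X_demo[:150], y_demo[:150])

# Predict on the held-out rows and measure the error
y_hat = tree_demo.predict(X_demo[150:])
print("Test MSE:", mean_squared_error(y_demo[150:], y_hat))
print("Distinct predicted values:", len(np.unique(y_hat)))
```

Because every leaf predicts a single constant, the number of distinct predictions can never exceed the number of leaves. This is why a regression tree produces a piecewise-constant fit.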

### Define and train the model

In [None]:
# TODO: Define a regression tree object using DecisionTreeRegressor
regression_tree = ... # you can experiment with any hyperparameters

# TODO: Fit the model


### Make predictions and evaluate the model

In [None]:
# TODO: Make predictions
y_pred_cart = ...

In [None]:
# TODO: Evaluate the model using MSE
mse = ...
print(f'Mean Squared Error: {mse}')

### Examples of visualization techniques for qualitative interpretations and analysis

In [None]:
# Visualize the regression tree using plot_tree
plt.figure(figsize=(15, 10))
plot_tree(regression_tree, feature_names=list(X.columns), filled=True, rounded=True, proportion=True, precision=2)
plt.title("Regression Tree Visualization")
plt.show()

In [None]:
# Plotting predictions vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_cart, color='blue', label='Predictions vs Actual')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2, label='Ideal Line')
plt.title('Predictions vs Actual Values')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.legend()
plt.show()

## 3. Random Forest

In this exercise, we will use the class sklearn.ensemble.RandomForestRegressor.

For more information about the hyperparameters this class accepts, see: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
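
A random forest regressor averages the predictions of its individual trees. The sketch below checks this on synthetic data; the dataset, the names, and `n_estimators=20` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical regression problem
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest_demo = RandomForestRegressor(n_estimators=20, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Collect the prediction of every fitted tree for the first five rows
per_tree = np.array([tree.predict(X_demo[:5]) for tree in forest_demo.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should match the average over its trees
forest_pred = forest_demo.predict(X_demo[:5])
print(np.allclose(manual_average, forest_pred))
```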

### Define and train the model

In [None]:
# TODO: Define a random forest object using RandomForestRegressor
rf_model = ... # you can experiment with any hyperparameters

# TODO: Fit the model


### Make predictions and evaluate the model

In [None]:
# TODO: Make predictions
y_pred_rf = ...

In [None]:
# TODO: Evaluate the model using MSE
mse = ...
print(f'Mean Squared Error: {mse}')

## 4. Understand the parameters of random forest

### Feature importances

In [None]:
# Get feature importances from the random forest model
feature_importances = rf_model.feature_importances_

# Get the names of features
feature_names = list(X.columns)

# Sort features based on importance
sorted_idx = feature_importances.argsort()

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(range(len(sorted_idx)), feature_importances[sorted_idx], align="center")
plt.yticks(range(len(sorted_idx)), [feature_names[i] for i in sorted_idx])
plt.xlabel("Feature Importance")
plt.title("Random Forest Feature Importance")
plt.show()

What does this figure mean? Which feature is the most important?
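
To build intuition for reading such a plot, the sketch below fits a forest on synthetic data where only the first two of five features influence the target. The dataset and names are assumptions for illustration. The importances in `feature_importances_` are impurity-based and normalized so that they sum to 1.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# With shuffle=False, the informative features are the first n_informative columns
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, n_informative=2,
    noise=5.0, shuffle=False, random_state=0,
)

forest_demo = RandomForestRegressor(n_estimators=50, random_state=0)
forest_demo.fit(X_demo, y_demo)

importances = forest_demo.feature_importances_
print("Importances:", np.round(importances, 3))
print("Sum of importances:", importances.sum())
print("Most important feature index:", int(np.argmax(importances)))
```

A larger value means the feature contributed more to reducing the squared error across the splits of the trees. It does not say anything about the direction of the effect.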

#### Visualization

In [None]:
# Plotting predictions vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_rf, color='green', label='Predictions vs Actual (Random Forest)')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2, label='Ideal Line')
plt.title('Random Forest Predictions vs Actual Values')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.legend()
plt.show()

What happens if we use fewer or more estimators in the random forest? Plot a curve to see how the MSE changes for different values of n_estimators.
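
The general pattern for such a sweep can be sketched on synthetic data as follows. The dataset, the three values of `n_estimators`, and all names here are illustrative assumptions; the exercise cell uses its own lists.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical regression problem with a held-out test set
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

n_values = [1, 10, 100]
mse_values = []
for n in n_values:
    # Train one forest per setting and record its test error
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(X_tr, y_tr)
    mse_values.append(mean_squared_error(y_te, model.predict(X_te)))

for n, mse in zip(n_values, mse_values):
    print(f"n_estimators={n:>3}: test MSE = {mse:.1f}")
```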

In [None]:
nList = [3,5,10,50,100,200,300]
mseList = []
for nEst in nList:
    pass
    # TODO: train a random forest with nEst estimators and append its test MSE to mseList

plt.plot(nList,mseList,"-o")
plt.title('MSE vs. number of estimators in random forest')
plt.show()

In the following part, we cover two fundamental ensemble learning techniques: **Bagging** and **Boosting**.

## 5. Bagging - Training and Evaluating a Model of Your Choice
The class sklearn.ensemble.BaggingRegressor can be used to apply bagging to an estimator of your choice.

Try it with a linear regression model as the base estimator.

For more information about the hyperparameters this class accepts, see: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html
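
A minimal sketch of bagging a linear model on synthetic data is shown below. The dataset, names, and `n_estimators=10` are assumptions. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical regression problem
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Ten linear models, each fitted on a bootstrap sample of the training set
bag_demo = BaggingRegressor(LinearRegression(), n_estimators=10, random_state=0)
bag_demo.fit(X_tr, y_tr)
bag_pred = bag_demo.predict(X_te)

# A single linear model for comparison
single = LinearRegression().fit(X_tr, y_tr)

print("Bagged MSE:", mean_squared_error(y_te, bag_pred))
print("Single MSE:", mean_squared_error(y_te, single.predict(X_te)))
```

Bagging mainly reduces variance, so the gain over a single model is usually small for a stable estimator such as linear regression and larger for unstable ones such as deep trees.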

In [None]:
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression

# TODO: Defining and fitting the model
bagging_model = ...
... # for fitting the model

# TODO: Make predictions
y_pred_bagging = ...

# TODO: Evaluate the model using MSE
mse_bagging = ...
print(f'Mean Squared Error (Bagging): {mse_bagging}')

In [None]:
# Plotting predictions vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_bagging, color='green', label='Predictions vs Actual (Bagging)')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2, label='Ideal Line')
plt.title('Predictions vs Actual Values')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.legend()
plt.show()

## 6. Boosting

#### AdaBoost Regressor

In [None]:
from sklearn.ensemble import AdaBoostRegressor

# TODO: Defining and fitting the model
adaboost_model = ...
... # for fitting the model

# TODO: Make predictions
y_pred_adaboost = adaboost_model.predict(X_test)

# TODO: Evaluate the model
mse_adaboost = ...
print(f'Mean Squared Error (AdaBoost): {mse_adaboost}')
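
For orientation, the sketch below fits an AdaBoost regressor on synthetic data. The dataset, names, and hyperparameter values are illustrative assumptions. By default, scikit-learn boosts decision trees of depth 3, fitting them sequentially and reweighting the training samples according to the current errors.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical regression problem
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# n_estimators is an upper bound on the boosting rounds;
# learning_rate shrinks the contribution of each weak learner
ada_demo = AdaBoostRegressor(n_estimators=50, learning_rate=0.5, random_state=0)
ada_demo.fit(X_tr, y_tr)
ada_pred = ada_demo.predict(X_te)

print("Fitted weak learners:", len(ada_demo.estimators_))
print("Test MSE:", mean_squared_error(y_te, ada_pred))
```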

In [None]:
# Plotting predictions vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_adaboost, color='green', label='Predictions vs Actual (AdaBoost)')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2, label='Ideal Line')
plt.title('Predictions vs Actual Values')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.legend()
plt.show()

#### XGBoost Regressor

You may need to install xgboost first:

$ pip install xgboost

In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Defining and fitting the model
xgb_model = ...
... # for fitting the model

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model
mse_xgb = ...
print(f'Mean Squared Error (XGBoost): {mse_xgb}')

In [None]:
# Plotting predictions vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_xgb, color='green', label='Predictions vs Actual (XGBoost)')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2, label='Ideal Line')
plt.title('Predictions vs Actual Values')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.legend()
plt.show()

## 7. [OPTIONAL] Try XGBoost or random forest on a classification task
Do you think the above algorithms will also work for classification tasks? Let's try them on one.

For simplicity, we discretize the target y into two classes.

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

# Reshape y_train and y_test to use with KBinsDiscretizer (it expects 2D arrays)
y_train_reshaped = y_train.to_numpy().reshape(-1, 1)
y_test_reshaped = y_test.to_numpy().reshape(-1, 1)

# Initialize the discretizer
discretizer = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='kmeans')

# Fit the discretizer on the training data only, to avoid leaking test information
discretizer.fit(y_train_reshaped)

# Now, transform y_train and y_test using the fitted discretizer and flatten back to 1D
y_train_discrete = discretizer.transform(y_train_reshaped).astype(int).ravel()
y_test_discrete = discretizer.transform(y_test_reshaped).astype(int).ravel()

Another way is to split a continuous variable at its median:

In [None]:
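def discretize_data_median(y, median):
    # Label values above the median as 1 and all others as 0
    return (y > median).astype(int)

# Usage: compute the median on the training set and apply it to both sets
median = y_train.median()
y_train_discrete = discretize_data_median(y_train, median)
y_test_discrete = discretize_data_median(y_test, median)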

Try using XGBoost to predict the classes.

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Define the model
xgbcls_model = ...

# Fit the model
...

# Make predictions
y_pred_xgbcls = xgbcls_model.predict(X_test)

# Evaluate the model
accuracy_xgbcls = ...
print(f'Accuracy (XGBoost): {accuracy_xgbcls}')

Try using a random forest to predict the classes.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Define a Random Forest object using RandomForestClassifier
rfcls_model = ...

# Fit the model
...

# Make predictions
y_pred_rfcls = rfcls_model.predict(X_test)

# Evaluate the model
accuracy_rfcls = ...
print(f'Accuracy: {accuracy_rfcls}')
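
The classification workflow mirrors the regression one, with accuracy replacing MSE. The sketch below uses synthetic data with a median split of the target, and `accuracy_score` from `sklearn.metrics`. All names and values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical regression problem, turned into two classes by a median split
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# The threshold comes from the training targets only
threshold = np.median(y_tr)
y_tr_cls = (y_tr > threshold).astype(int)
y_te_cls = (y_te > threshold).astype(int)

clf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
clf_demo.fit(X_tr, y_tr_cls)
cls_pred = clf_demo.predict(X_te)

# Accuracy is the fraction of test samples whose class is predicted correctly
acc_demo = accuracy_score(y_te_cls, cls_pred)
print(f"Accuracy: {acc_demo:.3f}")
```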

## 8. Interpretation of your results

Use this space to analyse the performance of your trained models throughout this workshop. Some examples of ideas include:
* discussing the influence of some hyperparameters (e.g., max_depth for Decision Tree, n_estimators for Random Forest, etc.)
* comparing the results you obtained using plots or tables
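
One possible way to organize such a comparison is a small results table. The sketch below varies `max_depth` for a regression tree on synthetic data and records train and test MSE. The dataset, depths, and column names are assumptions, and the numbers it prints say nothing about the diabetes data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical regression problem
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rows = []
for depth in [1, 3, 6, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    rows.append({
        "max_depth": str(depth),
        "train_mse": mean_squared_error(y_tr, tree.predict(X_tr)),
        "test_mse": mean_squared_error(y_te, tree.predict(X_te)),
    })

results = pd.DataFrame(rows)
print(results)
```

A training error that keeps falling while the test error stalls or rises is the usual sign of overfitting, which is worth mentioning in your interpretation.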


_Write your answer here_