## Examples: Implementing an ensemble method using `scikit-learn`
Let's implement a basic ensemble method using the `scikit-learn` library. We'll use the provided dataset to predict the `BiodiversityHealthIndex` with both a single model and an ensemble, so that the two can be compared. For the ensemble, we'll use a `RandomForestRegressor`, a popular method built on bagging (bootstrap aggregating) of decision trees.

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the dataset
data = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/SDG_15_Life_on_Land_Dataset.csv')

# Define features and target
X = data.drop('BiodiversityHealthIndex', axis=1)
y = data['BiodiversityHealthIndex']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Building individual models

Let's first build a simple decision tree regressor to serve as our single-model baseline for comparison.

In [2]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Initialise and train the decision tree
tree_model = DecisionTreeRegressor(random_state=42, max_depth=3)
tree_model.fit(X_train, y_train)

# Predict and evaluate
tree_predictions = tree_model.predict(X_test)
tree_mse = mean_squared_error(y_test, tree_predictions)
print(f"Decision Tree MSE: {tree_mse}")


Decision Tree MSE: 0.08696874034772016


### Building an ensemble model

Now, let's use the `RandomForestRegressor` as our ensemble method.

In [3]:
from sklearn.ensemble import RandomForestRegressor

# Initialise and train the random forest
forest_model = RandomForestRegressor(n_estimators=100, random_state=42, max_depth=3)
forest_model.fit(X_train, y_train)

# Predict and evaluate
forest_predictions = forest_model.predict(X_test)
forest_mse = mean_squared_error(y_test, forest_predictions)
print(f"Random Forest MSE: {forest_mse}")

Random Forest MSE: 0.0858280816891359
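
To put the two printed scores side by side, the short calculation below computes the relative reduction in MSE and the corresponding root mean squared errors (RMSE), which are expressed in the same units as `BiodiversityHealthIndex`. The two values are copied from the outputs above; the variable names are illustrative only.

```python
import math

# MSE values copied from the printed outputs above
tree_mse_reported = 0.08696874034772016
forest_mse_reported = 0.0858280816891359

# Relative reduction in MSE achieved by the random forest
relative_reduction = (tree_mse_reported - forest_mse_reported) / tree_mse_reported
print(f"Relative MSE reduction: {relative_reduction:.2%}")

# RMSE is in the units of the target variable, which makes it easier to interpret
print(f"Decision Tree RMSE: {math.sqrt(tree_mse_reported):.4f}")
print(f"Random Forest RMSE: {math.sqrt(forest_mse_reported):.4f}")
```

On these numbers the reduction is roughly 1.3%, so the ensemble helps on this split, but only slightly.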


## Conclusion
Comparing the mean squared error (MSE) of the two models shows the effect of the ensemble: the random forest scores about 0.0858 against about 0.0870 for the single decision tree, a small improvement on this split. We set `max_depth=3` in both models so that the trees are equally constrained and any difference comes from ensembling rather than from tree complexity. A random forest typically outperforms a single decision tree because averaging many trees, each trained on a different bootstrap sample, reduces the variance of the predictions and with it the tendency to overfit. Limiting `max_depth` also guards against overfitting, since unconstrained trees can grow complex enough to fit noise in the training data. With trees this shallow, each one already has fairly low variance, which may be why the gain here is modest.
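
The variance-reduction idea behind bagging can be sketched by hand: train several trees, each on a bootstrap sample of the training rows, and average their predictions. The example below is a minimal illustration under stated assumptions, not how `RandomForestRegressor` is implemented internally. It uses synthetic data from `make_regression` as a stand-in for the biodiversity dataset, and the names `X_demo`, `y_demo`, and `n_trees` are hypothetical. A random forest can additionally consider only a random subset of features at each split, which this sketch leaves out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the biodiversity dataset)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_trees = 50  # hypothetical ensemble size for illustration
all_predictions = []

for _ in range(n_trees):
    # Bootstrap: draw training rows with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_tr[idx], y_tr[idx])
    all_predictions.append(tree.predict(X_te))

# Aggregate: average the predictions of all trees
bagged_predictions = np.mean(all_predictions, axis=0)

# Single tree trained on the full training set, for comparison
single_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)
single_mse = mean_squared_error(y_te, single_tree.predict(X_te))
bagged_mse = mean_squared_error(y_te, bagged_predictions)

print(f"Single tree MSE: {single_mse:.2f}")
print(f"Hand-rolled bagging MSE: {bagged_mse:.2f}")
```

Because each tree sees a slightly different sample, their individual errors partly cancel when averaged. How large the gain is depends on the data and on how much the individual trees vary.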