# Ensemble Modelling

## Table of Contents

1. [Introduction](#1.-Introduction)
2. [Ensemble Methodology](#2.-Ensemble-Methodology)
3. [Ensemble Modelling](#3.-Ensemble-Modelling)

## 1. Introduction

Ensemble modelling is a machine learning technique that combines multiple individual models to improve predictive performance. By aggregating the strengths of diverse models, ensembles often achieve better accuracy and robustness than any single model alone. 

In the context of species distribution modelling (SDM), ensemble approaches integrate predictions from various statistical techniques to enhance the reliability of forecasts. This method accounts for uncertainties inherent in individual models, leading to more robust predictions.

### **Common Ensemble Methods**

1. **Bagging (Bootstrap Aggregating)**: This technique involves training multiple models on different subsets of the data, created through random sampling with replacement. The final prediction is typically an average (for regression) or majority vote (for classification) of the individual models' outputs. 

2. **Boosting**: Boosting sequentially trains models, each focusing on correcting the errors of its predecessor. Models are weighted based on their performance, and the ensemble combines them to produce a strong predictor. 

3. **Stacking**: In stacking, multiple models are trained to predict the same outcome. Their predictions are then used as inputs for a higher-level model, which learns to combine them optimally.
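
As a rough illustration of bagging, the sketch below trains several nearest-centroid classifiers on bootstrap resamples of a synthetic two-class dataset and combines them by majority vote. The data, the base learner, and the names `fit_centroids`, `predict_centroids`, and `bagged_predict` are illustrative assumptions and are not part of this study's pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic two-class data (illustrative only): two well-separated Gaussian clusters
X = np.vstack([rng.normal(-2.0, 0.5, size=(100, 2)),
               rng.normal(2.0, 0.5, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)


def fit_centroids(X_train, y_train):
    """Base learner: store the mean feature vector of each class."""
    return np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])


def predict_centroids(centroids, X_new):
    """Assign each point to the class with the nearest centroid."""
    distances = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def bagged_predict(X_train, y_train, X_new, n_models=25):
    """Train each base learner on a bootstrap resample and take a majority vote."""
    votes = np.zeros((n_models, len(X_new)), dtype=int)
    for m in range(n_models):
        # Sample row indices with replacement (the bootstrap step)
        idx = rng.integers(0, len(X_train), size=len(X_train))
        centroids = fit_centroids(X_train[idx], y_train[idx])
        votes[m] = predict_centroids(centroids, X_new)
    # Majority vote across the individual models
    return (votes.mean(axis=0) >= 0.5).astype(int)


y_pred = bagged_predict(X, y, X)
accuracy = (y_pred == y).mean()
print(f"Bagged accuracy on the toy data: {accuracy:.2f}")
```

An odd number of base models is used so that the vote cannot tie.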

## 2. Ensemble Methodology

### **2.1 Selection of Models for Ensemble**
Based on the previous model evaluation and comparison, Random Forest (RF) and XGBoost consistently outperformed other models, demonstrating the highest AUC-ROC, precision, recall, and F1-score across all species. MaxEnt showed moderate performance, particularly in recall, but had limitations in precision, suggesting a tendency for overprediction. GLM and GAM performed the worst overall, indicating they may not fully capture the complexity of amphibian distributions.

Thus, this study will prioritise RF and XGBoost as the core models in the ensemble and consider MaxEnt for added diversity while downweighting its influence. GLM and GAM may still be included to add model diversity, but they will not drive the final predictions.

### **2.2 Model Weighting and Aggregation Methods**
To integrate multiple models, this study will explore different ensemble techniques:

#### 1. Averaging Ensemble
- Compute the mean probability of presence across RF, XGBoost, and MaxEnt.
- Weight models according to their precision and recall (e.g., RF and XGBoost given higher weight, MaxEnt downweighted).

#### 2. Majority Voting Ensemble (for binary presence/absence predictions)
- Classify a species as present if at least two out of three models predict presence.

#### 3. Stacked Ensemble (if time allows)
- Train a meta-classifier (e.g., logistic regression) using predictions from individual models as inputs.
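
The two-out-of-three voting rule can be sketched as follows. The probabilities and the shared 0.5 cut-off are hypothetical values chosen for illustration; in practice each model may use its own threshold.

```python
import numpy as np

# Hypothetical presence probabilities for five sites (columns: RF, XGBoost, MaxEnt)
probs = np.array([
    [0.90, 0.80, 0.40],
    [0.20, 0.60, 0.70],
    [0.10, 0.30, 0.80],
    [0.55, 0.45, 0.35],
    [0.05, 0.10, 0.20],
])

threshold = 0.5  # assumed cut-off shared by all models

# Convert each model's probability to a binary presence/absence call
binary_calls = (probs >= threshold).astype(int)

# A site is classified as present when at least two of the three models agree
votes = binary_calls.sum(axis=1)
majority_presence = (votes >= 2).astype(int)

print(majority_presence)  # [1 1 0 0 0]
```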

### **2.3 Calibration and Performance Evaluation**
To ensure the ensemble predictions are robust, the following evaluation metrics will be recalculated:

- AUC-ROC and Precision-Recall curves
- Sensitivity-specificity trade-offs
- Confusion matrix analysis
- Uncertainty quantification (standard deviation in predictions)

The ensemble's performance will be compared to individual models to determine whether it achieves higher predictive accuracy and reliability.
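
A minimal sketch of the confusion-matrix and sensitivity-specificity calculations is given below. The labels, probabilities, and function names are hypothetical; the study's own evaluation code may differ.

```python
import numpy as np


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (TP, FP, TN, FN) for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, tn, fn


def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical test labels and ensemble probabilities
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]

counts = confusion_counts(y_true, y_prob)
sens, spec = sensitivity_specificity(y_true, y_prob)
print(counts, sens, spec)  # (3, 1, 3, 1) 0.75 0.75
```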

### **2.4 Spatial Mapping of Ensemble Predictions**
Once ensemble predictions are finalised, they will be spatially visualised using GIS tools to assess habitat suitability for target amphibian species. Uncertainty maps will also be generated to highlight regions with high model disagreement.
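
One simple way to build such an uncertainty map is to take the per-cell standard deviation across the model rasters. The sketch below uses small hypothetical arrays in place of real rasters and assumes all grids share the same extent and resolution; reading and writing GIS layers is left out.

```python
import numpy as np

# Hypothetical 2 x 3 suitability grids from three models (same extent and resolution)
rf_grid = np.array([[0.9, 0.2, 0.5], [0.1, 0.8, 0.6]])
xgb_grid = np.array([[0.8, 0.3, 0.5], [0.2, 0.7, 0.1]])
maxent_grid = np.array([[0.7, 0.1, 0.5], [0.3, 0.9, 0.9]])

stack = np.stack([rf_grid, xgb_grid, maxent_grid])  # shape: (models, rows, cols)

ensemble_mean = stack.mean(axis=0)  # consensus suitability per cell
uncertainty = stack.std(axis=0)     # between-model standard deviation per cell

# Locate the cell where the models disagree most
row, col = np.unravel_index(uncertainty.argmax(), uncertainty.shape)
print(f"Highest disagreement at cell ({row}, {col})")
```

Cells where all models return the same value have zero standard deviation, while cells with widely differing predictions stand out in the uncertainty layer.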

### **2.5 Methodology Rationale**
This study aims to leverage the advantages of ensemble modelling to provide more accurate, reliable, and ecologically meaningful habitat suitability predictions. The rationale for this approach is:
1. Tree-based models (RF and XGBoost) demonstrate strong performance and capture complex species-environment relationships.
2. MaxEnt contributes additional ecological insights and has been widely used in SDMs, but its predictions will be weighted lower to account for overprediction tendencies.
3. Averaging and majority voting improve robustness, ensuring predictions are not overly reliant on any single model.
4. Uncertainty quantification will guide conservation decision-making, particularly for identifying regions where predictions are less certain.

By following this approach, the ensemble model will integrate the strengths of individual models, enhance predictive reliability, and contribute valuable insights for amphibian conservation and Blue-Green Infrastructure planning.

## 3. Ensemble Modelling
### 3.1 Load Model Predictions and Prepare for Ensemble

In [2]:
import os
import pandas as pd
import numpy as np

# Define base directory
base_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models"

# Define species list
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Define selected models for ensemble
selected_models = ["RF", "XGBoost", "MaxEnt"]

# Define paths for each model
model_dirs = {
    "RF": os.path.join(base_dir, "RandomForest"),
    "XGBoost": os.path.join(base_dir, "XGBoost"),
    "MaxEnt": os.path.join(base_dir, "Maxent")
}

# Define file patterns for each model
file_patterns = {
    "RF": os.path.join("{species}", "Test_Predictions.csv"),
    "XGBoost": os.path.join("{species}", "Aggregated_Test_Predictions.csv"),
    "MaxEnt": "Maxent_{species}_TestPredictions.csv"
}

# Define output directory for merged predictions
ensemble_output_dir = os.path.join(base_dir, "Ensemble_Predictions")
os.makedirs(ensemble_output_dir, exist_ok=True)

# Iterate over each species
for species in species_list:
    print(f"🔍 Processing ensemble predictions for {species}...")

    merged_df = None  # Initialize dataframe for storing merged predictions

    for model in selected_models:
        formatted_species = species.replace(" ", "_")  # Adjust for file naming
        file_path = os.path.join(model_dirs[model], file_patterns[model].format(species=formatted_species))

        if not os.path.exists(file_path):
            print(f"⚠️ Missing prediction file for {species} - {model}: {file_path}")
            continue  # Skip this model if the file is missing

        # Load model predictions with compact dtypes to limit memory use
        try:
            df = pd.read_csv(file_path, dtype={"True_Label": "int8", "Predicted_Probability": "float32"})
        except Exception as e:
            print(f"⚠️ Error loading {species} - {model}: {e}")
            continue

        # Rename the probability column so each model gets its own column
        df = df.rename(columns={"Predicted_Probability": f"{model}_Probability"})

        # Reduce memory usage
        df[f"{model}_Probability"] = df[f"{model}_Probability"].astype(np.float32)

        # Merge column-wise; this assumes every model scored the same test rows in the same order
        if merged_df is None:
            merged_df = df.copy()
        else:
            merged_df = pd.concat([merged_df, df[f"{model}_Probability"]], axis=1)

        del df  # Free up memory

    # Save the merged predictions
    if merged_df is not None and not merged_df.empty:
        output_file = os.path.join(ensemble_output_dir, f"{species}_Ensemble_Predictions.csv")
        merged_df.to_csv(output_file, index=False)
        print(f"✅ Merged predictions saved: {output_file}")
    else:
        print(f"⚠️ No valid predictions available for {species}.")

    del merged_df  # Free memory after each species

print("\n🚀 Ensemble prediction files ready for next steps!")


🔍 Processing ensemble predictions for Bufo bufo...
✅ Merged predictions saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Ensemble_Predictions\Bufo bufo_Ensemble_Predictions.csv
🔍 Processing ensemble predictions for Rana temporaria...
✅ Merged predictions saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Ensemble_Predictions\Rana temporaria_Ensemble_Predictions.csv
🔍 Processing ensemble predictions for Lissotriton helveticus...
✅ Merged predictions saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Ensemble_Predictions\Lissotriton helveticus_Ensemble_Predictions.csv

🚀 Ensemble prediction files ready for next steps!
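
Because the merge above concatenates columns by row position, it is worth checking that the models' prediction files are aligned. The sketch below shows one possible check using small hypothetical tables; `merge_aligned` is an illustrative helper, and matching labels are a necessary but not sufficient sign that the rows refer to the same test samples.

```python
import pandas as pd

# Hypothetical per-model prediction tables for the same test set
rf = pd.DataFrame({"True_Label": [1, 0, 1, 0],
                   "Predicted_Probability": [0.9, 0.2, 0.7, 0.1]})
xgb = pd.DataFrame({"True_Label": [1, 0, 1, 0],
                    "Predicted_Probability": [0.8, 0.3, 0.6, 0.2]})


def merge_aligned(frames):
    """Concatenate probability columns after checking that labels match row by row."""
    names = list(frames)
    reference = frames[names[0]]["True_Label"].reset_index(drop=True)
    merged = pd.DataFrame({"True_Label": reference})
    for name in names:
        df = frames[name].reset_index(drop=True)
        # Fail loudly rather than silently averaging mismatched rows
        if len(df) != len(reference) or not df["True_Label"].equals(reference):
            raise ValueError(f"{name} predictions are not aligned with {names[0]}")
        merged[f"{name}_Probability"] = df["Predicted_Probability"]
    return merged


merged = merge_aligned({"RF": rf, "XGBoost": xgb})
print(merged)
```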


### 3.2 Averaging Ensemble Predictions

**Why Use Averaging?**
- This approach dampens the influence of any one model's errors and leverages the strengths of multiple models.
- Averaging probabilities smooths extreme values, which often improves generalisation.
- It is typically less prone to overfitting than a single model.
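
A small worked example, using hypothetical probabilities for three sites, shows how the unweighted mean pulls a single extreme prediction towards the consensus:

```python
import pandas as pd

# Hypothetical presence probabilities for three sites
df = pd.DataFrame({
    "RF_Probability": [0.95, 0.10, 0.60],
    "XGBoost_Probability": [0.40, 0.20, 0.55],
    "MaxEnt_Probability": [0.45, 0.90, 0.65],
})

probability_columns = [col for col in df.columns if col.endswith("_Probability")]

# Row-wise mean across the three models
df["Ensemble_Average"] = df[probability_columns].mean(axis=1)
print(df.round(2))
```

At the first site, RF's 0.95 is moderated to an average of about 0.60; at the second, MaxEnt's 0.90 is moderated to about 0.40.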

In [3]:
import os
import pandas as pd
import numpy as np

# Define input and output directories
ensemble_input_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Ensemble_Predictions"
ensemble_output_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_Ensemble"
os.makedirs(ensemble_output_dir, exist_ok=True)

# Define species list
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Iterate through each species
for species in species_list:
    print(f"🔍 Computing averaged ensemble predictions for {species}...")

    # Load ensemble predictions file
    file_path = os.path.join(ensemble_input_dir, f"{species}_Ensemble_Predictions.csv")

    if not os.path.exists(file_path):
        print(f"⚠️ Missing ensemble file for {species}. Skipping.")
        continue

    df = pd.read_csv(file_path)

    # Identify probability columns (excluding True_Label)
    probability_columns = [col for col in df.columns if col.endswith("_Probability")]

    if len(probability_columns) == 0:
        print(f"⚠️ No probability columns found for {species}. Skipping.")
        continue

    # Compute the averaged probability
    df["Ensemble_Average"] = df[probability_columns].mean(axis=1)

    # Save the final ensemble predictions
    output_file = os.path.join(ensemble_output_dir, f"{species}_Final_Ensemble_Predictions.csv")
    df.to_csv(output_file, index=False)
    
    print(f"✅ Saved averaged predictions for {species} at {output_file}")

print("\n🚀 Ensemble averaging complete! Ready for threshold selection.")


🔍 Computing averaged ensemble predictions for Bufo bufo...
✅ Saved averaged predictions for Bufo bufo at C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_Ensemble\Bufo bufo_Final_Ensemble_Predictions.csv
🔍 Computing averaged ensemble predictions for Rana temporaria...
✅ Saved averaged predictions for Rana temporaria at C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_Ensemble\Rana temporaria_Final_Ensemble_Predictions.csv
🔍 Computing averaged ensemble predictions for Lissotriton helveticus...
✅ Saved averaged predictions for Lissotriton helveticus at C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_Ensemble\Lissotriton helveticus_Final_Ensemble_Predictions.csv

🚀 Ensemble averaging complete! Ready for threshold selection.


### 3.3 Compute Weighted Ensemble Predictions

In [4]:
import os
import pandas as pd
import numpy as np

# 🗂 Define directories
model_dirs = {
    "GLM_Lasso": r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_GLM\GLM_Predictions",
    "GLM_Ridge": r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_GLM\GLM_Predictions",
    "GAM": r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_GAM\GAM_Predictions",
    "RF": r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\RandomForest",
    "XGBoost": r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\XGBoost",
    "MaxEnt": r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Maxent"
}

output_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Weighted_Ensemble"
os.makedirs(output_dir, exist_ok=True)

# 📌 Model Performance Metrics (AUC, F1, Precision, Recall)
model_metrics = {
    "Bufo bufo": {
        "GLM_Lasso": (0.820, 0.414, 0.278, 0.809),
        "GLM_Ridge": (0.819, 0.415, 0.279, 0.809),
        "GAM": (0.803, 0.412, 0.288, 0.725),
        "RF": (0.910, 0.896, 0.994, 0.816),
        "XGBoost": (0.875, 0.805, 0.809, 0.801),
        "MaxEnt": (0.867, 0.530, 0.419, 0.720),
    },
    "Rana temporaria": {
        "GLM_Lasso": (0.836, 0.362, 0.245, 0.696),
        "GLM_Ridge": (0.836, 0.362, 0.245, 0.696),
        "GAM": (0.837, 0.394, 0.273, 0.712),
        "RF": (0.949, 0.927, 1.000, 0.863),
        "XGBoost": (0.909, 0.846, 0.837, 0.855),
        "MaxEnt": (0.869, 0.467, 0.363, 0.656),
    },
    "Lissotriton helveticus": {
        "GLM_Lasso": (0.813, 0.423, 0.297, 0.736),
        "GLM_Ridge": (0.818, 0.417, 0.289, 0.744),
        "GAM": (0.778, 0.351, 0.250, 0.587),
        "RF": (0.915, 0.870, 0.989, 0.777),
        "XGBoost": (0.834, 0.779, 0.803, 0.757),
        "MaxEnt": (0.837, 0.440, 0.336, 0.636),
    }
}

# Weighting Factors (Adjustable)
alpha, beta, gamma, delta = 0.3, 0.3, 0.2, 0.2  # AUC, F1, Precision, Recall weights

# 🔄 Iterate through species
for species in model_metrics.keys():
    print(f"🔍 Computing weighted ensemble predictions for {species}...")

    # Compute model weights using weighted sum of metrics
    raw_weights = {}
    for model, metrics in model_metrics[species].items():
        auc, f1, precision, recall = metrics
        weight = (alpha * auc) + (beta * f1) + (gamma * precision) + (delta * recall)
        raw_weights[model] = weight

    # Normalize weights to sum to 1
    total_weight = sum(raw_weights.values())
    model_weights = {model: weight / total_weight for model, weight in raw_weights.items()}

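    # Load model predictions and accumulate the weighted probabilities
    weighted_sum = None
    used_weight = 0.0  # Total weight of the models that were actually loaded

    for model, weight in model_weights.items():
        # NOTE: this naming pattern differs from the per-model patterns in Section 3.1;
        # any model whose file does not follow it is skipped below.
        file_path = os.path.join(model_dirs[model], f"{species}_{model}_TestPredictions.csv")

        if not os.path.exists(file_path):
            print(f"⚠️ Missing predictions for {model} - {species}. Skipping.")
            continue

        df = pd.read_csv(file_path)

        # Assumes the first two columns are the true label and the predicted probability
        df = df.rename(columns={df.columns[0]: "True_Label", df.columns[1]: f"{model}_Probability"})

        # Add this model's weighted probability as a new column
        if weighted_sum is None:
            weighted_sum = df[["True_Label"]].copy()
        weighted_sum[f"{model}_Probability"] = df[f"{model}_Probability"] * weight
        used_weight += weight

    # Skip the species if no model predictions could be loaded
    if weighted_sum is None:
        print(f"⚠️ No predictions available for {species}. Skipping.")
        continue

    # Sum the weighted columns and renormalise by the weight of the loaded models,
    # so the result stays a valid probability even when some models were skipped
    weighted_sum["Weighted_Ensemble_Average"] = weighted_sum.iloc[:, 1:].sum(axis=1) / used_weight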

    # Save final weighted predictions
    output_file = os.path.join(output_dir, f"{species}_Weighted_Ensemble_Predictions.csv")
    weighted_sum[["True_Label", "Weighted_Ensemble_Average"]].to_csv(output_file, index=False)

    print(f"✅ Saved weighted predictions for {species} at {output_file}")

print("\n🚀 Weighted ensemble averaging complete! Ready for threshold selection.")


🔍 Computing weighted ensemble predictions for Bufo bufo...
⚠️ Missing predictions for GAM - Bufo bufo. Skipping.
⚠️ Missing predictions for RF - Bufo bufo. Skipping.
⚠️ Missing predictions for XGBoost - Bufo bufo. Skipping.
⚠️ Missing predictions for MaxEnt - Bufo bufo. Skipping.
✅ Saved weighted predictions for Bufo bufo at C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Weighted_Ensemble\Bufo bufo_Weighted_Ensemble_Predictions.csv
🔍 Computing weighted ensemble predictions for Rana temporaria...
⚠️ Missing predictions for GAM - Rana temporaria. Skipping.
⚠️ Missing predictions for RF - Rana temporaria. Skipping.
⚠️ Missing predictions for XGBoost - Rana temporaria. Skipping.
⚠️ Missing predictions for MaxEnt - Rana temporaria. Skipping.
✅ Saved weighted predictions for Rana temporaria at C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Weighted_Ensemble\Rana temporaria_Weighted_Ensemble_Predictions.csv
🔍 Computing weighted ensemble predictions for Lissotriton helveticus...
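
The warnings above most likely arise because the cell looks for `{species}_{model}_TestPredictions.csv` in every model directory, whereas Section 3.1 reads the RF, XGBoost, and MaxEnt predictions from differently named files. One possible remedy, sketched below, is to resolve paths through a per-model pattern table. The patterns for RF, XGBoost, and MaxEnt are copied from Section 3.1; the fallback pattern for the remaining models and the helper name `prediction_path` are assumptions.

```python
import os

# Patterns copied from Section 3.1 (species names use underscores there)
file_patterns = {
    "RF": os.path.join("{species_id}", "Test_Predictions.csv"),
    "XGBoost": os.path.join("{species_id}", "Aggregated_Test_Predictions.csv"),
    "MaxEnt": "Maxent_{species_id}_TestPredictions.csv",
}

# Assumed fallback for models without a known pattern (the one used in Section 3.3)
default_pattern = "{species}_{model}_TestPredictions.csv"


def prediction_path(model_dir, species, model):
    """Build the expected prediction file path for one model and species."""
    pattern = file_patterns.get(model, default_pattern)
    file_name = pattern.format(
        species=species,
        species_id=species.replace(" ", "_"),
        model=model,
    )
    return os.path.join(model_dir, file_name)


# Relative directory names are used here purely for illustration
rf_path = prediction_path("RandomForest", "Bufo bufo", "RF")
glm_path = prediction_path("GLM_Predictions", "Bufo bufo", "GLM_Lasso")
print(rf_path)
print(glm_path)
```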

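
To make the weighting scheme concrete, the sketch below recomputes the normalised model weights for *Bufo bufo* from the metrics and the weighting factors (0.3, 0.3, 0.2, 0.2) listed in Section 3.3. It only reproduces the arithmetic; no prediction files are read.

```python
# Metrics for Bufo bufo as listed in Section 3.3: (AUC, F1, Precision, Recall)
metrics = {
    "GLM_Lasso": (0.820, 0.414, 0.278, 0.809),
    "GLM_Ridge": (0.819, 0.415, 0.279, 0.809),
    "GAM": (0.803, 0.412, 0.288, 0.725),
    "RF": (0.910, 0.896, 0.994, 0.816),
    "XGBoost": (0.875, 0.805, 0.809, 0.801),
    "MaxEnt": (0.867, 0.530, 0.419, 0.720),
}

alpha, beta, gamma, delta = 0.3, 0.3, 0.2, 0.2  # AUC, F1, Precision, Recall weights

# Raw score: weighted sum of the four metrics for each model
raw = {
    model: alpha * auc + beta * f1 + gamma * precision + delta * recall
    for model, (auc, f1, precision, recall) in metrics.items()
}

# Normalise so the weights sum to 1
total = sum(raw.values())
weights = {model: score / total for model, score in raw.items()}

for model, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{model}: {weight:.3f}")
```

With these inputs RF receives the largest weight and GAM the smallest, although the spread between them is modest because all four metrics enter the score additively.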

### 3.4 Selecting the Optimal Threshold for Presence-Absence Classification
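
One common way to choose a presence-absence threshold is to maximise the True Skill Statistic (sensitivity + specificity − 1) over a set of candidate cut-offs. The sketch below is an assumption about how this step could be approached, using hypothetical labels and probabilities; it is not necessarily the criterion adopted in this study.

```python
import numpy as np


def best_tss_threshold(y_true, y_prob, candidates=None):
    """Return the candidate threshold with the highest TSS, and that TSS.

    Assumes y_true contains both presences (1) and absences (0).
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    if candidates is None:
        candidates = np.arange(1, 20) / 20  # 0.05, 0.10, ..., 0.95

    best_t, best_tss = None, -np.inf
    for t in candidates:
        y_pred = y_prob >= t
        tp = np.sum(y_pred & (y_true == 1))
        fn = np.sum(~y_pred & (y_true == 1))
        tn = np.sum(~y_pred & (y_true == 0))
        fp = np.sum(y_pred & (y_true == 0))
        tss = tp / (tp + fn) + tn / (tn + fp) - 1
        # Strict comparison keeps the lowest threshold when several tie
        if tss > best_tss:
            best_t, best_tss = float(t), float(tss)
    return best_t, best_tss


# Hypothetical test labels and ensemble probabilities
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]

threshold, tss = best_tss_threshold(y_true, y_prob)
print(threshold, tss)  # 0.35 0.75
```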
