# Ensemble Modelling

## Table of Contents

1. [Introduction](#1.-Introduction)
2. [Ensemble Methodology](#2.-Ensemble-Methodology)
3. [Ensemble Modelling](#3.-Ensemble-Modelling)

## 1. Introduction

Ensemble modelling is a machine learning technique that combines multiple individual models to improve predictive performance. By aggregating the strengths of diverse models, ensembles often achieve better accuracy and robustness than any single model alone. 

In the context of species distribution modelling (SDM), ensemble approaches integrate predictions from various statistical techniques to enhance the reliability of forecasts. This method accounts for uncertainties inherent in individual models, leading to more robust predictions.

### **Common Ensemble Methods**

1. **Bagging (Bootstrap Aggregating)**: This technique involves training multiple models on different subsets of the data, created through random sampling with replacement. The final prediction is typically an average (for regression) or majority vote (for classification) of the individual models' outputs. 

2. **Boosting**: Boosting sequentially trains models, each focusing on correcting the errors of its predecessor. Models are weighted based on their performance, and the ensemble combines them to produce a strong predictor. 

3. **Stacking**: In stacking, multiple models are trained to predict the same outcome. Their predictions are then used as inputs for a higher-level model, which learns to combine them optimally.
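
As a rough illustration of the bagging idea described above, the sketch below fits a simple linear model to several bootstrap resamples of a toy dataset and averages the resulting predictions. The data, the helper name `bagged_predict`, and the choice of a linear fit are assumptions made for illustration only; they are not the models used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise (purely illustrative)
x = np.linspace(0, 1, 50)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)

def bagged_predict(x_train, y_train, x_new, n_models=25):
    """Average the predictions of linear fits trained on bootstrap resamples."""
    preds = []
    for _ in range(n_models):
        # Sample row indices with replacement (the bootstrap step)
        idx = rng.integers(0, x_train.size, size=x_train.size)
        slope, intercept = np.polyfit(x_train[idx], y_train[idx], deg=1)
        preds.append(slope * x_new + intercept)
    # Aggregate by averaging, as is usual for regression
    return np.mean(preds, axis=0)

x_new = np.array([0.25, 0.75])
bagged = bagged_predict(x, y, x_new)
print(bagged)
```

Boosting and stacking differ in how the component models are trained and combined, but the final aggregation step follows the same pattern of pooling several sets of predictions.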

## 2. Ensemble Methodology

### **2.1 Selection of Models for Ensemble**
Based on the previous model evaluation and comparison, Random Forest (RF) and XGBoost outperformed the other models, with the highest precision, recall, and F1-score for all three species. AUC-ROC followed the same pattern except for Lissotriton helveticus, where MaxEnt (0.837) marginally exceeded XGBoost (0.834). MaxEnt showed moderate performance overall: its recall was reasonable, but its low precision suggests a tendency to overpredict. GLM and GAM performed worst overall, indicating that they may not fully capture the complexity of amphibian distributions.

Thus, this study will prioritise RF and XGBoost as the core models in the ensemble and consider MaxEnt for added diversity while downweighting its influence. GLM and GAM may still contribute to the ensemble for additional variance but will not drive final predictions.

### **2.2 Model Weighting and Aggregation Methods**
To integrate multiple models, this study will explore different ensemble techniques:

#### 1. Averaging Ensemble:
- Compute the mean probability of presence across RF, XGBoost, and MaxEnt.
- Weight models according to their precision and recall (e.g., RF and XGBoost given higher weight, MaxEnt downweighted).
#### 2. Majority Voting Ensemble (for binary presence/absence predictions):
- Classify a species as present if at least two out of three models predict presence.
#### 3. Stacked Ensemble (if time allows):
- Train a meta-classifier (e.g., logistic regression) using predictions from individual models as inputs.
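
The first two aggregation rules can be sketched in a few lines. The probabilities and the weights below are made-up values chosen only to show the arithmetic; the weights used in this study are derived from the model metrics in Section 3.2.

```python
import numpy as np

# Illustrative presence probabilities for four sites (rows)
# from RF, XGBoost and MaxEnt (columns)
probs = np.array([
    [0.90, 0.80, 0.60],
    [0.20, 0.40, 0.70],
    [0.55, 0.45, 0.65],
    [0.10, 0.05, 0.30],
])

# Hypothetical weights favouring the tree-based models; they sum to 1
weights = np.array([0.4, 0.4, 0.2])

# 1. Weighted average of the model probabilities
weighted_mean = probs @ weights
print(weighted_mean)  # approximately [0.80, 0.38, 0.53, 0.12]

# 2. Majority vote: present if at least two of three models reach 0.5
votes = (probs >= 0.5).sum(axis=1)
majority = (votes >= 2).astype(int)
print(majority)  # [1 0 1 0]
```

The second site shows how the two rules can diverge from a single model: MaxEnt alone predicts presence, but it is outvoted and its lower weight keeps the averaged probability below 0.5.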

### **2.3 Calibration and Performance Evaluation**
To ensure the ensemble predictions are robust, the following evaluation metrics will be recalculated:

- AUC-ROC and Precision-Recall curves
- Sensitivity-specificity trade-offs
- Confusion matrix analysis
- Uncertainty quantification (standard deviation in predictions)

The ensemble's performance will be compared to individual models to determine whether it achieves higher predictive accuracy and reliability.
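
One simple way to express the uncertainty measure listed above is the per-site standard deviation of the individual model probabilities. The sketch below uses invented values and assumes the population standard deviation (NumPy's default); the study may use a different estimator.

```python
import numpy as np

# Illustrative probabilities from three models (columns) at two sites (rows)
probs = np.array([
    [0.90, 0.85, 0.80],   # models largely agree
    [0.10, 0.50, 0.90],   # models disagree strongly
])

# Standard deviation across models for each site; higher means more disagreement
uncertainty = probs.std(axis=1)
print(uncertainty.round(3))  # [0.041 0.327]
```

Sites with a high value are those where the models disagree most, which is the quantity an uncertainty map would display.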

### **2.4 Spatial Mapping of Ensemble Predictions**
Once ensemble predictions are finalised, they will be spatially visualised using GIS tools to assess habitat suitability for target amphibian species. Uncertainty maps will also be generated to highlight regions with high model disagreement.

### **2.5 Methodology Rationale**
This study aims to leverage the advantages of ensemble modelling to provide more accurate, reliable, and ecologically meaningful habitat suitability predictions. The rationale for this approach is:
1. Tree-based models (RF and XGBoost) demonstrate strong performance and capture complex species-environment relationships.
2. MaxEnt contributes additional ecological insights and has been widely used in SDMs, but its predictions will be weighted lower to account for overprediction tendencies.
3. Averaging and majority voting improve robustness, ensuring predictions are not overly reliant on any single model.
4. Uncertainty quantification will guide conservation decision-making, particularly for identifying regions where predictions are less certain.

By following this approach, the ensemble model will integrate the strengths of individual models, enhance predictive reliability, and contribute valuable insights for amphibian conservation and Blue-Green Infrastructure planning.

## 3. Ensemble Modelling
### 3.1 Load Model Predictions and Prepare for Ensemble

In [2]:
import os
import pandas as pd
import numpy as np

# Define base directory
base_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models"

# Define species list
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Define selected models for ensemble
selected_models = ["RF", "XGBoost", "MaxEnt"]

# Define paths for each model
model_dirs = {
    "RF": os.path.join(base_dir, "RandomForest"),
    "XGBoost": os.path.join(base_dir, "XGBoost"),
    "MaxEnt": os.path.join(base_dir, "Maxent")
}

# Define file patterns for each model
file_patterns = {
    "RF": os.path.join("{species}", "Test_Predictions.csv"),
    "XGBoost": os.path.join("{species}", "Aggregated_Test_Predictions.csv"),
    "MaxEnt": "Maxent_{species}_TestPredictions.csv"
}

# Define output directory for merged predictions
ensemble_output_dir = os.path.join(base_dir, "Ensemble_Predictions")
os.makedirs(ensemble_output_dir, exist_ok=True)

# Iterate over each species
for species in species_list:
    print(f"🔍 Processing ensemble predictions for {species}...")

    merged_df = None  # Initialize dataframe for storing merged predictions

    for model in selected_models:
        formatted_species = species.replace(" ", "_")  # Adjust for file naming
        file_path = os.path.join(model_dirs[model], file_patterns[model].format(species=formatted_species))

        if not os.path.exists(file_path):
            print(f"⚠️ Missing prediction file for {species} - {model}: {file_path}")
            continue  # Skip this model if the file is missing

        # Load model predictions with compact dtypes to limit memory use
        try:
            df = pd.read_csv(file_path, dtype={"True_Label": "int8", "Predicted_Probability": "float32"})
        except Exception as e:
            print(f"⚠️ Error loading {species} - {model}: {e}")
            continue

        # Rename the probability column so each model keeps its own column
        df = df.rename(columns={"Predicted_Probability": f"{model}_Probability"})

        # Reduce memory usage
        df[f"{model}_Probability"] = df[f"{model}_Probability"].astype(np.float32)

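        # Merge into a single dataframe. pd.concat(axis=1) aligns rows by index,
        # so files with different row counts leave NaN in the shorter columns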
        if merged_df is None:
            merged_df = df.copy()
        else:
            merged_df = pd.concat([merged_df, df[f"{model}_Probability"]], axis=1)

        del df  # Free up memory

    # Save the merged predictions
    if merged_df is not None and not merged_df.empty:
        output_file = os.path.join(ensemble_output_dir, f"{species}_Ensemble_Predictions.csv")
        merged_df.to_csv(output_file, index=False)
        print(f"✅ Merged predictions saved: {output_file}")
    else:
        print(f"⚠️ No valid predictions available for {species}.")

    del merged_df  # Free memory after each species

print("\n🚀 Ensemble prediction files ready for next steps!")


🔍 Processing ensemble predictions for Bufo bufo...
✅ Merged predictions saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Ensemble_Predictions\Bufo bufo_Ensemble_Predictions.csv
🔍 Processing ensemble predictions for Rana temporaria...
✅ Merged predictions saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Ensemble_Predictions\Rana temporaria_Ensemble_Predictions.csv
🔍 Processing ensemble predictions for Lissotriton helveticus...
✅ Merged predictions saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Ensemble_Predictions\Lissotriton helveticus_Ensemble_Predictions.csv

🚀 Ensemble prediction files ready for next steps!
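
Concatenating columns side by side assumes that every prediction file lists the same test points in the same order. If the files share an identifier, joining on it makes that assumption explicit. The sketch below uses toy tables and a hypothetical `Point_ID` column; whether the real prediction files contain such a column is not known from this notebook.

```python
import pandas as pd

# Toy prediction tables; "Point_ID" is a hypothetical identifier column
rf = pd.DataFrame({
    "Point_ID": [1, 2, 3],
    "True_Label": [1, 0, 1],
    "RF_Probability": [0.9, 0.2, 0.7],
})
xgb = pd.DataFrame({
    "Point_ID": [3, 1, 2, 4],
    "XGBoost_Probability": [0.6, 0.8, 0.3, 0.5],
})

# An inner join keeps only points scored by both models, matched by ID;
# validate raises an error if an ID appears more than once in either table
merged = rf.merge(xgb, on="Point_ID", how="inner", validate="one_to_one")
print(merged)
```

Point 4 has no RF prediction and is dropped, so the merged table contains no missing values and each row compares predictions for the same point.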


### 3.2 Weighted Ensemble Averaging

**Why Use Averaging?**
- This approach reduces individual model biases and leverages the strengths of multiple models.
- Averaging probabilities smooths extreme values, leading to better generalisation.
- It is less prone to overfitting compared to single models.

The code below assigns weights based on **multiple performance metrics** (AUC-ROC, Precision, Recall, and F1-score), normalises them, and applies a weighted averaging scheme.

In [5]:
import os
import pandas as pd
import numpy as np

# Define input and output directories
ensemble_input_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Ensemble_Predictions"
ensemble_output_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Weighted_Ensemble"
os.makedirs(ensemble_output_dir, exist_ok=True)

# Define species list
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Define performance metrics for weighting (these values should be updated with your actual model metrics)
performance_metrics = {
    "Bufo bufo": {
        "RF": {"AUC-ROC": 0.910, "Precision": 0.994, "Recall": 0.816, "F1": 0.896},
        "XGBoost": {"AUC-ROC": 0.875, "Precision": 0.809, "Recall": 0.801, "F1": 0.805},
        "MaxEnt": {"AUC-ROC": 0.867, "Precision": 0.419, "Recall": 0.720, "F1": 0.530},
    },
    "Rana temporaria": {
        "RF": {"AUC-ROC": 0.949, "Precision": 1.000, "Recall": 0.863, "F1": 0.927},
        "XGBoost": {"AUC-ROC": 0.909, "Precision": 0.837, "Recall": 0.855, "F1": 0.846},
        "MaxEnt": {"AUC-ROC": 0.869, "Precision": 0.363, "Recall": 0.656, "F1": 0.467},
    },
    "Lissotriton helveticus": {
        "RF": {"AUC-ROC": 0.915, "Precision": 0.989, "Recall": 0.777, "F1": 0.870},
        "XGBoost": {"AUC-ROC": 0.834, "Precision": 0.803, "Recall": 0.757, "F1": 0.779},
        "MaxEnt": {"AUC-ROC": 0.837, "Precision": 0.336, "Recall": 0.636, "F1": 0.440},
    },
}

# Compute normalised model weights separately for each species
for species in species_list:
    model_weights = {}
    for model in performance_metrics[species].keys():
        # Average the four metric scores for balanced weighting
        metrics = performance_metrics[species][model]
        model_weights[model] = np.mean([metrics["AUC-ROC"], metrics["Precision"], metrics["Recall"], metrics["F1"]])

    # Normalize weights to sum to 1
    total_weight = sum(model_weights.values())
    for model in model_weights.keys():
        model_weights[model] /= total_weight  # Scale to sum up to 1

    performance_metrics[species]["Weights"] = model_weights  # Store normalised weights

# Iterate through each species
for species in species_list:
    print(f"🔍 Computing weighted ensemble predictions for {species}...")

    # Load ensemble predictions file
    file_path = os.path.join(ensemble_input_dir, f"{species}_Ensemble_Predictions.csv")

    if not os.path.exists(file_path):
        print(f"⚠️ Missing ensemble file for {species}. Skipping.")
        continue

    df = pd.read_csv(file_path)

    # Identify probability columns
    probability_columns = [col for col in df.columns if col.endswith("_Probability")]

    if len(probability_columns) == 0:
        print(f"⚠️ No probability columns found for {species}. Skipping.")
        continue

    # Compute weighted ensemble probability
    weighted_predictions = np.zeros(len(df))

    for model in probability_columns:
        model_name = model.replace("_Probability", "")  # Extract model name
        if model_name in performance_metrics[species]["Weights"]:
            weighted_predictions += df[model] * performance_metrics[species]["Weights"][model_name]
        else:
            print(f"⚠️ Missing weight for {model_name} in {species}. Skipping.")

    df["Weighted_Probability"] = weighted_predictions

    # Save the weighted ensemble predictions
    output_file = os.path.join(ensemble_output_dir, f"{species}_Weighted_Ensemble_Predictions.csv")
    df.to_csv(output_file, index=False)

    print(f"✅ Saved weighted predictions for {species} at {output_file}")

print("\n🚀 Weighted ensemble averaging complete! Ready for threshold selection.")


🔍 Computing weighted ensemble predictions for Bufo bufo...
✅ Saved weighted predictions for Bufo bufo at C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Weighted_Ensemble\Bufo bufo_Weighted_Ensemble_Predictions.csv
🔍 Computing weighted ensemble predictions for Rana temporaria...
✅ Saved weighted predictions for Rana temporaria at C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Weighted_Ensemble\Rana temporaria_Weighted_Ensemble_Predictions.csv
🔍 Computing weighted ensemble predictions for Lissotriton helveticus...
✅ Saved weighted predictions for Lissotriton helveticus at C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Weighted_Ensemble\Lissotriton helveticus_Weighted_Ensemble_Predictions.csv

🚀 Weighted ensemble averaging complete! Ready for threshold selection.
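
To make the weighting scheme concrete, the snippet below reproduces the calculation for Bufo bufo alone, using the metric values listed in the cell above. It is a standalone worked example rather than part of the pipeline, and the variable names are chosen here for illustration.

```python
# Metric values for Bufo bufo, in the order AUC-ROC, Precision, Recall, F1
bufo_metrics = {
    "RF": [0.910, 0.994, 0.816, 0.896],
    "XGBoost": [0.875, 0.809, 0.801, 0.805],
    "MaxEnt": [0.867, 0.419, 0.720, 0.530],
}

# Step 1: average the four metrics for each model
raw_scores = {model: sum(v) / len(v) for model, v in bufo_metrics.items()}

# Step 2: divide by the total so that the weights sum to 1
total = sum(raw_scores.values())
weights = {model: score / total for model, score in raw_scores.items()}

for model, weight in weights.items():
    print(f"{model}: {weight:.3f}")
# RF: 0.383
# XGBoost: 0.348
# MaxEnt: 0.269
```

Under this scheme MaxEnt still contributes roughly 27% of the ensemble for Bufo bufo, so the downweighting relative to an equal-weight average (33% each) is modest.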


### 3.3 Selecting the Optimal Threshold for Presence-Absence Classification

Now that the weighted ensemble predictions have been generated, the next step is threshold selection, which converts probability predictions into binary presence/absence values. We have multiple options to determine the best threshold:

#### 1. Maximising the F1-score:
* The best balance between precision and recall.
* Ideal if you want to avoid too many false positives or false negatives.

#### 2. Maximising the Youden Index (J-Statistic):
* The threshold that maximises (Sensitivity + Specificity - 1).
* Ensures both presence and absence are well-classified.

#### 3. Fixed Threshold (e.g., 0.5):
* Simple but not species-specific.
* May not be optimal given imbalanced data.

All three methods will be implemented and compared.
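
For reference, the two optimisation criteria can be written out as follows, where $P$ is precision, $R$ is recall, TPR is the true positive rate (sensitivity), and FPR is the false positive rate (1 - specificity). These are the standard definitions and are not specific to this study:

$$
F_1 = \frac{2PR}{P + R}, \qquad J = \text{sensitivity} + \text{specificity} - 1 = \text{TPR} - \text{FPR}
$$

The second identity is why the Youden Index can be computed directly from the outputs of an ROC curve.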

In [11]:
import os
import pandas as pd

# Define directory for weighted ensemble predictions
weighted_ensemble_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Weighted_Ensemble"

# Define species list
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Iterate through each species
for species in species_list:
    file_path = os.path.join(weighted_ensemble_dir, f"{species}_Weighted_Ensemble_Predictions.csv")

    if not os.path.exists(file_path):
        print(f"⚠️ Missing weighted ensemble file for {species}. Skipping.")
        continue

    # Load data
    df = pd.read_csv(file_path)

    # Check for NaN values
    nan_counts = df.isnull().sum()
    print(f"\n🔍 {species} - Missing Values:\n{nan_counts}\n")

    # Drop rows with NaN values
    df = df.dropna()

    # Ensure True_Label is binary (0 or 1) and integer
    df["True_Label"] = df["True_Label"].astype(int)

    # Ensure Weighted_Probability is numeric
    df["Weighted_Probability"] = df["Weighted_Probability"].astype(float)

    # Save cleaned dataset
    df.to_csv(file_path, index=False)
    print(f"✅ Cleaned and saved: {file_path}")

print("\n🚀 Data cleaning complete! Now re-run threshold selection.")



🔍 Bufo bufo - Missing Values:
True_Label              3870
RF_Probability          3870
XGBoost_Probability        0
MaxEnt_Probability      2405
Weighted_Probability    3870
dtype: int64

✅ Cleaned and saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Weighted_Ensemble\Bufo bufo_Weighted_Ensemble_Predictions.csv

🔍 Rana temporaria - Missing Values:
True_Label              5994
RF_Probability          5994
XGBoost_Probability        0
MaxEnt_Probability      2756
Weighted_Probability    5994
dtype: int64

✅ Cleaned and saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Weighted_Ensemble\Rana temporaria_Weighted_Ensemble_Predictions.csv

🔍 Lissotriton helveticus - Missing Values:
True_Label              1827
RF_Probability          1827
XGBoost_Probability        0
MaxEnt_Probability       981
Weighted_Probability    1827
dtype: int64

✅ Cleaned and saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Weighted_Ensemble\Lissotriton helveticus_Weighted_Ensemb

In [12]:
import os
import pandas as pd
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve

# Define input/output directories
weighted_ensemble_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Weighted_Ensemble"
binary_output_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_Binary"
os.makedirs(binary_output_dir, exist_ok=True)

# Define species list
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Function to find the best threshold using F1-score and Youden Index
def find_best_threshold(y_true, y_pred_prob):
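    # Compute F1-score optimal threshold
    precisions, recalls, thresholds_pr = precision_recall_curve(y_true, y_pred_prob)
    # precision_recall_curve returns one more precision/recall value than thresholds,
    # so drop the final point (precision=1, recall=0) to keep the indices aligned
    precisions, recalls = precisions[:-1], recalls[:-1]
    f1_scores = (2 * precisions * recalls) / (precisions + recalls + 1e-9)  # Avoid division by zero
    best_f1_threshold = thresholds_pr[np.argmax(f1_scores)]

    # Compute Youden's J Index optimal threshold (J = sensitivity + specificity - 1 = TPR - FPR)
    fpr, tpr, thresholds_roc = roc_curve(y_true, y_pred_prob)
    youden_index = tpr - fpr
    best_youden_threshold = thresholds_roc[np.argmax(youden_index)]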

    return best_f1_threshold, best_youden_threshold

# Iterate through each species
for species in species_list:
    print(f"🔍 Selecting threshold for {species}...")

    # Load weighted ensemble predictions
    file_path = os.path.join(weighted_ensemble_dir, f"{species}_Weighted_Ensemble_Predictions.csv")

    if not os.path.exists(file_path):
        print(f"⚠️ Missing weighted ensemble file for {species}. Skipping.")
        continue

    df = pd.read_csv(file_path)

    # Ensure required columns exist
    if "True_Label" not in df.columns or "Weighted_Probability" not in df.columns:
        print(f"⚠️ Columns missing in {species} predictions. Skipping.")
        continue

    y_true = df["True_Label"].values
    y_pred_prob = df["Weighted_Probability"].values

    # Compute best thresholds
    best_f1_threshold, best_youden_threshold = find_best_threshold(y_true, y_pred_prob)

    # Apply thresholds to create binary presence/absence classifications
    df["Binary_F1"] = (df["Weighted_Probability"] >= best_f1_threshold).astype(int)
    df["Binary_Youden"] = (df["Weighted_Probability"] >= best_youden_threshold).astype(int)
    df["Binary_0.5"] = (df["Weighted_Probability"] >= 0.5).astype(int)  # Fixed threshold

    # Save binary classification results
    output_file = os.path.join(binary_output_dir, f"{species}_Final_Binary_Predictions.csv")
    df.to_csv(output_file, index=False)
    
    print(f"✅ Saved binary predictions for {species} at {output_file}")

print("\n🚀 Threshold selection complete! Ready for final evaluation.")


🔍 Selecting threshold for Bufo bufo...
✅ Saved binary predictions for Bufo bufo at C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_Binary\Bufo bufo_Final_Binary_Predictions.csv
🔍 Selecting threshold for Rana temporaria...
✅ Saved binary predictions for Rana temporaria at C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_Binary\Rana temporaria_Final_Binary_Predictions.csv
🔍 Selecting threshold for Lissotriton helveticus...
✅ Saved binary predictions for Lissotriton helveticus at C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_Binary\Lissotriton helveticus_Final_Binary_Predictions.csv

🚀 Threshold selection complete! Ready for final evaluation.
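
A possible next step is to compare the three binary columns against the true labels. The sketch below is one way to do that with plain NumPy; the helper `binary_metrics` and the toy data are illustrative assumptions, not the evaluation code used in this study.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Return precision, recall, specificity and F1 for 0/1 arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    # Guard each ratio against an empty denominator
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Toy labels and ensemble probabilities (illustrative only)
y_true = np.array([1, 1, 1, 0, 0, 0])
prob = np.array([0.9, 0.6, 0.4, 0.45, 0.2, 0.1])

# Evaluate two candidate thresholds on the same predictions
results = {t: binary_metrics(y_true, (prob >= t).astype(int)) for t in (0.3, 0.5)}
for threshold, metrics in results.items():
    print(threshold, metrics)
```

In the notebook, the same helper could be applied to the `Binary_F1`, `Binary_Youden` and `Binary_0.5` columns of each species file. In the toy data, lowering the threshold from 0.5 to 0.3 raises recall from 0.67 to 1.0 at the cost of precision, which falls from 1.0 to 0.75.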

