# Model Comparison and Ensemble Modelling

## Table of Contents
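1. [Introduction](#1.-Introduction)
2. [Data Preparation](#2.-Data-Preparation)
3. [Model Performance Evaluation](#3.-Model-Performance-Evaluation)
4. [Visualisation of Model Performance](#4.-Visualisation-of-Model-Performance)
5. [Ensemble Modelling](#5.-Ensemble-Modelling)
6. [Evaluation of Ensemble Models](#6.-Evaluation-of-Ensemble-Models)
7. [Conclusion and Next Steps](#7.-Conclusion-and-Next-Steps)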

## 1. Introduction

### **Objective:**
To assess and compare the performance of five species distribution models (SDMs), namely GLM, GAM, RF, XGBoost, and MaxEnt, and to explore ensemble modelling techniques that may improve species distribution predictions.

### **Rationale:**
Comparing different models allows us to identify strengths and weaknesses in their predictive capabilities. Ensemble modelling combines multiple models to leverage their individual strengths, often leading to improved accuracy and robustness in predictions (Meller et al., 2014). 

In [1]:
import os
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    roc_auc_score, roc_curve, precision_recall_curve, 
    precision_score, recall_score, f1_score, confusion_matrix
)


## 2. Data Preparation

### **Steps:**

#### 1. Load Test Predictions:
- Import the test prediction results from each model for all species.

#### 2. Ensure Consistency:
- Verify that all datasets have consistent formatting, with aligned columns for true labels and predicted probabilities.

### **Rationale:**
Consistent data formatting is crucial for accurate performance evaluation and comparison across models.

In [18]:
import os
import joblib
import pandas as pd

# Base directory where all models & results are stored
base_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models"

# Species list
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Dictionary to store all loaded data
models = {"GLM": {}, "GAM": {}, "RF": {}, "XGBoost": {}, "MaxEnt": {}}
metrics = {"GLM": {}, "GAM": {}, "RF": {}, "XGBoost": {}, "MaxEnt": {}}
test_predictions = {"GLM": {}, "GAM": {}, "RF": {}, "XGBoost": {}, "MaxEnt": {}}

# =========================== 1️⃣ Load GLM Models & Predictions ===========================
glm_dir = os.path.join(base_dir, "Final_GLM")
glm_pred_dir = os.path.join(glm_dir, "GLM_Predictions")
glm_versions = ["Lasso", "Ridge"]

for species in species_list:
    models["GLM"][species] = {}
    metrics["GLM"][species] = {}
    test_predictions["GLM"][species] = {}  # ✅ Initialise as an empty dictionary

    for glm_type in glm_versions:
        # Load GLM models
        model_path = os.path.join(glm_dir, "Models", f"{species}_GLM_{glm_type}_Threshold_0.3_Model.pkl")
        if os.path.exists(model_path):
            models["GLM"][species][glm_type] = joblib.load(model_path)
        
        # Load GLM test predictions
        pred_path = os.path.join(glm_pred_dir, f"{species}_GLM_{glm_type}_TestPredictions.csv")
        if os.path.exists(pred_path):
            test_predictions["GLM"][species][glm_type] = pd.read_csv(pred_path)

        # Load GLM metrics
        # NOTE: the metrics file name has no glm_type component, so the same
        # species-level text is stored under both "Lasso" and "Ridge"
        metrics_path = os.path.join(glm_dir, f"{species}_Threshold_0.3_Metrics.txt")
        if os.path.exists(metrics_path):
            with open(metrics_path, "r") as f:
                metrics["GLM"][species][glm_type] = f.read()

# =========================== 2️⃣ Load GAM Models & Predictions ===========================
gam_dir = os.path.join(base_dir, "Final_GAM")
gam_pred_dir = os.path.join(gam_dir, "GAM_Predictions")

for species in species_list:
    test_predictions["GAM"][species] = {}  # ✅ Initialise dictionary

    # Load GAM model
    model_path = os.path.join(gam_dir, f"{species}_GAM_Model_CV.pkl")
    if os.path.exists(model_path):
        models["GAM"][species] = joblib.load(model_path)

    # Load GAM test predictions
    pred_path = os.path.join(gam_pred_dir, f"{species}_GAM_TestPredictions.csv")
    if os.path.exists(pred_path):
        test_predictions["GAM"][species] = pd.read_csv(pred_path)

    # Load GAM metrics
    metrics_path = os.path.join(gam_dir, f"{species}_GAM_Test_Metrics.txt")
    if os.path.exists(metrics_path):
        with open(metrics_path, "r") as f:
            metrics["GAM"][species] = f.read()

# =========================== 3️⃣ Load RF Models, Metrics & Predictions ===========================
rf_dir = os.path.join(base_dir, "RandomForest")

for species in species_list:
    test_predictions["RF"][species] = {}  # ✅ Initialise dictionary

    species_dir = os.path.join(rf_dir, species)

    # Load RF model
    model_path = os.path.join(species_dir, "RandomForest_Model.pkl")
    if os.path.exists(model_path):
        models["RF"][species] = joblib.load(model_path)

    # Load RF test predictions
    pred_path = os.path.join(species_dir, "Test_Predictions.csv")
    if os.path.exists(pred_path):
        test_predictions["RF"][species] = pd.read_csv(pred_path)

    # Load RF metrics
    metrics_path = os.path.join(species_dir, "RandomForest_Metrics.txt")
    if os.path.exists(metrics_path):
        with open(metrics_path, "r") as f:
            metrics["RF"][species] = f.read()

# =========================== 4️⃣ Load XGBoost Models, Metrics & Predictions ===========================
xgb_dir = os.path.join(base_dir, "XGBoost")

for species in species_list:
    test_predictions["XGBoost"][species] = {}  # ✅ Initialise dictionary

    species_dir = os.path.join(xgb_dir, species)

    # Load XGBoost model
    model_path = os.path.join(species_dir, "XGBoost_Model.pkl")
    if os.path.exists(model_path):
        models["XGBoost"][species] = joblib.load(model_path)

    # Load XGBoost test predictions
    pred_path = os.path.join(species_dir, "Aggregated_Test_Predictions.csv")
    if os.path.exists(pred_path):
        test_predictions["XGBoost"][species] = pd.read_csv(pred_path)

    # Load XGBoost metrics
    metrics_path = os.path.join(species_dir, "XGBoost_Metrics.txt")
    if os.path.exists(metrics_path):
        with open(metrics_path, "r") as f:
            metrics["XGBoost"][species] = f.read()

# =========================== 5️⃣ Load MaxEnt Test Predictions & Metrics ===========================
maxent_dir = os.path.join(base_dir, "Maxent")

for species in species_list:
    test_predictions["MaxEnt"][species] = {}  # ✅ Initialise dictionary

    # Load MaxEnt test predictions
    pred_path = os.path.join(maxent_dir, f"Maxent_{species.replace(' ', '_')}_TestPredictions.csv")
    if os.path.exists(pred_path):
        test_predictions["MaxEnt"][species] = pd.read_csv(pred_path)

# Load MaxEnt model evaluation metrics
metrics_path = os.path.join(maxent_dir, "Maxent_Model_Evaluation.csv")
if os.path.exists(metrics_path):
    metrics["MaxEnt"] = pd.read_csv(metrics_path)

# =========================== ✅ Summary ===========================
# Missing files are skipped silently above, so this message only confirms that the cell finished
print("\n✅ All models, test predictions, and metrics loaded successfully!")



✅ All models, test predictions, and metrics loaded successfully!


In [19]:
# Load Variable Mapping File (if required)
var_mapping_path = os.path.join(base_dir, "Variable_Mapping.csv")

if os.path.exists(var_mapping_path):
    variable_mapping = pd.read_csv(var_mapping_path)
    print("✅ Variable mapping file loaded successfully!")
else:
    variable_mapping = None
    print("⚠️ Warning: Variable mapping file not found.")


✅ Variable mapping file loaded successfully!


In [22]:
import os
import pandas as pd

# Define base directories for all models
base_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models"

# Define species list
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Define expected columns
expected_columns = ["True_Label", "Predicted_Probability"]

# Define paths for each model
model_dirs = {
    "GLM": os.path.join(base_dir, "Final_GLM", "GLM_Predictions"),
    "GAM": os.path.join(base_dir, "Final_GAM", "GAM_Predictions"),
    "RF": os.path.join(base_dir, "RandomForest"),
    "XGBoost": os.path.join(base_dir, "XGBoost"),
    "MaxEnt": os.path.join(base_dir, "Maxent")
}

# Define filenames (model-specific variations)
file_patterns = {
    "GLM": "{species}_GLM_{glm_type}_TestPredictions.csv",
    "GAM": "{species}_GAM_TestPredictions.csv",
    "RF": os.path.join("{species}", "Test_Predictions.csv"),
    "XGBoost": os.path.join("{species}", "Aggregated_Test_Predictions.csv"),
    "MaxEnt": "Maxent_{species}_TestPredictions.csv"
}

# GLM-specific model types
glm_types = ["Lasso", "Ridge"]

# Dictionary to store results
inconsistencies = {}

# Function to check file existence, column names, and format
def check_predictions():
    global inconsistencies
    inconsistencies = {}  # Reset inconsistencies
    
    for model, model_dir in model_dirs.items():
        inconsistencies[model] = {}
        
        for species in species_list:
            formatted_species = species.replace(" ", "_")  # Adjust species naming for files
            
            if model == "GLM":
                for glm_type in glm_types:
                    file_path = os.path.join(model_dir, file_patterns[model].format(species=species, glm_type=glm_type))
                    check_file(file_path, model, species, glm_type)
            else:
                file_path = os.path.join(model_dir, file_patterns[model].format(species=formatted_species))
                check_file(file_path, model, species)
    
    print("\n✅ Consistency check complete!\n")
    return inconsistencies

# Function to check individual files
def check_file(file_path, model, species, glm_type=None):
    # Key GLM results by species and variant so Lasso and Ridge do not overwrite each other
    key = f"{species} ({glm_type})" if glm_type else species
    inconsistencies.setdefault(model, {})

    # Check if file exists
    if not os.path.exists(file_path):
        inconsistencies[model][key] = f"❌ Missing file: {file_path}"
        return

    # Load the file
    try:
        df = pd.read_csv(file_path)
    except Exception as e:
        inconsistencies[model][key] = f"❌ Error loading file: {e}"
        return

    # Print actual column names for debugging
    print(f"🔍 {model} - {species}: Columns Found -> {df.columns.tolist()}")

    # Automatically detect the label and probability columns (first match wins)
    true_label_col = None
    predicted_prob_col = None

    for col in df.columns:
        name = col.lower()
        if true_label_col is None and ("true" in name or "label" in name):
            true_label_col = col
        elif predicted_prob_col is None and ("prob" in name or "prediction" in name):
            predicted_prob_col = col

    # Rename columns if both were detected
    if true_label_col and predicted_prob_col:
        df = df.rename(columns={true_label_col: "True_Label", predicted_prob_col: "Predicted_Probability"})
    else:
        inconsistencies[model][key] = f"⚠️ Column mismatch: Found {df.columns.tolist()}, expected {expected_columns}"
        return

    issues = []

    # Check for missing values
    if df.isnull().any().any():
        issues.append("⚠️ Missing values found in test predictions")

    # Check data types
    if not pd.api.types.is_integer_dtype(df["True_Label"]):
        issues.append("⚠️ 'True_Label' column should be an integer type")
    if not pd.api.types.is_numeric_dtype(df["Predicted_Probability"]):
        issues.append("⚠️ 'Predicted_Probability' column should be numeric")

    # Record every issue found rather than only the last one
    if issues:
        inconsistencies[model][key] = "; ".join(issues)

    # Save back the corrected version (this overwrites the original CSV in place)
    df.to_csv(file_path, index=False)
    print(f"✅ Standardised and saved: {file_path}")

# Run the consistency check
inconsistencies = check_predictions()

# Print inconsistencies found
for model, issues in inconsistencies.items():
    if issues:
        print(f"\n🔍 Inconsistencies in {model}:")
        for species, issue in issues.items():
            print(f"  - {species}: {issue}")

print("\n🚀 Ready to proceed once inconsistencies are resolved!")


🔍 GLM - Bufo bufo: Columns Found -> ['True_Label', 'Predicted_Probability']
✅ Standardised and saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_GLM\GLM_Predictions\Bufo bufo_GLM_Lasso_TestPredictions.csv
🔍 GLM - Bufo bufo: Columns Found -> ['True_Label', 'Predicted_Probability']
✅ Standardised and saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_GLM\GLM_Predictions\Bufo bufo_GLM_Ridge_TestPredictions.csv
🔍 GLM - Rana temporaria: Columns Found -> ['True_Label', 'Predicted_Probability']
✅ Standardised and saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_GLM\GLM_Predictions\Rana temporaria_GLM_Lasso_TestPredictions.csv
🔍 GLM - Rana temporaria: Columns Found -> ['True_Label', 'Predicted_Probability']
✅ Standardised and saved: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Final_GLM\GLM_Predictions\Rana temporaria_GLM_Ridge_TestPredictions.csv
🔍 GLM - Lissotriton helveticus: Columns Found -> ['True_Label', 'Predicted_Probability']

## 3. Model Performance Evaluation

### **Metrics used in Evaluation**:

- **Area Under the Receiver Operating Characteristic Curve (AUC-ROC):** Measures how well the model separates the two classes across all classification thresholds; 0.5 corresponds to random ranking and 1.0 to perfect separation.
- **Confusion Matrix:** Summarises predictions at a single threshold as counts of true positives, true negatives, false positives, and false negatives.
- **Precision, Recall, and F1-Score:** Precision is the share of predicted positives that are correct, recall is the share of actual positives that are found, and the F1-score is the harmonic mean of the two.

### **Rationale:**
These metrics offer a comprehensive view of each model's performance, highlighting different aspects of predictive accuracy and error rates.
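
As a minimal sketch of how the confusion matrix relates to precision, recall, and F1, the example below thresholds a handful of predicted probabilities and derives the three scores from the four counts. The labels, probabilities, and the 0.5 cut-off are hypothetical values chosen for illustration, not results from the models in this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predicted probabilities (illustration only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3])

threshold = 0.5  # assumed cut-off for turning probabilities into labels
y_pred = (y_prob >= threshold).astype(int)

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Derive the scores directly from the counts, guarding against division by zero
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")
```

Because all three scores depend on the chosen threshold, they can change substantially when the cut-off moves, whereas AUC-ROC does not depend on a single threshold.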

### 3.1 Load all test predictions into a structured format
- Combine results into a single pandas DataFrame for easy comparison.

In [23]:
import os
import pandas as pd
import numpy as np
from sklearn.metrics import (
    roc_auc_score, precision_score, recall_score, f1_score, roc_curve, precision_recall_curve
)


In [25]:
# Base directory containing the test predictions
base_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models"

# Define species list
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Define model names
models = ["GLM_Lasso", "GLM_Ridge", "GAM", "RF", "XGBoost", "MaxEnt"]

# Dictionary to store test predictions
test_predictions = {model: {} for model in models}

# File paths for each model
file_paths = {
    "GLM_Lasso": os.path.join(base_dir, "Final_GLM", "GLM_Predictions", "{species}_GLM_Lasso_TestPredictions.csv"),
    "GLM_Ridge": os.path.join(base_dir, "Final_GLM", "GLM_Predictions", "{species}_GLM_Ridge_TestPredictions.csv"),
    "GAM": os.path.join(base_dir, "Final_GAM", "GAM_Predictions", "{species}_GAM_TestPredictions.csv"),
    "RF": os.path.join(base_dir, "RandomForest", "{species}", "Test_Predictions.csv"),
    "XGBoost": os.path.join(base_dir, "XGBoost", "{species}", "Aggregated_Test_Predictions.csv"),
    "MaxEnt": os.path.join(base_dir, "Maxent", "Maxent_{species}_TestPredictions.csv")
}

# Standard column names for consistency
expected_columns = ["True_Label", "Predicted_Probability"]

# Load each model's test predictions
for model, path_template in file_paths.items():
    for species in species_list:
        formatted_species = species.replace(" ", "_")  # Ensure file names match correctly
        file_path = path_template.format(species=formatted_species)
        
        if os.path.exists(file_path):
            df = pd.read_csv(file_path)

            # Standardise column names if needed
            df.columns = [col.lower().replace(" ", "_") for col in df.columns]
            if "label" in df.columns:
                df.rename(columns={"label": "true_label"}, inplace=True)
            if "prediction" in df.columns:
                df.rename(columns={"prediction": "predicted_probability"}, inplace=True)
            if "avg_prediction" in df.columns:
                df.rename(columns={"avg_prediction": "predicted_probability"}, inplace=True)

            # Store the standardised dataframe
            test_predictions[model][species] = df
            print(f"✅ Loaded test predictions for {species} - {model}")
        else:
            print(f"⚠️ Missing test predictions for {species} - {model}")


✅ Loaded test predictions for Bufo bufo - GLM_Lasso
✅ Loaded test predictions for Rana temporaria - GLM_Lasso
✅ Loaded test predictions for Lissotriton helveticus - GLM_Lasso
✅ Loaded test predictions for Bufo bufo - GLM_Ridge
✅ Loaded test predictions for Rana temporaria - GLM_Ridge
✅ Loaded test predictions for Lissotriton helveticus - GLM_Ridge
✅ Loaded test predictions for Bufo bufo - GAM
✅ Loaded test predictions for Rana temporaria - GAM
✅ Loaded test predictions for Lissotriton helveticus - GAM
✅ Loaded test predictions for Bufo bufo - RF
✅ Loaded test predictions for Rana temporaria - RF
✅ Loaded test predictions for Lissotriton helveticus - RF
✅ Loaded test predictions for Bufo bufo - XGBoost
✅ Loaded test predictions for Rana temporaria - XGBoost
✅ Loaded test predictions for Lissotriton helveticus - XGBoost
✅ Loaded test predictions for Bufo bufo - MaxEnt
✅ Loaded test predictions for Rana temporaria - MaxEnt
✅ Loaded test predictions for Lissotriton helveticus - MaxEnt


### 3.2 Compute Performance Metrics
- Calculate AUC-ROC, Precision, Recall, and F1-score for each model and species.

In [26]:
# Dictionary to store performance metrics
performance_metrics = {}

# Compute metrics for each species and model
for species in species_list:
    performance_metrics[species] = {}

    for model in models:
        if species in test_predictions[model]:  # Ensure predictions exist
            df = test_predictions[model][species]
            
            # Extract true labels and predicted probabilities
            y_true = df["true_label"]
            y_pred_prob = df["predicted_probability"]
            # Convert probabilities to binary predictions. A fixed 0.5 cut-off is applied to
            # every model here, although the saved GLM models carry "Threshold_0.3" in their file names.
            y_pred = (y_pred_prob >= 0.5).astype(int)

            # Compute performance metrics
            auc_roc = roc_auc_score(y_true, y_pred_prob)
            precision = precision_score(y_true, y_pred)
            recall = recall_score(y_true, y_pred)
            f1 = f1_score(y_true, y_pred)

            # Store results
            performance_metrics[species][model] = {
                "AUC-ROC": round(auc_roc, 3),
                "Precision": round(precision, 3),
                "Recall": round(recall, 3),
                "F1 Score": round(f1, 3)
            }
        else:
            print(f"⚠️ Skipping {species} - {model} (Missing test predictions)")

print("\n✅ Performance metrics calculated successfully!")



✅ Performance metrics calculated successfully!


### 3.3 Create a Summary Table
- Display model performances side by side.

In [27]:
# Convert the nested dictionary into a pandas DataFrame
summary_df = pd.DataFrame.from_dict(
    {
        (species, model): scores
        for species, model_scores in performance_metrics.items()
        for model, scores in model_scores.items()
    },
    orient="index"
)

# Reset index for better formatting
summary_df.index = pd.MultiIndex.from_tuples(summary_df.index, names=["Species", "Model"])
summary_df.reset_index(inplace=True)

# Display summary
print(summary_df.to_string(index=False))

## 4. Visualisation of Model Performance

### **Steps:**
#### 1. Plot ROC Curves:
- Visualise the trade-off between true positive and false positive rates for each model.
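
One possible way to draw the ROC curves is sketched below. It uses synthetic labels and probabilities for two placeholder models ("Model A" and "Model B" are hypothetical names); in the notebook, each model's true labels and predicted probabilities for a species would take their place. The `Agg` backend line is an assumption for running outside a notebook and can be dropped when plotting inline.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic stand-in data (not real model output)
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
demo_predictions = {
    "Model A": np.clip(y_true * 0.35 + rng.normal(0.35, 0.2, 200), 0, 1),
    "Model B": np.clip(y_true * 0.15 + rng.normal(0.45, 0.2, 200), 0, 1),
}

fig, ax = plt.subplots(figsize=(6, 5))
auc_scores = {}
for name, y_prob in demo_predictions.items():
    # roc_curve returns false positive rates, true positive rates, and thresholds
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    auc_scores[name] = roc_auc_score(y_true, y_prob)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {auc_scores[name]:.2f})")

# Diagonal reference line: the expected curve for random ranking
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curves (synthetic data)")
ax.legend(loc="lower right")
fig.tight_layout()
```

Drawing one figure per species with all models overlaid keeps the comparison within a species on a single set of axes.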

#### 2. Compare Performance Metrics:
- Create bar plots or tables to compare precision, recall, and F1-scores across models.
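
A grouped bar chart is one straightforward option for this comparison. The sketch below assumes the scores for one species have been arranged in a DataFrame with one row per model and one column per metric; the model names and values are hypothetical placeholders, and the `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical scores for a single species (illustration only)
demo_scores = pd.DataFrame(
    {
        "Precision": [0.70, 0.75, 0.80],
        "Recall": [0.65, 0.72, 0.78],
        "F1 Score": [0.67, 0.73, 0.79],
    },
    index=["Model A", "Model B", "Model C"],
)

# One group of bars per model, one bar per metric
ax = demo_scores.plot(kind="bar", figsize=(7, 4), rot=0)
ax.set_ylim(0, 1)  # all three metrics are bounded between 0 and 1
ax.set_ylabel("Score")
ax.set_title("Precision, recall and F1 by model (hypothetical values)")
ax.legend(title="Metric", loc="lower right")
ax.figure.tight_layout()
```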

### **Rationale:**
Visual representations facilitate intuitive comparisons and help identify models that perform well across various metrics.

## 5. Ensemble Modelling

### **Techniques to Explore:**

- **Weighted Averaging:** Combine the models' predicted probabilities, weighting each model in proportion to a chosen performance metric such as AUC-ROC.
- **Majority Voting:** Convert each model's probabilities to class labels, then predict the class chosen by the majority of models.
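
Weighted averaging can be sketched as follows. The function name `weighted_average_ensemble`, the model names, and all numbers are hypothetical; the sketch also assumes that every model's predictions refer to the same test rows in the same order, which would need to be verified before combining real outputs.

```python
import numpy as np

def weighted_average_ensemble(probabilities, scores):
    """Average per-model probabilities with weights proportional to `scores`.

    probabilities: dict mapping model name -> sequence of predicted probabilities
    scores: dict mapping model name -> performance score used as the weight
    """
    names = list(probabilities)
    weights = np.array([scores[name] for name in names], dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    stacked = np.vstack([np.asarray(probabilities[name], dtype=float) for name in names])
    return weights @ stacked  # weighted sum across models for each row

# Hypothetical probabilities and AUC-ROC scores (illustration only)
demo_probs = {
    "Model A": [0.9, 0.2, 0.6],
    "Model B": [0.7, 0.4, 0.5],
}
demo_auc = {"Model A": 0.8, "Model B": 0.6}

ensemble_prob = weighted_average_ensemble(demo_probs, demo_auc)
print(np.round(ensemble_prob, 3))
```

If the weights are derived from scores computed on the same test set that is later used to evaluate the ensemble, the ensemble's performance estimate is likely to be optimistic; a separate validation split is a common way to avoid this.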

### **Rationale:**
Ensemble methods aim to harness the strengths of multiple models, often resulting in improved predictive performance and reduced variance (Meller et al., 2014). 
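
Majority voting, the second technique listed above, can be illustrated with a short sketch. The function name `majority_vote`, the 0.5 threshold, and the example probabilities are assumptions made for illustration. Here a strict majority is required, so with an even number of models a tie resolves to the negative class.

```python
import numpy as np

def majority_vote(probabilities, threshold=0.5):
    """Return 1 where more than half of the models predict the positive class."""
    # Threshold each model's probabilities to obtain one row of 0/1 votes per model
    votes = np.vstack([
        (np.asarray(p, dtype=float) >= threshold).astype(int)
        for p in probabilities.values()
    ])
    n_models = votes.shape[0]
    # Strict majority: the number of positive votes must exceed half the models
    return (votes.sum(axis=0) * 2 > n_models).astype(int)

# Hypothetical probabilities from three models for four test rows
demo_probs = {
    "Model A": [0.9, 0.2, 0.6, 0.4],
    "Model B": [0.7, 0.4, 0.5, 0.6],
    "Model C": [0.3, 0.1, 0.7, 0.2],
}

ensemble_label = majority_vote(demo_probs)
print(ensemble_label)
```

Because the output is a class label rather than a probability, a majority-vote ensemble can be scored with precision, recall, and F1, but AUC-ROC would need a continuous score such as the fraction of positive votes.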

## 6. Evaluation of Ensemble Models

### **Steps:**
- **Compute Performance Metrics:** Assess the ensemble models using the same metrics as individual models (AUC-ROC, precision, recall, F1-score).
- **Compare to Individual Models:** Determine if the ensemble models outperform individual models.

### **Rationale:**
Evaluating ensemble models against individual models helps in understanding the added value of ensembling techniques.
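
The comparison could look roughly like the sketch below, which scores two placeholder models and their simple mean with the same four metrics. The `evaluate` helper, the model names, the synthetic data, and the 0.5 threshold are all assumptions for illustration; no claim is made here about which approach performs better on the real data.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the four metrics for one set of predicted probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "AUC-ROC": roc_auc_score(y_true, y_prob),
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
        "F1 Score": f1_score(y_true, y_pred, zero_division=0),
    }

# Synthetic stand-in data (not real model output)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)
model_probs = {
    "Model A": np.clip(0.3 * y_true + rng.normal(0.35, 0.25, 300), 0, 1),
    "Model B": np.clip(0.3 * y_true + rng.normal(0.35, 0.25, 300), 0, 1),
}

# Unweighted mean of the individual models as a simple ensemble
model_probs["Ensemble (mean)"] = np.mean(list(model_probs.values()), axis=0)

# Score every entry, individual models and ensemble alike, in the same way
comparison = {name: evaluate(y_true, prob) for name, prob in model_probs.items()}
for name, scores in comparison.items():
    print(name, {metric: round(float(value), 3) for metric, value in scores.items()})
```

Using one scoring function for both the individual models and the ensemble keeps the threshold and metric definitions identical, so any difference in the table reflects the predictions rather than the evaluation procedure.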

## 7. Conclusion and Next Steps


## References

Meller, L., Cabeza, M., Pironon, S., Barbet-Massin, M., Maiorano, L., Georges, D., & Thuiller, W. (2014). Ensemble distribution models in conservation prioritization: From consensus predictions to consensus reserve networks. *Diversity and Distributions*, 20(3), 309–321. https://doi.org/10.1111/ddi.12162