# Model Comparison and Ensemble Modelling

## 1. Introduction

### **Objective:**
To assess and compare the performance of five species distribution models (SDMs): GLM, GAM, RF, XGBoost, and MaxEnt, and to explore ensemble modelling techniques that may improve species distribution predictions.

### **Rationale:**
Comparing different models allows us to identify strengths and weaknesses in their predictive capabilities. Ensemble modelling combines multiple models to leverage their individual strengths, often leading to improved accuracy and robustness in predictions (Meller et al., 2014). 

In [1]:
import os
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    roc_auc_score, roc_curve, precision_recall_curve, 
    precision_score, recall_score, f1_score, confusion_matrix
)


## 2. Data Preparation

### **Steps:**

#### 1. Load Test Predictions:
- Import the test prediction results from each model for all species.

#### 2. Ensure Consistency:
- Verify that all datasets have consistent formatting, with aligned columns for true labels and predicted probabilities.

### **Rationale:**
Consistent data formatting is crucial for accurate performance evaluation and comparison across models.
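
The consistency check described above could be sketched as follows. This is a minimal illustration, not the workflow used in this notebook: the helper `standardise_predictions` and the input column names (`presence`, `prob_presence`) are hypothetical, and the output names `label` and `prediction` are an assumed convention.

```python
import pandas as pd


def standardise_predictions(df, label_col="label", prob_col="prediction"):
    """Return a two-column frame ("label", "prediction") after basic checks.

    The column names are assumptions; pass the names actually used in each
    model's prediction file.
    """
    missing = {label_col, prob_col} - set(df.columns)
    if missing:
        raise KeyError(f"Missing expected column(s): {sorted(missing)}")

    # Keep only the two columns of interest and give them common names
    out = df[[label_col, prob_col]].rename(
        columns={label_col: "label", prob_col: "prediction"}
    ).reset_index(drop=True)

    # Basic sanity checks on the values
    if out.isna().any().any():
        raise ValueError("Predictions contain missing values")
    if not out["label"].isin([0, 1]).all():
        raise ValueError("Labels must be coded as 0 (absence) or 1 (presence)")
    if not out["prediction"].between(0, 1).all():
        raise ValueError("Predicted probabilities must lie in [0, 1]")

    return out.astype({"label": int, "prediction": float})


# Toy example with hypothetical column names
raw = pd.DataFrame({"presence": [1, 0, 1], "prob_presence": [0.8, 0.2, 0.6]})
clean = standardise_predictions(raw, label_col="presence", prob_col="prob_presence")
print(clean)
```

Applying the same helper to every model's prediction table before computing metrics would make mismatched column names or out-of-range values fail loudly rather than silently.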

### 2.1 Loading GLM Files

In [2]:
import os
import joblib
import pandas as pd

# Define paths to final GLM models and metrics
final_glm_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GLM/"
model_dir = os.path.join(final_glm_dir, "Models")  # Path to final models

# Species list
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]
glm_versions = ["Lasso", "Ridge"]  # Regularised models

# Dictionary to store models and metrics
glm_models = {}
glm_metrics = {}

# Load GLM models and metrics
for species in species_list:
    glm_models[species] = {}
    glm_metrics[species] = {}

    for glm_type in glm_versions:
        model_path = os.path.join(model_dir, f"{species}_GLM_{glm_type}_Threshold_0.3_Model.pkl")
        metrics_path = os.path.join(final_glm_dir, f"{species}_Threshold_0.3_Metrics.txt")
        
        if os.path.exists(model_path):
            glm_models[species][glm_type] = joblib.load(model_path)
            print(f"Loaded {glm_type} model for {species}")
        else:
            print(f"⚠️ Warning: Model not found for {species} - {glm_type}")

        if os.path.exists(metrics_path):
            with open(metrics_path, "r") as f:
                glm_metrics[species][glm_type] = f.read()
            print(f"Loaded metrics for {species} - {glm_type}")
        else:
            print(f"⚠️ Warning: Metrics file not found for {species} - {glm_type}")

print("✅ Final GLM models and metrics loaded successfully!")


Loaded Lasso model for Bufo bufo
Loaded metrics for Bufo bufo - Lasso
Loaded Ridge model for Bufo bufo
Loaded metrics for Bufo bufo - Ridge
Loaded Lasso model for Rana temporaria
Loaded metrics for Rana temporaria - Lasso
Loaded Ridge model for Rana temporaria
Loaded metrics for Rana temporaria - Ridge
Loaded Lasso model for Lissotriton helveticus
Loaded metrics for Lissotriton helveticus - Lasso
Loaded Ridge model for Lissotriton helveticus
Loaded metrics for Lissotriton helveticus - Ridge
✅ Final GLM models and metrics loaded successfully!


### 2.2 Loading GAM Files

In [3]:
import os
import joblib
import pandas as pd

# Define base directory for final GAM models and metrics
gam_results_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GAM"

# Define species list (keeping spaces instead of converting to underscores)
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Dictionary to store loaded models and metrics
gam_models = {}
gam_metrics = {}

# Load models and metrics
for species in species_list:
    # Use original filenames without changing spaces to underscores
    model_path = os.path.join(gam_results_dir, f"{species}_GAM_Model_CV.pkl")
    metrics_path = os.path.join(gam_results_dir, f"{species}_GAM_Test_Metrics.txt")

    # Load the model
    if os.path.exists(model_path):
        gam_models[species] = joblib.load(model_path)
        print(f" Loaded GAM model for {species}")
    else:
        print(f"⚠️ GAM model file missing for {species}: {model_path}")

    # Load the metrics
    if os.path.exists(metrics_path):
        with open(metrics_path, "r") as f:
            gam_metrics[species] = f.read()
        print(f" Loaded GAM metrics for {species}")
    else:
        print(f"⚠️ GAM metrics file missing for {species}: {metrics_path}")

print("\n✅ Final GAM models and metrics loaded successfully!")


 Loaded GAM model for Bufo bufo
 Loaded GAM metrics for Bufo bufo
 Loaded GAM model for Rana temporaria
 Loaded GAM metrics for Rana temporaria
 Loaded GAM model for Lissotriton helveticus
 Loaded GAM metrics for Lissotriton helveticus

✅ Final GAM models and metrics loaded successfully!


### 2.3 Loading RF Files

In [4]:
import os
import joblib
import pandas as pd

# Define base directory where the models are stored
base_output_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/RandomForest"

# List of species to load models for
species_list = ["Bufo_bufo", "Rana_temporaria", "Lissotriton_helveticus"]

# Dictionary to store loaded models and metrics
rf_models = {}
rf_metrics = {}
rf_predictions = {}

for species in species_list:
    species_dir = os.path.join(base_output_dir, species)
    print(f"Loading Random Forest model and metrics for {species}...")
    
    # Load the final trained model
    model_path = os.path.join(species_dir, "RandomForest_Model.pkl")
    if os.path.exists(model_path):
        rf_models[species] = joblib.load(model_path)
        print(f"Loaded model for {species}")
    else:
        print(f"⚠️ Model file missing for {species}: {model_path}")
    
    # Load metrics
    metrics_path = os.path.join(species_dir, "RandomForest_Metrics.txt")
    if os.path.exists(metrics_path):
        with open(metrics_path, "r") as f:
            rf_metrics[species] = f.read()
        print(f"Loaded metrics for {species}")
    else:
        print(f"⚠️ Metrics file missing for {species}: {metrics_path}")
    
    # Load test predictions
    predictions_path = os.path.join(species_dir, "Test_Predictions.csv")
    if os.path.exists(predictions_path):
        rf_predictions[species] = pd.read_csv(predictions_path)
        print(f"Loaded test predictions for {species}")
    else:
        print(f"⚠️ Test predictions file missing for {species}: {predictions_path}")

print("\n✅ Random Forest models and metrics loaded successfully!")


Loading Random Forest model and metrics for Bufo_bufo...
Loaded model for Bufo_bufo
Loaded metrics for Bufo_bufo
Loaded test predictions for Bufo_bufo
Loading Random Forest model and metrics for Rana_temporaria...
Loaded model for Rana_temporaria
Loaded metrics for Rana_temporaria
Loaded test predictions for Rana_temporaria
Loading Random Forest model and metrics for Lissotriton_helveticus...
Loaded model for Lissotriton_helveticus
Loaded metrics for Lissotriton_helveticus
Loaded test predictions for Lissotriton_helveticus

✅ Random Forest models and metrics loaded successfully!


### 2.4 Loading XGBoost Files

In [5]:
import os
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Base directory for XGBoost models
base_output_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/results/Models"
xgb_output_dir = os.path.join(base_output_dir, "XGBoost")

# List of species
species_list = ["Bufo_bufo", "Rana_temporaria", "Lissotriton_helveticus"]

# Dictionary to store loaded models and metrics
xgb_models = {}
xgb_metrics = {}
xgb_test_predictions = {}

# Load models, metrics, and test predictions
for species in species_list:
    print(f"Loading XGBoost model and metrics for {species}...")

    species_output_dir = os.path.join(xgb_output_dir, species)
    
    # Load final model
    model_path = os.path.join(species_output_dir, "XGBoost_Model.pkl")
    xgb_models[species] = joblib.load(model_path)
    print(f"  Loaded model for {species}")

    # Load metrics
    metrics_path = os.path.join(species_output_dir, "XGBoost_Metrics.txt")
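    with open(metrics_path, "r") as f:
        metric_lines = f.readlines()
    # Parse "name: value" lines; skip blank lines or lines without a colon
    xgb_metrics[species] = {
        name.strip(): float(value.strip())
        for name, value in (line.split(":", 1) for line in metric_lines if ":" in line)
    }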
    print(f"  Loaded metrics for {species}")

    # Load test predictions
    test_predictions_path = os.path.join(species_output_dir, "Aggregated_Test_Predictions.csv")
    xgb_test_predictions[species] = pd.read_csv(test_predictions_path)
    print(f"  Loaded test predictions for {species}")

print("\n✅ XGBoost models and metrics loaded successfully!")


Loading XGBoost model and metrics for Bufo_bufo...
  Loaded model for Bufo_bufo
  Loaded metrics for Bufo_bufo
  Loaded test predictions for Bufo_bufo
Loading XGBoost model and metrics for Rana_temporaria...
  Loaded model for Rana_temporaria
  Loaded metrics for Rana_temporaria
  Loaded test predictions for Rana_temporaria
Loading XGBoost model and metrics for Lissotriton_helveticus...
  Loaded model for Lissotriton_helveticus
  Loaded metrics for Lissotriton_helveticus
  Loaded test predictions for Lissotriton_helveticus

✅ XGBoost models and metrics loaded successfully!


### 2.5 Loading MaxEnt Files

In [6]:
import os
import pandas as pd

# Define the correct directory path
base_output_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Maxent"

# Define file paths using os.path.join
maxent_files = {
    "Bufo bufo": {
        "test_predictions": os.path.join(base_output_dir, "Maxent_Bufo_bufo_TestPredictions.csv"),
        "model": os.path.join(base_output_dir, "Maxent_Bufo_bufo.rds"),
    },
    "Rana temporaria": {
        "test_predictions": os.path.join(base_output_dir, "Maxent_Rana_temporaria_TestPredictions.csv"),
        "model": os.path.join(base_output_dir, "Maxent_Rana_temporaria.rds"),
    },
    "Lissotriton helveticus": {
        "test_predictions": os.path.join(base_output_dir, "Maxent_Lissotriton_helveticus_TestPredictions.csv"),
        "model": os.path.join(base_output_dir, "Maxent_Lissotriton_helveticus.rds"),
    },
}

# Check if files exist before loading
for species, paths in maxent_files.items():
    if os.path.exists(paths["test_predictions"]):
        print(f"✅ Found: {paths['test_predictions']}")
    else:
        print(f"❌ Missing: {paths['test_predictions']}")

# Now try loading the files
maxent_test_predictions = {}
for species, paths in maxent_files.items():
    maxent_test_predictions[species] = pd.read_csv(paths["test_predictions"])

print("\n✅ MaxEnt test predictions loaded successfully!")

✅ Found: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Maxent\Maxent_Bufo_bufo_TestPredictions.csv
✅ Found: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Maxent\Maxent_Rana_temporaria_TestPredictions.csv
✅ Found: C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Maxent\Maxent_Lissotriton_helveticus_TestPredictions.csv

✅ MaxEnt test predictions loaded successfully!


In [7]:
maxent_evaluation_path = os.path.normpath(r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models\Maxent\Maxent_Model_Evaluation.csv")
maxent_evaluation_df = pd.read_csv(maxent_evaluation_path)
print("✅ MaxEnt model evaluation loaded successfully!")


✅ MaxEnt model evaluation loaded successfully!


### 2.6 Loading All Files and Metrics

In [8]:
import os
import joblib
import pandas as pd

# Base directories
base_dir = r"C:\GIS_Course\MScThesis-MaviSantarelli\results\Models"

# Define species list
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Dictionary to store loaded data
models = {
    "GLM": {}, "GAM": {}, "RF": {}, "XGBoost": {}, "MaxEnt": {}
}
metrics = {
    "GLM": {}, "GAM": {}, "RF": {}, "XGBoost": {}, "MaxEnt": {}
}
test_predictions = {
    "RF": {}, "XGBoost": {}, "MaxEnt": {}  # Only these have test predictions
}

# Load GLM Models & Metrics (No Test Predictions)
glm_versions = ["Lasso", "Ridge"]
glm_dir = os.path.join(base_dir, "Final_GLM")

for species in species_list:
    models["GLM"][species] = {}
    metrics["GLM"][species] = {}

    for glm_type in glm_versions:
        model_path = os.path.join(glm_dir, "Models", f"{species}_GLM_{glm_type}_Threshold_0.3_Model.pkl")
        metrics_path = os.path.join(glm_dir, f"{species}_Threshold_0.3_Metrics.txt")
        
        if os.path.exists(model_path):
            models["GLM"][species][glm_type] = joblib.load(model_path)
        
        if os.path.exists(metrics_path):
            with open(metrics_path, "r") as f:
                metrics["GLM"][species][glm_type] = f.read()

# Load GAM Models & Metrics (No Test Predictions)
gam_dir = os.path.join(base_dir, "Final_GAM")

for species in species_list:
    model_path = os.path.join(gam_dir, f"{species}_GAM_Model_CV.pkl")
    metrics_path = os.path.join(gam_dir, f"{species}_GAM_Test_Metrics.txt")
    
    if os.path.exists(model_path):
        models["GAM"][species] = joblib.load(model_path)
    
    if os.path.exists(metrics_path):
        with open(metrics_path, "r") as f:
            metrics["GAM"][species] = f.read()

# Load RF Models, Metrics, & Test Predictions
rf_dir = os.path.join(base_dir, "RandomForest")

for species in species_list:
    # RF output folders use underscores (e.g. "Bufo_bufo"), unlike species_list
    species_dir = os.path.join(rf_dir, species.replace(" ", "_"))
    model_path = os.path.join(species_dir, "RandomForest_Model.pkl")
    metrics_path = os.path.join(species_dir, "RandomForest_Metrics.txt")
    predictions_path = os.path.join(species_dir, "Test_Predictions.csv")

    if os.path.exists(model_path):
        models["RF"][species] = joblib.load(model_path)

    if os.path.exists(metrics_path):
        with open(metrics_path, "r") as f:
            metrics["RF"][species] = f.read()

    if os.path.exists(predictions_path):
        test_predictions["RF"][species] = pd.read_csv(predictions_path)

# Load XGBoost Models, Metrics, & Test Predictions
xgb_dir = os.path.join(base_dir, "XGBoost")

for species in species_list:
    # XGBoost output folders use underscores (e.g. "Bufo_bufo"), unlike species_list
    species_dir = os.path.join(xgb_dir, species.replace(" ", "_"))
    model_path = os.path.join(species_dir, "XGBoost_Model.pkl")
    metrics_path = os.path.join(species_dir, "XGBoost_Metrics.txt")
    predictions_path = os.path.join(species_dir, "Aggregated_Test_Predictions.csv")

    if os.path.exists(model_path):
        models["XGBoost"][species] = joblib.load(model_path)

    if os.path.exists(metrics_path):
        with open(metrics_path, "r") as f:
            metrics["XGBoost"][species] = f.read()

    if os.path.exists(predictions_path):
        test_predictions["XGBoost"][species] = pd.read_csv(predictions_path)

# Load MaxEnt Test Predictions & Model Evaluation (No Model File)
maxent_dir = os.path.join(base_dir, "Maxent")
maxent_files = {
    "Bufo bufo": "Maxent_Bufo_bufo_TestPredictions.csv",
    "Rana temporaria": "Maxent_Rana_temporaria_TestPredictions.csv",
    "Lissotriton helveticus": "Maxent_Lissotriton_helveticus_TestPredictions.csv"
}

for species, filename in maxent_files.items():
    predictions_path = os.path.join(maxent_dir, filename)
    
    if os.path.exists(predictions_path):
        test_predictions["MaxEnt"][species] = pd.read_csv(predictions_path)

maxent_metrics_path = os.path.join(maxent_dir, "Maxent_Model_Evaluation.csv")
if os.path.exists(maxent_metrics_path):
    metrics["MaxEnt"] = pd.read_csv(maxent_metrics_path)

print("\n✅ Models & Metrics Loaded Successfully!")


✅ Models & Metrics Loaded Successfully!


## 3. Model Performance Evaluation

### **Metrics Used in Evaluation:**

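- **Area Under the Receiver Operating Characteristic Curve (AUC-ROC):** Measures how well the model ranks presences above absences across all classification thresholds; 0.5 corresponds to random ranking and 1.0 to perfect discrimination.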
- **Confusion Matrix:** Provides a summary of prediction results, showing true positives, true negatives, false positives, and false negatives.
- **Precision, Recall, and F1-Score:** Evaluate the accuracy of positive predictions, the ability to find all positive instances, and the balance between precision and recall, respectively.
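
As a minimal sketch of how the metrics listed above can be computed with scikit-learn, the hypothetical helper `evaluate_predictions` below takes true labels and predicted probabilities. The 0.5 default threshold is an assumption for illustration (the GLM file names above refer to a threshold of 0.3), and the toy values are made up.

```python
import numpy as np
from sklearn.metrics import (
    roc_auc_score, confusion_matrix, precision_score, recall_score, f1_score
)


def evaluate_predictions(y_true, y_prob, threshold=0.5):
    """Compute AUC plus threshold-dependent metrics for one set of predictions."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)

    # Convert probabilities to presence/absence using the chosen threshold
    y_pred = (y_prob >= threshold).astype(int)

    # With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    return {
        "AUC": roc_auc_score(y_true, y_prob),  # threshold-independent
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
        "F1": f1_score(y_true, y_pred, zero_division=0),
        "TP": int(tp), "TN": int(tn), "FP": int(fp), "FN": int(fn),
    }


# Toy data (made up for illustration only)
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]
results = evaluate_predictions(y_true, y_prob, threshold=0.5)
print(results)
```

AUC is computed from the probabilities directly, whereas precision, recall, F1, and the confusion matrix all depend on the threshold, so the same threshold should be used when comparing models on those metrics.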

### **Rationale:**
These metrics offer a comprehensive view of each model's performance, highlighting different aspects of predictive accuracy and error rates.

### 3.1 Generating Missing Test Prediction Files

In [9]:
import pandas as pd

# Define file paths for test data only (partitioned data)
partitioned_test_files = {
    "Bufo bufo": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_GLM_subsampled_test.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_GAM_subsampled_test.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_Maxent_subsampled_test.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_RF_subsampled_run{i}_test.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_XGBoost_subsampled_run{i}_test.csv" for i in range(1, 11)]
    },
    "Rana temporaria": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_GLM_subsampled_test.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_GAM_subsampled_test.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_Maxent_subsampled_test.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_RF_subsampled_run{i}_test.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_XGBoost_subsampled_run{i}_test.csv" for i in range(1, 11)]
    },
    "Lissotriton helveticus": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_GLM_subsampled_test.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_GAM_subsampled_test.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_Maxent_subsampled_test.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_RF_subsampled_run{i}_test.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_XGBoost_subsampled_run{i}_test.csv" for i in range(1, 11)]
    }
}

# Load the test data for each species and model into a dictionary
loaded_test_data = {}

# The loop variable is named `model_files` so that it does not overwrite the
# `models` dictionary of fitted models created in the previous cell.
for species, model_files in partitioned_test_files.items():
    print(f"Loading test data for {species}...")
    loaded_test_data[species] = {}
    
    for model_name, file_paths in model_files.items():
        print(f"  Loading test data for model: {model_name}...")
        
        # Handle single-file models (GLM, GAM, Maxent)
        if isinstance(file_paths, str):  # Single file for test
            loaded_test_data[species][model_name] = pd.read_csv(file_paths)
        else:  # Handle iterative models (RF, XGBoost)
            loaded_test_data[species][model_name] = [pd.read_csv(file_path) for file_path in file_paths]

# Verify the structure of the loaded test data
for species, model_data in loaded_test_data.items():
    print(f"\nTest data loaded for {species}:")
    for model_name, data in model_data.items():
        if isinstance(data, list):
            print(f"  {model_name}: {len(data)} iterations of test data loaded")
        else:
            print(f"  {model_name}: Single test dataset loaded")


Loading test data for Bufo bufo...
  Loading test data for model: GLM...
  Loading test data for model: GAM...
  Loading test data for model: Maxent...
  Loading test data for model: RF...
  Loading test data for model: XGBoost...
Loading test data for Rana temporaria...
  Loading test data for model: GLM...
  Loading test data for model: GAM...
  Loading test data for model: Maxent...
  Loading test data for model: RF...
  Loading test data for model: XGBoost...
Loading test data for Lissotriton helveticus...
  Loading test data for model: GLM...
  Loading test data for model: GAM...
  Loading test data for model: Maxent...
  Loading test data for model: RF...
  Loading test data for model: XGBoost...

Test data loaded for Bufo bufo:
  GLM: Single test dataset loaded
  GAM: Single test dataset loaded
  Maxent: Single test dataset loaded
  RF: 10 iterations of test data loaded
  XGBoost: 10 iterations of test data loaded

Test data loaded for Rana temporaria:
  GLM: Single test dataset

In [13]:
def generate_predictions(model, test_df, output_path):
    """Run `model` on `test_df` and save labels and predictions to `output_path`."""
    print(f"⚙️ Running model on test data: {output_path}")

    if test_df.empty:
        print(f"⚠️ Test dataset is empty: {output_path}")
        return

    # Drop the response column so that only predictors are passed to the model
    test_features = test_df.drop(columns=["label"], errors="ignore")

    try:
        if hasattr(model, "predict_proba"):  # Model supports probability predictions
            predictions = np.asarray(model.predict_proba(test_features))
            if predictions.ndim == 2:
                predictions = predictions[:, 1]  # Keep the positive-class column
            # A 1D output is used as-is (assumed to be the positive-class probability)
        else:  # Model does NOT support probability predictions
            predictions = np.asarray(model.predict(test_features))

        # Named predictions_df to avoid confusion with the global test_predictions dict
        predictions_df = pd.DataFrame({
            "label": test_df["label"].to_numpy() if "label" in test_df.columns else np.nan,
            "prediction": predictions
        })
        predictions_df.to_csv(output_path, index=False)
        print(f"✅ Test predictions saved: {output_path}")

    except Exception as e:
        print(f"❌ Error generating predictions for {output_path}: {e}")

## 4. Visualisation of Model Performance

### **Steps:**
#### 1. Plot ROC Curves:
- Visualise the trade-off between true positive and false positive rates for each model.

#### 2. Compare Performance Metrics:
- Create bar plots or tables to compare precision, recall, and F1-scores across models.
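
The ROC plotting step above might look like the following sketch. The helper `plot_roc_curves` and the toy inputs are hypothetical; in practice the dictionary would hold each model's true labels and predicted probabilities for one species.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score


def plot_roc_curves(predictions_by_model, title="ROC curves"):
    """Plot one ROC curve per model from {model_name: (y_true, y_prob)}."""
    fig, ax = plt.subplots(figsize=(6, 6))

    for model_name, (y_true, y_prob) in predictions_by_model.items():
        fpr, tpr, _ = roc_curve(y_true, y_prob)
        auc = roc_auc_score(y_true, y_prob)
        ax.plot(fpr, tpr, label=f"{model_name} (AUC = {auc:.2f})")

    # Diagonal reference line: performance expected from random ranking
    ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Chance")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.set_title(title)
    ax.legend(loc="lower right")
    return fig, ax


# Toy inputs (made up for illustration only)
toy_predictions = {
    "Model A": ([1, 0, 1, 0], [0.9, 0.6, 0.4, 0.2]),
    "Model B": ([1, 0, 1, 0], [0.6, 0.3, 0.7, 0.5]),
}
fig, ax = plot_roc_curves(toy_predictions, title="Toy example")
```

Drawing all models on the same axes, one figure per species, keeps the curves directly comparable.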

### **Rationale:**
Visual representations facilitate intuitive comparisons and help identify models that perform well across various metrics.

## 5. Ensemble Modelling

### **Techniques to Explore:**

- **Weighted Averaging:** Combine model predictions by assigning weights proportional to their performance metrics.
- **Majority Voting:** For classification tasks, predict the class that receives the majority vote from individual models.
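
Both techniques can be sketched in a few lines of NumPy. The functions below are illustrative, not the implementation used in this project. They assume that every model's probability array refers to the same test sites in the same order, and the weights shown are arbitrary toy values rather than measured performance.

```python
import numpy as np


def weighted_average_ensemble(probabilities, weights):
    """Combine {model: probabilities} using {model: weight} (e.g. a test AUC)."""
    names = list(probabilities)
    # One column per model, one row per test site
    stacked = np.column_stack(
        [np.asarray(probabilities[name], dtype=float) for name in names]
    )
    w = np.array([weights[name] for name in names], dtype=float)
    # Normalise the weights so that they sum to 1, then average across models
    return stacked @ (w / w.sum())


def majority_vote_ensemble(probabilities, threshold=0.5):
    """Predict presence (1) where more than half of the models vote presence."""
    stacked = np.column_stack(
        [np.asarray(p, dtype=float) for p in probabilities.values()]
    )
    votes = (stacked >= threshold).sum(axis=1)
    return (votes > stacked.shape[1] / 2).astype(int)


# Toy probabilities for three sites (made up for illustration only)
toy_probs = {
    "RF": [0.9, 0.2, 0.6],
    "XGBoost": [0.8, 0.4, 0.4],
    "MaxEnt": [0.7, 0.1, 0.7],
}
toy_weights = {"RF": 2.0, "XGBoost": 1.0, "MaxEnt": 1.0}  # arbitrary weights

ensemble_prob = weighted_average_ensemble(toy_probs, toy_weights)
ensemble_vote = majority_vote_ensemble(toy_probs, threshold=0.5)
print(ensemble_prob)
print(ensemble_vote)
```

Weighted averaging returns a continuous probability, so it can still be evaluated with AUC-ROC. Majority voting returns only a presence/absence class, so it can be assessed with precision, recall, and F1-score but not with a ROC curve.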

### **Rationale:**
Ensemble methods aim to harness the strengths of multiple models, often resulting in improved predictive performance and reduced variance (Meller et al., 2014). 

## 6. Evaluation of Ensemble Models

### **Steps:**
- **Compute Performance Metrics:** Assess the ensemble models using the same metrics as individual models (AUC-ROC, precision, recall, F1-score).
- **Compare to Individual Models:** Determine if the ensemble models outperform individual models.
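
One way to set up this comparison is a single table with a row for each individual model and a row for the ensemble. The sketch below uses a hypothetical helper `compare_models`, an unweighted mean as the ensemble, an assumed 0.5 threshold, and made-up toy data; it again assumes that all models were scored on the same test sites.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, f1_score


def compare_models(y_true, probabilities, threshold=0.5):
    """Return a table of AUC and F1 for each entry in {name: probabilities}."""
    rows = []
    for name, y_prob in probabilities.items():
        y_prob = np.asarray(y_prob, dtype=float)
        y_pred = (y_prob >= threshold).astype(int)
        rows.append({
            "model": name,
            "AUC": roc_auc_score(y_true, y_prob),
            "F1": f1_score(y_true, y_pred, zero_division=0),
        })
    return pd.DataFrame(rows).set_index("model").sort_values("AUC", ascending=False)


# Toy data (made up for illustration only)
y_true = [1, 0, 1, 0]
toy_probs = {
    "Model A": [0.9, 0.6, 0.4, 0.2],
    "Model B": [0.6, 0.3, 0.7, 0.5],
}
# Add an unweighted mean ensemble alongside the individual models
toy_probs["Ensemble (mean)"] = np.mean(list(toy_probs.values()), axis=0)

comparison = compare_models(y_true, toy_probs)
print(comparison)
```

Because every row is computed from the same labels and the same threshold, differences in the table reflect the models rather than the evaluation setup.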

### **Rationale:**
Evaluating ensemble models against individual models helps in understanding the added value of ensembling techniques.

## 7. Conclusion and Next Steps


## References

Meller, L., Cabeza, M., Pironon, S., Barbet-Massin, M., Maiorano, L., Georges, D., & Thuiller, W. (2014). Ensemble distribution models in conservation prioritization: From consensus predictions to consensus reserve networks. *Diversity and Distributions*, 20(3), 309–321. https://doi.org/10.1111/ddi.12162