# Model Training

## Table of Contents
1. [Introduction and Objectives](#1.-Introduction-and-Objectives)
2. [Import Required Libraries and Datasets](#2.-Import-Required-Libraries-and-Datasets)
3. [Model Training](#3.-Model-Training)





## 1. Introduction and Objectives
### 1.1 Introduction
Species distribution models (SDMs) are a key tool for understanding the relationship between species occurrences and environmental factors. They provide a means to predict species distributions across geographic areas and under varying environmental conditions. This notebook focuses on the **model training and evaluation** step, which is a critical phase in building robust SDMs. By using subsampled datasets, tailored to specific model requirements, we aim to generate accurate and ecologically meaningful predictions.

The modelling process will include a combination of regression-based approaches (GLM and GAM), machine learning models (Random Forest and XGBoost), and a presence-only method (Maxent). Each model has unique strengths, making it suitable for capturing different aspects of species-environment interactions.

### 1.2 Objectives
##### 1. **Train Species Distribution Models**:
   - Utilise subsampled datasets prepared in the previous step for each species (*Bufo bufo*, *Rana temporaria*, and *Lissotriton helveticus*).
   - Implement models specific to each approach:
     - **Generalised Linear Models (GLM)**
     - **Generalised Additive Models (GAM)**
     - **Random Forest (RF)**
     - **XGBoost**
     - **Maxent**

##### 2. **Evaluate Model Performance**:
   - Assess model accuracy and predictive power using metrics such as:
     - Area Under the Curve (AUC)
     - Accuracy
     - Sensitivity and Specificity
     - Confusion Matrix

##### 3. **Incorporate Iterative Modelling for Machine Learning Approaches**:
   - Perform 10 iterations for Random Forest and XGBoost, averaging predictions to ensure model stability and reduce variability.

##### 4. **Save Results and Outputs**:
   - Save trained models, evaluation metrics, and predictions for further analysis.
   - Export visualisations such as variable importance plots and ROC curves.
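
Objective 3 (averaging predictions over repeated runs) could be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual pipeline: `make_classification` stands in for the per-run subsampled files, only the random seed changes between runs, and the names `run_probabilities`, `mean_probability` and `between_run_sd` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for one species' presence/pseudo-absence data
X, y = make_classification(n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

n_runs = 10
run_probabilities = []
for run in range(n_runs):
    # In the notebook each run would use its own subsampled file; here only the seed differs
    rf = RandomForestClassifier(n_estimators=50, random_state=run)
    rf.fit(X_train, y_train)
    run_probabilities.append(rf.predict_proba(X_eval)[:, 1])

# Stack to shape (n_runs, n_eval_samples), then average the presence probability per sample
run_probabilities = np.vstack(run_probabilities)
mean_probability = run_probabilities.mean(axis=0)
between_run_sd = run_probabilities.std(axis=0)
print(f"Mean between-run SD of predicted probability: {between_run_sd.mean():.3f}")
```

The between-run standard deviation gives a rough indication of how stable the averaged prediction is; a large spread would suggest that more runs or larger subsamples are needed.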

### 1.3 Expected Outcome
By the end of this notebook:
- Robust models will be trained for each species and model type.
- Performance metrics will provide insights into the predictive capacity of each approach.
- Outputs will form the basis for ecological interpretation, spatial predictions, and conservation recommendations.

This phase builds upon the carefully prepared datasets, ensuring that the models align with ecological principles and established methodologies in SDMs.

## 2. Import Required Libraries and Datasets

### 2.1 Import Libraries

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report, roc_curve
from pygam import GAM, s, f
import matplotlib.pyplot as plt
import os
import joblib  # For saving the model

### 2.2 Load Train Data

In [23]:
import pandas as pd

# Define file paths for training data only (partitioned data)
partitioned_train_files = {
    "Bufo bufo": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_GLM_subsampled_train.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_GAM_subsampled_train.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_Maxent_subsampled_train.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_RF_subsampled_run{i}_train.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_XGBoost_subsampled_run{i}_train.csv" for i in range(1, 11)]
    },
    "Rana temporaria": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_GLM_subsampled_train.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_GAM_subsampled_train.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_Maxent_subsampled_train.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_RF_subsampled_run{i}_train.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_XGBoost_subsampled_run{i}_train.csv" for i in range(1, 11)]
    },
    "Lissotriton helveticus": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_GLM_subsampled_train.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_GAM_subsampled_train.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_Maxent_subsampled_train.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_RF_subsampled_run{i}_train.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_XGBoost_subsampled_run{i}_train.csv" for i in range(1, 11)]
    }
}

# Load the training data for each species and model into a dictionary
loaded_train_data = {}

for species, models in partitioned_train_files.items():
    print(f"Loading training data for {species}...")
    loaded_train_data[species] = {}
    
    for model_name, file_paths in models.items():
        print(f"  Loading training data for model: {model_name}...")
        
        # Handle single-file models (GLM, GAM, Maxent)
        if isinstance(file_paths, str):  # Single file for training
            loaded_train_data[species][model_name] = pd.read_csv(file_paths)
        else:  # Handle iterative models (RF, XGBoost)
            loaded_train_data[species][model_name] = [pd.read_csv(file_path) for file_path in file_paths]

# Verify the structure of the loaded training data
for species, models in loaded_train_data.items():
    print(f"\nTraining data loaded for {species}:")
    for model_name, data in models.items():
        if isinstance(data, list):
            print(f"  {model_name}: {len(data)} iterations of training data loaded")
        else:
            print(f"  {model_name}: Single training dataset loaded")


Loading training data for Bufo bufo...
  Loading training data for model: GLM...
  Loading training data for model: GAM...
  Loading training data for model: Maxent...
  Loading training data for model: RF...
  Loading training data for model: XGBoost...
Loading training data for Rana temporaria...
  Loading training data for model: GLM...
  Loading training data for model: GAM...
  Loading training data for model: Maxent...
  Loading training data for model: RF...
  Loading training data for model: XGBoost...
Loading training data for Lissotriton helveticus...
  Loading training data for model: GLM...
  Loading training data for model: GAM...
  Loading training data for model: Maxent...
  Loading training data for model: RF...
  Loading training data for model: XGBoost...

Training data loaded for Bufo bufo:
  GLM: Single training dataset loaded
  GAM: Single training dataset loaded
  Maxent: Single training dataset loaded
  RF: 10 iterations of training data loaded
  XGBoost: 10 iter

### 2.3 Load Test Data

In [24]:
import pandas as pd

# Define file paths for test data only (partitioned data)
partitioned_test_files = {
    "Bufo bufo": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_GLM_subsampled_test.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_GAM_subsampled_test.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_Maxent_subsampled_test.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_RF_subsampled_run{i}_test.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_XGBoost_subsampled_run{i}_test.csv" for i in range(1, 11)]
    },
    "Rana temporaria": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_GLM_subsampled_test.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_GAM_subsampled_test.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_Maxent_subsampled_test.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_RF_subsampled_run{i}_test.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_XGBoost_subsampled_run{i}_test.csv" for i in range(1, 11)]
    },
    "Lissotriton helveticus": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_GLM_subsampled_test.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_GAM_subsampled_test.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_Maxent_subsampled_test.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_RF_subsampled_run{i}_test.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_XGBoost_subsampled_run{i}_test.csv" for i in range(1, 11)]
    }
}

# Load the test data for each species and model into a dictionary
loaded_test_data = {}

for species, models in partitioned_test_files.items():
    print(f"Loading test data for {species}...")
    loaded_test_data[species] = {}
    
    for model_name, file_paths in models.items():
        print(f"  Loading test data for model: {model_name}...")
        
        # Handle single-file models (GLM, GAM, Maxent)
        if isinstance(file_paths, str):  # Single file for test
            loaded_test_data[species][model_name] = pd.read_csv(file_paths)
        else:  # Handle iterative models (RF, XGBoost)
            loaded_test_data[species][model_name] = [pd.read_csv(file_path) for file_path in file_paths]

# Verify the structure of the loaded test data
for species, models in loaded_test_data.items():
    print(f"\nTest data loaded for {species}:")
    for model_name, data in models.items():
        if isinstance(data, list):
            print(f"  {model_name}: {len(data)} iterations of test data loaded")
        else:
            print(f"  {model_name}: Single test dataset loaded")


Loading test data for Bufo bufo...
  Loading test data for model: GLM...
  Loading test data for model: GAM...
  Loading test data for model: Maxent...
  Loading test data for model: RF...
  Loading test data for model: XGBoost...
Loading test data for Rana temporaria...
  Loading test data for model: GLM...
  Loading test data for model: GAM...
  Loading test data for model: Maxent...
  Loading test data for model: RF...
  Loading test data for model: XGBoost...
Loading test data for Lissotriton helveticus...
  Loading test data for model: GLM...
  Loading test data for model: GAM...
  Loading test data for model: Maxent...
  Loading test data for model: RF...
  Loading test data for model: XGBoost...

Test data loaded for Bufo bufo:
  GLM: Single test dataset loaded
  GAM: Single test dataset loaded
  Maxent: Single test dataset loaded
  RF: 10 iterations of test data loaded
  XGBoost: 10 iterations of test data loaded

Test data loaded for Rana temporaria:
  GLM: Single test dataset

## 3. Model Training

### **3.1 GLM Model Training**

In [25]:
import os
import joblib
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report, roc_curve

# Directory to save results
output_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM"
os.makedirs(output_dir, exist_ok=True)

# Loop through each species for GLM training
for species in loaded_train_data.keys():
    print(f"Training GLM for {species}...")

    # Get the GLM training dataset for the species
    data = loaded_train_data[species]["GLM"]
    X = data.drop(columns=["label"])  # Predictors
    y = data["label"]  # Response variable (presence/absence)

    # Split the GLM training data into fitting (70%) and hold-out (30%) subsets;
    # the separately loaded `loaded_test_data` is not used in this cell
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train the GLM (Logistic Regression)
    glm = LogisticRegression(max_iter=1000, random_state=42)
    glm.fit(X_train, y_train)

    # Evaluate the model on the 30% hold-out subset
    y_pred = glm.predict(X_test)
    y_pred_prob = glm.predict_proba(X_test)[:, 1]

    # Calculate evaluation metrics
    roc_auc = roc_auc_score(y_test, y_pred_prob)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    print(f"  ROC-AUC for {species}: {roc_auc:.3f}")
    print(f"  Confusion Matrix:\n{conf_matrix}")
    print(f"  Classification Report:\n{class_report}")

    # Plot the ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
    plt.figure()
    plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {roc_auc:.3f})")
    plt.plot([0, 1], [0, 1], 'k--', label="Random Guess")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title(f"ROC Curve for {species} (GLM)")
    plt.legend()
    plt.savefig(f"{output_dir}/{species}_GLM_ROC_Curve.png")
    plt.close()

    # Save the model
    model_path = f"{output_dir}/{species}_GLM_Model.pkl"
    joblib.dump(glm, model_path)
    print(f"  Model saved for {species} at {model_path}")

    # Save evaluation metrics
    metrics_path = f"{output_dir}/{species}_GLM_Metrics.txt"
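    # Use a descriptive handle name so the pygam `f` term imported earlier is not overwritten
    with open(metrics_path, "w") as metrics_file:
        metrics_file.write(f"ROC-AUC: {roc_auc:.3f}\n")
        metrics_file.write(f"Confusion Matrix:\n{conf_matrix}\n")
        metrics_file.write(f"Classification Report:\n{class_report}\n")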
    print(f"  Metrics saved for {species} at {metrics_path}")

print("GLM training and evaluation complete!")


Training GLM for Bufo bufo...
  ROC-AUC for Bufo bufo: 0.828
  Confusion Matrix:
[[1178    8]
 [ 135    6]]
  Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.99      0.94      1186
           1       0.43      0.04      0.08       141

    accuracy                           0.89      1327
   macro avg       0.66      0.52      0.51      1327
weighted avg       0.85      0.89      0.85      1327

  Model saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM/Bufo bufo_GLM_Model.pkl
  Metrics saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM/Bufo bufo_GLM_Metrics.txt
Training GLM for Rana temporaria...
  ROC-AUC for Rana temporaria: 0.820
  Confusion Matrix:
[[2465   23]
 [ 209   36]]
  Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.99      0.96      2488
           1       0.61      0.15      0.24       245

    accuracy   

## **Summary of GLM Results**

### Bufo bufo
- **ROC-AUC**: 0.828 (Good discriminatory power)
- **Precision (label=1)**: 0.43
- **Recall (label=1)**: 0.04 (Low, indicating a high number of false negatives)
- **Overall Accuracy**: 0.89
- **Notes**: The model performs well in predicting absences but struggles to accurately predict presences, leading to low sensitivity. Adjustments may be necessary to improve recall.

### Rana temporaria
- **ROC-AUC**: 0.820 (Good discriminatory power)
- **Precision (label=1)**: 0.61
- **Recall (label=1)**: 0.15 (Low recall, but better than *Bufo bufo*)
- **Overall Accuracy**: 0.92
- **Notes**: The model shows strong performance but is still biased towards predicting absences. Improvements to increase sensitivity to presences could further enhance results.

### Lissotriton helveticus
- **ROC-AUC**: 0.809 (Good discriminatory power)
- **Precision (label=1)**: 1.00
- **Recall (label=1)**: 0.01 (Very low recall)
- **Overall Accuracy**: 0.90
- **Notes**: Similar to *Bufo bufo*, the model is effective at predicting absences but struggles significantly to identify presences. This highlights a strong class imbalance in predictions.

## Overall Observations
- **Strengths**:
  - All models show good overall discriminatory power (ROC-AUC > 0.8).
  - High accuracy, primarily driven by correct absence predictions.
- **Weaknesses**:
  - All models exhibit low recall for presences, indicating challenges in correctly identifying presence points.
  - Imbalanced datasets may have influenced these results, leading to models biased towards absences.
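
To make the link between the confusion matrix and the reported metrics explicit, the short sketch below recomputes sensitivity, specificity and accuracy from the *Bufo bufo* confusion matrix printed above. It assumes scikit-learn's layout (rows are true labels, columns are predicted labels, label 0 first); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above for Bufo bufo (rows: true 0/1, columns: predicted 0/1)
conf_matrix = np.array([[1178, 8],
                        [135, 6]])
tn, fp, fn, tp = conf_matrix.ravel()

sensitivity = tp / (tp + fn)              # recall for presences (label 1)
specificity = tn / (tn + fp)              # recall for absences (label 0)
accuracy = (tp + tn) / conf_matrix.sum()  # share of all points classified correctly

print(f"Sensitivity: {sensitivity:.3f}")  # 0.043
print(f"Specificity: {specificity:.3f}")  # 0.993
print(f"Accuracy:    {accuracy:.3f}")     # 0.892
```

Only 6 of 141 presences are recovered, while 1178 of 1186 absences are, which is why accuracy stays high even though sensitivity is close to zero.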

## **Recommendations for Improvement**

- **Address Class Imbalance**:
  - Oversample presences or undersample absences to balance the dataset.
- **Feature Analysis**:
  - Evaluate feature importance to identify and remove less relevant predictors.
- **Model Refinements**:
  - Perform hyperparameter tuning for the logistic regression model (e.g., penalty type, solver).
- **Iterate on Metrics**:
  - Focus on improving recall and F1-score for presence predictions to create more balanced models.
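
As one possible way to approach the hyperparameter tuning recommendation, the sketch below runs a small cross-validated grid search over the regularisation strength `C`. It is an illustration under stated assumptions: the data are synthetic (`make_classification`), the grid values are arbitrary, only `C` is searched (penalty type and solver are left at their defaults), and F1 for the presence class is used as the selection metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for a presence/pseudo-absence dataset
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0)

# Hypothetical grid; smaller C means stronger regularisation
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Stratified folds keep the presence/absence ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    param_grid=param_grid,
    scoring="f1",   # F1 of the positive (presence) class
    cv=cv,
)
search.fit(X, y)

print("Best C:", search.best_params_["C"])
print(f"Best cross-validated F1 (presence class): {search.best_score_:.3f}")
```

Selecting on F1 rather than accuracy keeps the search focused on presences, which is where the untuned models are weakest.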


---

### Step 1: Address Class Imbalance
Use `class_weight='balanced'` to improve recall for the minority class. This will ensure the presences are given more importance during training.

#### **Rationale**:

#### 1. Dynamic Weight Assignment
The `class_weight='balanced'` parameter dynamically assigns weights to classes (presence and pseudo-absence) based on their frequency in the training data. This approach ensures that the minority class (presence) is given more influence during training, effectively improving recall for presences while maintaining a balance between the contributions of pseudo-absences and presences.
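
For reference, scikit-learn documents the balanced weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below checks this on a hypothetical label vector (900 pseudo-absences, 100 presences); the counts are made up for illustration and are not taken from the thesis data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector: 900 pseudo-absences (0) and 100 presences (1)
y = np.array([0] * 900 + [1] * 100)
classes = np.array([0, 1])

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same values from the documented formula
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

With a 9:1 imbalance, each presence carries nine times the weight of a pseudo-absence in the loss, so both classes contribute equally in total.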

#### 2. Reducing Bias Towards the Majority Class
Without class weighting, logistic regression models tend to focus disproportionately on the majority class (pseudo-absences), which dominates the dataset. This imbalance leads to poor recall for presences and a higher false-negative rate. With `class_weight='balanced'`, the loss function is reweighted to compensate for the unequal class distribution, which improves the model's ability to detect presences (recall), typically at the cost of more false positives.

#### 3. Compatibility with Logistic Regression
Logistic regression, as a linear model, can face challenges with imbalanced datasets, often underperforming on the minority class. The `class_weight='balanced'` parameter is designed to address this limitation by ensuring that the imbalance in class frequencies does not overly influence the model's decision boundary, resulting in a more robust and fairer classification.

#### 4. Alignment with Pseudo-Absence Generation Strategy
The pseudo-absence generation methodology, which incorporates ecological buffers, already reduces potential biases in the absence data. Applying class weighting further complements this strategy by addressing statistical imbalances between presences and pseudo-absences. This combined approach ensures that the logistic regression model is optimised for the specific characteristics of the dataset without introducing additional ecological or statistical biases.


In [27]:
import os
import joblib
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report, roc_curve

# Directory to save results
output_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM"
os.makedirs(output_dir, exist_ok=True)

# Loop through each species for GLM training
for species in loaded_train_data.keys():
    print(f"Training GLM for {species}...")

    # Get the GLM training dataset for the species
    data = loaded_train_data[species]["GLM"]
    X = data.drop(columns=["label"])  # Predictors
    y = data["label"]  # Response variable (presence/absence)

    # Split the GLM training data into fitting (70%) and hold-out (30%) subsets;
    # the separately loaded `loaded_test_data` is not used in this cell
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train the GLM with class_weight='balanced' to address class imbalance
    glm = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
    glm.fit(X_train, y_train)

    # Evaluate the model on the 30% hold-out subset
    y_pred = glm.predict(X_test)
    y_pred_prob = glm.predict_proba(X_test)[:, 1]

    # Calculate evaluation metrics
    roc_auc = roc_auc_score(y_test, y_pred_prob)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    print(f"  ROC-AUC for {species}: {roc_auc:.3f}")
    print(f"  Confusion Matrix:\n{conf_matrix}")
    print(f"  Classification Report:\n{class_report}")

    # Plot the ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
    plt.figure()
    plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {roc_auc:.3f})")
    plt.plot([0, 1], [0, 1], 'k--', label="Random Guess")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title(f"ROC Curve for {species} (GLM)")
    plt.legend()
    plt.savefig(f"{output_dir}/{species}_GLM_ROC_Curve.png")
    plt.close()

    # Save the model
    model_path = f"{output_dir}/{species}_GLM_Model.pkl"
    joblib.dump(glm, model_path)
    print(f"  Model saved for {species} at {model_path}")

    # Save evaluation metrics
    metrics_path = f"{output_dir}/{species}_GLM_Metrics.txt"
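    # Use a descriptive handle name so the pygam `f` term imported earlier is not overwritten
    with open(metrics_path, "w") as metrics_file:
        metrics_file.write(f"ROC-AUC: {roc_auc:.3f}\n")
        metrics_file.write(f"Confusion Matrix:\n{conf_matrix}\n")
        metrics_file.write(f"Classification Report:\n{class_report}\n")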
    print(f"  Metrics saved for {species} at {metrics_path}")

print("GLM training and evaluation with class imbalance correction complete!")


Training GLM for Bufo bufo...
  ROC-AUC for Bufo bufo: 0.829
  Confusion Matrix:
[[859 327]
 [ 30 111]]
  Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.72      0.83      1186
           1       0.25      0.79      0.38       141

    accuracy                           0.73      1327
   macro avg       0.61      0.76      0.61      1327
weighted avg       0.89      0.73      0.78      1327

  Model saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM/Bufo bufo_GLM_Model.pkl
  Metrics saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM/Bufo bufo_GLM_Metrics.txt
Training GLM for Rana temporaria...
  ROC-AUC for Rana temporaria: 0.819
  Confusion Matrix:
[[1994  494]
 [  73  172]]
  Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.80      0.88      2488
           1       0.26      0.70      0.38       245

    accuracy       

## **Summary of GLM Results and Observations**

### Overview
The Generalised Linear Models (GLM) were trained and evaluated for each species using class imbalance correction (`class_weight='balanced'`). Recall for the minority class (presences) improved substantially, but precision for presences remains low: the models now detect most presences at the cost of a large number of false positives.

### General Observations
##### 1. **Improved Recall**:
 - The class imbalance adjustment significantly improved recall for presences, achieving the following recall scores:
   - **Bufo bufo**: 0.79
   - **Rana temporaria**: 0.70
   - **Lissotriton helveticus**: 0.73
 - Accurate identification of presences is crucial for species distribution modelling, and the increased recall helps meet this ecological objective.

##### 2. **Low Precision for Presences**:
 - All models exhibit low precision for presences, indicating a high rate of false positives. Specifically:
   - **Bufo bufo**: Precision for presences = 0.25
   - **Rana temporaria**: Precision for presences = 0.26
   - **Lissotriton helveticus**: Precision for presences = 0.24
 - This means the models overestimate the presence of species, which could lead to an inflated estimate of the species' potential range.

##### 3. **Strong ROC-AUC Scores**:
 - The models consistently achieved **good ROC-AUC scores**:
   - **Bufo bufo**: 0.829
   - **Rana temporaria**: 0.819
   - **Lissotriton helveticus**: 0.809
 - These scores demonstrate strong overall performance, with the models being effective in distinguishing between presence and absence.

### Required Adjustments
To further refine the models and improve their performance, the following steps will be performed:

##### 1. **Threshold Adjustment**:
 - Adjust the decision threshold (default is 0.5) to better balance precision and recall for presences.
 - Generate **precision-recall curves** to identify the optimal threshold for minimising false positives while maintaining good recall.

##### 2. **Hyperparameter Tuning**:
 - Experiment with the **regularisation strength (`C`)** in logistic regression to further optimise performance, particularly for improving precision.
 
##### 3. **Feature Selection**:
 - Reassess the **predictors** used in the model and exclude less informative variables to reduce noise and improve model robustness.

##### 4. **Precision-Recall Trade-Offs**:
 - Evaluate precision-recall trade-offs in ecological applications, where false positives and false negatives may have different ecological consequences. This will help identify an acceptable trade-off between precision and recall.
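
The threshold adjustment step could be approached along the following lines. This is a hedged sketch on synthetic data, not the procedure used in the thesis: `make_classification` replaces the species data, and maximising F1 is only one possible selection rule (an ecological application might instead fix a minimum recall and maximise precision subject to it).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for one species' dataset
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

glm = LogisticRegression(max_iter=1000, class_weight="balanced")
glm.fit(X_train, y_train)
prob_presence = glm.predict_proba(X_eval)[:, 1]

# precision and recall have one more element than thresholds; drop the final (1, 0) point
precision, recall, thresholds = precision_recall_curve(y_eval, prob_presence)
precision, recall = precision[:-1], recall[:-1]

# F1 at each candidate threshold (set to 0 where precision + recall is 0)
denom = precision + recall
f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(denom), where=denom > 0)

best_threshold = thresholds[np.argmax(f1)]
y_pred = (prob_presence >= best_threshold).astype(int)
print(f"Threshold maximising F1: {best_threshold:.3f} (F1 = {f1.max():.3f})")
```

To avoid optimistic estimates, the threshold would normally be chosen on a validation subset and then applied unchanged to the final test data.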


---

### Step 2: Addressing multicollinearity

After class imbalance, multicollinearity is the next issue to address, especially for models such as logistic regression (GLM) whose coefficients are sensitive to highly correlated predictors:

1. **Class imbalance correction** ensures that presences aren't overshadowed by pseudo-absences, while multicollinearity checks ensure that the predictors are not redundant and remain interpretable.
2. **Multicollinearity** can cause instability in the model coefficients, leading to unreliable predictions, even after addressing class imbalances.
3. **Redundant predictors** can reduce model generalisability, which is critical for this study's aim of reliable species distribution predictions.
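
Pairwise correlations are one diagnostic; variance inflation factors (VIFs) are a complementary one. The sketch below uses the identity that the VIF of predictor *j* equals the *j*-th diagonal element of the inverse of the predictor correlation matrix. The data and column names (`traffic`, `nox`, `slope`) are synthetic placeholders, not the thesis rasters.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
traffic = rng.normal(size=n)
predictors = pd.DataFrame({
    "traffic": traffic,
    "nox": 0.9 * traffic + 0.3 * rng.normal(size=n),  # deliberately correlated with traffic
    "slope": rng.normal(size=n),                      # independent of the other two
})

# VIF_j is the j-th diagonal element of the inverse correlation matrix
corr = predictors.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=predictors.columns, name="VIF")
print(vif.round(2))
```

A VIF near 1 indicates little shared variance with the other predictors; commonly quoted rules of thumb treat values above roughly 5 to 10 as a sign of problematic collinearity.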

In [28]:
# Compute correlation matrix
# Note: X_train is the split left over from the last species processed in the loop above
correlation_matrix = X_train.corr()

# Identify highly correlated features
threshold = 0.75  # Define a threshold for high correlation
high_corr_pairs = correlation_matrix.abs().unstack().sort_values(ascending=False)
high_corr_pairs = high_corr_pairs[high_corr_pairs >= threshold]
print("Highly Correlated Feature Pairs:\n", high_corr_pairs)


Highly Correlated Feature Pairs:
 C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif                 C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif                   1.00000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif                        C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif                          1.00000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_StDev.tif                                         C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_StDev.tif                                           1.00000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_median.tif                                        C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_median.tif                                          1.0000

---

### **Key Observations**
### 1. Perfect Correlation (Correlation = 1):
- Predictors like `Building_Density_Reversed.tif`, `DistWater_Reversed.tif`, `VegHeight.tif`, etc., are perfectly correlated with themselves.
- This is expected because a feature is always perfectly correlated with itself. These rows can be ignored.

### 2. High Correlation Between Different Predictors:
- Example: `Traffic_Reversed.tif` and `NOx_Stand_Reversed.tif` have a correlation of 0.898, suggesting they are highly redundant.
- High correlations between different predictors indicate potential multicollinearity, which can destabilise the logistic regression model.

In [29]:
high_corr_pairs = high_corr_pairs[high_corr_pairs.index.get_level_values(0) != high_corr_pairs.index.get_level_values(1)]
print("Highly Correlated Predictor Pairs (Excluding Self-Correlation):\n", high_corr_pairs)


Highly Correlated Predictor Pairs (Excluding Self-Correlation):
 C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif    C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif    0.89834
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif  C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif      0.89834
dtype: float64


The refined results indicate a high correlation (**0.898**) between `Traffic_Reversed.tif` and `NOx_Stand_Reversed.tif`. This redundancy should be addressed to improve the logistic regression model and avoid multicollinearity issues.
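
The filtered output still lists each pair twice (A with B, then B with A). One way to report each pair only once is to keep the strict upper triangle of the correlation matrix before filtering. The sketch below shows the idea on synthetic data; `X_demo` and its column names are hypothetical stand-ins for `X_train`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=300)
X_demo = pd.DataFrame({
    "traffic": base,
    "nox": base + 0.4 * rng.normal(size=300),  # strongly correlated with traffic
    "ndvi": rng.normal(size=300),              # unrelated predictor
})

threshold = 0.75
corr = X_demo.corr().abs()

# Keep only the strict upper triangle: drops self-correlations and mirrored duplicates
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
unique_pairs = corr.where(upper_mask).stack()

# The comparison also discards any NaN entries left over from the masked triangle
unique_pairs = unique_pairs[unique_pairs >= threshold].sort_values(ascending=False)
print(unique_pairs)
```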

In [30]:
# Identify the predictor to retain and the one to drop
predictor_to_retain = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif"
predictor_to_drop = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif"

# Drop the less relevant predictor (NOx emissions) from the dataset
X_train = X_train.drop(columns=[predictor_to_drop], errors='ignore')
X_test = X_test.drop(columns=[predictor_to_drop], errors='ignore')

print(f"Predictor '{predictor_to_drop}' has been removed from the dataset.")
print(f"Retained Predictor: '{predictor_to_retain}'")

# Verify the predictors in the dataset after removal
print("\nUpdated Predictors in X_train:")
print(X_train.columns)

print("\nUpdated Predictors in X_test:")
print(X_test.columns)


Predictor 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif' has been removed from the dataset.
Retained Predictor: 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif'

Updated Predictors in X_train:
Index(['C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/RGS_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Slope_Proj_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/SoilMoisture_32bit_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/I

### **Rationale for Predictor Selection**

During the refinement of the Generalised Linear Model (GLM) for species distribution, high multicollinearity was identified between two traffic-related predictors: **traffic intensity** and **nitrogen oxide (NOx) emissions**. To enhance model stability and interpretability, it was necessary to select the most ecologically relevant predictor.

#### Ecological Significance of Road-Related Mortality
Amphibians are particularly vulnerable to road-induced mortality during migration and dispersal phases. The physical barriers imposed by roads and vehicular traffic result in significant population declines and habitat fragmentation. This phenomenon has been well-documented in ecological studies, emphasising the critical impact of roads on amphibian survival (Glista et al., 2008).

#### Addressing Multicollinearity in Predictors
Multicollinearity between predictors can lead to unstable coefficient estimates, inflated standard errors, and diminished predictive accuracy in regression models. Retaining highly correlated predictors can introduce redundancy and reduce the reliability of model outputs. To ensure robust parameter estimation, it is essential to address multicollinearity (Graham, 2003).
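
One common way to quantify multicollinearity is the variance inflation factor (VIF), where each predictor is regressed on all the others and VIF = 1 / (1 − R²). The sketch below is illustrative only: the helper `variance_inflation_factors` and the synthetic `traffic`, `nox` and `slope` columns are assumptions made for the example, not the thesis data or the exact diagnostic used earlier in this notebook. As a rule of thumb, values above roughly 5 to 10 are often treated as a sign of problematic collinearity.

```python
import numpy as np
import pandas as pd


def variance_inflation_factors(X: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    values = X.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(X.columns):
        target = values[:, j]
        others = np.delete(values, j, axis=1)
        # Add an intercept column so that R^2 is measured around the mean.
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs[name] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")


# Synthetic stand-ins for the predictor rasters (hypothetical values).
rng = np.random.default_rng(42)
traffic = rng.normal(size=500)
demo = pd.DataFrame({
    "traffic": traffic,
    "nox": 0.95 * traffic + rng.normal(scale=0.2, size=500),  # strongly tied to traffic
    "slope": rng.normal(size=500),                             # independent of both
})

vif = variance_inflation_factors(demo)
print(vif.round(2))
```

In this synthetic case the two correlated columns receive high VIFs while the independent column stays close to 1, which is the pattern that would motivate dropping one of the correlated pair.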

#### Selection of Traffic Intensity as the Key Predictor
Based on ecological relevance and statistical considerations, **traffic intensity** was retained as the predictor representing road-related mortality. Traffic intensity directly measures a key source of mortality for amphibians, while NOx emissions, though correlated, do not provide as direct a connection to mortality rates. This decision aligns with ecological evidence and ensures the inclusion of a meaningful variable in the model.


In [32]:
# Directory to save results
output_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM"
os.makedirs(output_dir, exist_ok=True)

# Loop through each species for GLM training
for species in loaded_train_data.keys():
    print(f"Retraining GLM for {species} with adjusted predictors...")
    
    # Get the GLM dataset for the species
    data = loaded_train_data[species]["GLM"]
    X = data.drop(columns=["label"])  # Predictors
    y = data["label"]  # Response variable (presence/absence)
    
    # Remove the multicollinear predictor (NOx_Stand_Reversed.tif) from predictors
    predictor_to_remove = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif"
    if predictor_to_remove in X.columns:
        X = X.drop(columns=[predictor_to_remove])
        print(f"Removed predictor: {predictor_to_remove}")

    # Split the data into training (70%) and testing (30%) subsets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train the GLM with class_weight='balanced'
    glm = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
    glm.fit(X_train, y_train)

    # Evaluate the model on the test set
    y_pred = glm.predict(X_test)
    y_pred_prob = glm.predict_proba(X_test)[:, 1]

    # Calculate evaluation metrics
    roc_auc = roc_auc_score(y_test, y_pred_prob)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    print(f"  ROC-AUC for {species}: {roc_auc:.3f}")
    print(f"  Confusion Matrix:\n{conf_matrix}")
    print(f"  Classification Report:\n{class_report}")

    # Plot the ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
    plt.figure()
    plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {roc_auc:.3f})")
    plt.plot([0, 1], [0, 1], 'k--', label="Random Guess")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title(f"ROC Curve for {species} (GLM with Adjusted Predictors)")
    plt.legend()
    plt.savefig(f"{output_dir}/{species}_GLM_ROC_Curve_Adjusted.png")
    plt.close()

    # Save the model
    model_path = f"{output_dir}/{species}_GLM_Model_Adjusted.pkl"
    joblib.dump(glm, model_path)
    print(f"  Adjusted model saved for {species} at {model_path}")

    # Save evaluation metrics
    metrics_path = f"{output_dir}/{species}_GLM_Metrics_Adjusted.txt"
    with open(metrics_path, "w") as f:
        f.write(f"ROC-AUC: {roc_auc:.3f}\n")
        f.write(f"Confusion Matrix:\n{conf_matrix}\n")
        f.write(f"Classification Report:\n{class_report}\n")
    print(f"  Adjusted metrics saved for {species} at {metrics_path}")

print("GLM retraining with adjusted predictors complete!")


Retraining GLM for Bufo bufo with adjusted predictors...
Removed predictor: C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif
  ROC-AUC for Bufo bufo: 0.829
  Confusion Matrix:
[[856 330]
 [ 30 111]]
  Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.72      0.83      1186
           1       0.25      0.79      0.38       141

    accuracy                           0.73      1327
   macro avg       0.61      0.75      0.60      1327
weighted avg       0.89      0.73      0.78      1327

  Adjusted model saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM/Bufo bufo_GLM_Model_Adjusted.pkl
  Adjusted metrics saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM/Bufo bufo_GLM_Metrics_Adjusted.txt
Retraining GLM for Rana temporaria with adjusted predictors...
Removed predictor: C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/N

### **Interpretation of Results**

#### 1. **Performance Consistency**:
- Removing the multicollinear predictor (`NOx_Stand_Reversed.tif`) had **minimal to no impact** on **ROC-AUC** and the other performance metrics. This suggests that the predictor contributed little unique information, which is consistent with its redundancy with `Traffic_Reversed.tif`.

#### 2. **High False Positives**:
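- For all species, **false positives** (presence predicted where the species was not recorded) remain relatively high. For *Bufo bufo*, 330 of the 1,186 pseudo-absences in the test set were classified as presences. This may reflect noise in the **pseudo-absence data**, since pseudo-absences are not confirmed absences, or the limited ability of the GLM to separate presences from pseudo-absences.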

#### 3. **Improvements in Recall**:
- **Recall** (sensitivity) for the presence class (Class 1) **remains high** across species (0.79 for *Bufo bufo*), which is crucial for **conservation-focused studies** where capturing presences matters more than overall accuracy. The high recall suggests that the `class_weight='balanced'` setting keeps the model sensitive to the minority class (presences).

#### 4. **Challenges with Precision**:
- **Precision** (proportion of correctly identified presences out of total predicted presences) remains **low**, indicating that many of the predicted presences are **false positives**.
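
To make the link between these points and the printed output explicit, the snippet below recomputes sensitivity, specificity, precision and accuracy from the *Bufo bufo* confusion matrix shown above. It assumes scikit-learn's layout for `confusion_matrix` (rows are true classes, columns are predicted classes, class 0 first); the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix printed for Bufo bufo: rows = true class, columns = predicted class.
conf_matrix = np.array([[856, 330],
                        [30, 111]])
tn, fp, fn, tp = conf_matrix.ravel()

sensitivity = tp / (tp + fn)               # recall for presences (class 1)
specificity = tn / (tn + fp)               # recall for pseudo-absences (class 0)
precision = tp / (tp + fp)                 # share of predicted presences that are true
accuracy = (tp + tn) / conf_matrix.sum()   # overall proportion classified correctly

print(f"Sensitivity: {sensitivity:.2f}")
print(f"Specificity: {specificity:.2f}")
print(f"Precision:   {precision:.2f}")
print(f"Accuracy:    {accuracy:.2f}")
```

The results agree with the classification report: the model finds most presences (111 of 141), but because 330 pseudo-absences are also predicted as presences, only about a quarter of predicted presences are correct.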



---

### Step 3: Adding Regularisation

In [33]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score

# Directory to save results
output_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM_Regularised"
os.makedirs(output_dir, exist_ok=True)

# Loop through each species for Ridge and Lasso regularisation
for species, models in loaded_train_data.items():  # Use loaded_train_data instead of loaded_data
    print(f"Training GLM with regularisation for {species}...")
    
    # Get the GLM training dataset for the species
    data = models["GLM"]
    X = data.drop(columns=["label"])  # Predictors
    y = data["label"]  # Response variable (presence/absence)

    # Split the data into training (70%) and testing (30%) subsets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Define Ridge and Lasso models with cross-validation
    ridge_model = LogisticRegressionCV(cv=5, penalty='l2', solver='liblinear', max_iter=1000, scoring='roc_auc', class_weight='balanced', random_state=42)
    lasso_model = LogisticRegressionCV(cv=5, penalty='l1', solver='liblinear', max_iter=1000, scoring='roc_auc', class_weight='balanced', random_state=42)

    # Train Ridge model
    ridge_model.fit(X_train, y_train)
    ridge_y_pred = ridge_model.predict(X_test)
    ridge_y_pred_prob = ridge_model.predict_proba(X_test)[:, 1]

    # Train Lasso model
    lasso_model.fit(X_train, y_train)
    lasso_y_pred = lasso_model.predict(X_test)
    lasso_y_pred_prob = lasso_model.predict_proba(X_test)[:, 1]

    # Evaluate Ridge model
    ridge_roc_auc = roc_auc_score(y_test, ridge_y_pred_prob)
    ridge_conf_matrix = confusion_matrix(y_test, ridge_y_pred)
    ridge_class_report = classification_report(y_test, ridge_y_pred)

    # Evaluate Lasso model
    lasso_roc_auc = roc_auc_score(y_test, lasso_y_pred_prob)
    lasso_conf_matrix = confusion_matrix(y_test, lasso_y_pred)
    lasso_class_report = classification_report(y_test, lasso_y_pred)

    # Save Ridge results
    ridge_model_path = f"{output_dir}/{species}_GLM_Ridge_Model.pkl"
    joblib.dump(ridge_model, ridge_model_path)
    print(f"  Ridge model saved for {species} at {ridge_model_path}")
    
    ridge_metrics_path = f"{output_dir}/{species}_GLM_Ridge_Metrics.txt"
    with open(ridge_metrics_path, "w") as f:
        f.write(f"ROC-AUC: {ridge_roc_auc:.3f}\n")
        f.write(f"Confusion Matrix:\n{ridge_conf_matrix}\n")
        f.write(f"Classification Report:\n{ridge_class_report}\n")
    print(f"  Ridge metrics saved for {species} at {ridge_metrics_path}")

    # Save Lasso results
    lasso_model_path = f"{output_dir}/{species}_GLM_Lasso_Model.pkl"
    joblib.dump(lasso_model, lasso_model_path)
    print(f"  Lasso model saved for {species} at {lasso_model_path}")
    
    lasso_metrics_path = f"{output_dir}/{species}_GLM_Lasso_Metrics.txt"
    with open(lasso_metrics_path, "w") as f:
        f.write(f"ROC-AUC: {lasso_roc_auc:.3f}\n")
        f.write(f"Confusion Matrix:\n{lasso_conf_matrix}\n")
        f.write(f"Classification Report:\n{lasso_class_report}\n")
    print(f"  Lasso metrics saved for {species} at {lasso_metrics_path}")

    # Print summary
    print(f"  Ridge ROC-AUC for {species}: {ridge_roc_auc:.3f}")
    print(f"  Lasso ROC-AUC for {species}: {lasso_roc_auc:.3f}")

print("GLM training with regularisation complete!")


Training GLM with regularisation for Bufo bufo...
  Ridge model saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM_Regularised/Bufo bufo_GLM_Ridge_Model.pkl
  Ridge metrics saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM_Regularised/Bufo bufo_GLM_Ridge_Metrics.txt
  Lasso model saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM_Regularised/Bufo bufo_GLM_Lasso_Model.pkl
  Lasso metrics saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM_Regularised/Bufo bufo_GLM_Lasso_Metrics.txt
  Ridge ROC-AUC for Bufo bufo: 0.831
  Lasso ROC-AUC for Bufo bufo: 0.830
Training GLM with regularisation for Rana temporaria...
  Ridge model saved for Rana temporaria at C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM_Regularised/Rana temporaria_GLM_Ridge_Model.pkl
  Ridge metrics saved for Rana temporaria at C:/GIS_Course/MScThesis-MaviSantarelli/results/GLM_Regularised/Rana temporaria_GLM_Ridge_Metrics.txt
  La

---

### **Analysis of Results**

#### 1. **Impact of Regularisation**:
- Regularisation stabilises the models by limiting overfitting and by reducing the sensitivity of the coefficients to multicollinearity.
- Differences between **Ridge** and **Lasso** are marginal, and neither is consistently better across species:
    - **Bufo bufo**: Ridge performs better than Lasso, with a slight improvement in ROC-AUC (0.831 vs 0.830).
    - **Rana temporaria**: Both Ridge and Lasso have identical performance, with a ROC-AUC of 0.821.
    - **Lissotriton helveticus**: Lasso performs slightly better than Ridge, with a ROC-AUC of 0.812 compared to 0.809.

#### 2. **Performance Consistency**:
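- The regularisation techniques do not drastically change the **ROC-AUC** scores relative to the earlier GLM (for *Bufo bufo*, 0.831 for Ridge and 0.830 for Lasso versus 0.829). Note that scikit-learn's `LogisticRegression` already applies an L2 penalty with `C=1.0` by default, so the earlier GLM was not strictly unregularised. The benefit of this step is therefore subtle: the penalty strength is selected by cross-validation, which helps **guard against overfitting** and limits the influence of correlated predictors on the coefficients.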

#### 3. **Species-Specific Observations**:
- ***Bufo bufo***: **Ridge** slightly outperforms **Lasso**, with the ROC-AUC of 0.831 compared to 0.830, suggesting that **Ridge** provides a more stable and consistent performance for this species.
- ***Rana temporaria***: The model is already robust, as evidenced by the unchanged performance (both Ridge and Lasso achieve a ROC-AUC of 0.821). Regularisation does not significantly affect the model’s effectiveness for this species.
- ***Lissotriton helveticus***: **Lasso** marginally outperforms **Ridge** (0.812 vs 0.809), suggesting that **Lasso’s** feature selection ability may be beneficial for this species, although the difference is too small to indicate a clear advantage.
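
The feature selection behaviour mentioned above can be illustrated on synthetic data: an L1 penalty can set some coefficients exactly to zero, while an L2 penalty only shrinks them. This is a minimal sketch under stated assumptions; the data come from `make_classification`, the value `C_demo = 0.05` is arbitrary, and none of it reproduces the species models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 3 informative predictors and 7 pure-noise predictors.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    random_state=42,
)

C_demo = 0.05  # fairly strong penalty, chosen only for illustration

ridge_demo = LogisticRegression(penalty="l2", C=C_demo, solver="liblinear")
lasso_demo = LogisticRegression(penalty="l1", C=C_demo, solver="liblinear")
ridge_demo.fit(X_demo, y_demo)
lasso_demo.fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
ridge_zeros = int(np.sum(ridge_demo.coef_ == 0))
lasso_zeros = int(np.sum(lasso_demo.coef_ == 0))

print(f"Ridge coefficients equal to zero: {ridge_zeros} of {ridge_demo.coef_.size}")
print(f"Lasso coefficients equal to zero: {lasso_zeros} of {lasso_demo.coef_.size}")
```

Applied to the fitted species models, the same count on `lasso_model.coef_` would show which predictors Lasso effectively dropped.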


---

### Step 4: Hyperparameter Tuning


In this step, we use **Grid Search** to tune the regularisation strength (`C`) for the **Ridge (L2)** and **Lasso (L1)** regularised logistic regression models. Regularisation helps prevent overfitting by adding a penalty term to the model's objective, which constrains the size of the learned coefficients.

The parameter `C` (or `Cs` in **LogisticRegressionCV**) is the inverse of the regularisation strength:
- **Lower values of `C`** indicate **stronger regularisation** (penalty), resulting in a simpler model: coefficients are shrunk towards zero under Ridge, and some are set exactly to zero under Lasso.
- **Higher values of `C`** indicate **weaker regularisation**, allowing the model to fit the training data more closely.
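
For reference, the role of `C` can be written out as a penalised objective. The form below follows the standard formulation for binary logistic regression with labels $y_i \in \{-1, +1\}$, weights $w$ and intercept $b$; it is a sketch that ignores class weights (which rescale the individual loss terms) and any overall scaling of the objective used by a particular implementation.

$$
\min_{w,\,b}\; R(w) + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\big(-y_i\,(x_i^{\top} w + b)\big)\right),
\qquad
R(w) =
\begin{cases}
\tfrac{1}{2}\lVert w \rVert_2^2 & \text{Ridge (L2)} \\
\lVert w \rVert_1 & \text{Lasso (L1)}
\end{cases}
$$

Because `C` multiplies the data-fit term rather than the penalty, a small `C` gives the penalty $R(w)$ relatively more weight, and a large `C` gives it less.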

By using **GridSearchCV**, we explore a range of possible values for `C` (from `0.01` to `100`) to find the best balance between model complexity and performance. The **ROC-AUC** score is used as the evaluation metric during cross-validation to determine the best regularisation strength.

This tuning step helps the model avoid both overfitting and underfitting, improving its ability to generalise.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Candidate values for the regularisation strength C (smaller C = stronger penalty)
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

# Note: X_train and y_train hold the split for the last species processed in the loop above.
# Wrap this cell in a per-species loop to tune each species separately.

# Grid search for Ridge (L2): each C is scored by 5-fold cross-validated ROC-AUC
ridge_grid_search = GridSearchCV(
    LogisticRegression(penalty='l2', solver='liblinear', max_iter=1000, class_weight='balanced', random_state=42),
    param_grid, cv=5, scoring='roc_auc')
ridge_grid_search.fit(X_train, y_train)
print("Best Ridge Model:", ridge_grid_search.best_params_)

# Grid search for Lasso (L1): same grid and scoring
lasso_grid_search = GridSearchCV(
    LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000, class_weight='balanced', random_state=42),
    param_grid, cv=5, scoring='roc_auc')
lasso_grid_search.fit(X_train, y_train)
print("Best Lasso Model:", lasso_grid_search.best_params_)


## References
- Glista, D. J., DeVault, T. L., & DeWoody, J. A. (2008). A review of mitigation measures for reducing wildlife mortality on roadways. *Herpetological Conservation and Biology, 3*(1), 16–28. Retrieved from [https://www.herpconbio.org/Volume_3/Issue_1/Glista_etal_2008.pdf](https://www.herpconbio.org/Volume_3/Issue_1/Glista_etal_2008.pdf)

- Graham, M. H. (2003). Confronting multicollinearity in ecological multiple regression. *Ecology, 84*(11), 2809–2815. Retrieved from [https://webhome.auburn.edu/~tds0009/Articles/Graham%202003.pdf](https://webhome.auburn.edu/~tds0009/Articles/Graham%202003.pdf)