# Addressing Sampling Bias in Species Distribution Models (SDMs)

## Table of Contents

1. [Introduction](#1.-Introduction)
2. [Model Re-runs Using Bias Files](#2.-Model-Re-runs-Using-Bias-Files)
3. [Generate SDM Predictions](#3.-Generate-SDM-Predictions)

## 1. Introduction
Species distribution models (SDMs) are essential tools for predicting the potential distribution of species from environmental variables and occurrence data. These models play a critical role in conservation planning, biodiversity management, and understanding ecological processes. However, the accuracy and reliability of SDMs can be significantly compromised by **sampling bias**, a common issue in which data collection effort is unevenly distributed across the study area, often because of easier access, proximity to research institutions, or observer preferences.

### 1.1 The Problem of Sampling Bias

Sampling bias occurs when certain areas within a study region are surveyed more intensively than others, leading to overrepresentation of species presence in those locations. This can result in models that inaccurately predict species distributions, reflecting survey effort rather than true ecological patterns. For example, urban areas or regions near research facilities may have more data points simply due to higher human activity and accessibility, while remote or rural areas remain underrepresented. This issue is particularly problematic in studies aiming to inform conservation strategies, as it may lead to the neglect of critical habitats that are under-surveyed.

### 1.2 Strategies to Mitigate Sampling Bias

To improve the robustness of SDMs and ensure that model outputs more accurately reflect true species distributions, several methods can be employed to address sampling bias:

1. **Geographic and Environmental Thinning**: This method involves filtering occurrence data to reduce spatial clustering, ensuring a more even distribution of data points across the study area. Geographic thinning selects records that are spatially separated, while environmental thinning ensures that data points represent a broad range of environmental conditions. Both approaches have been shown to effectively reduce the impact of sampling bias (Redding et al., 2021).

2. **Using Bias Files in Modeling Algorithms**: Incorporating bias files into algorithms like Maxent allows models to account for uneven survey efforts by weighting background points based on survey intensity. This helps to adjust predictions and improve model performance, leading to more accurate distribution maps (Phillips et al., 2009).

3. **Pooling Presence-Only and Presence-Absence Data**: Combining different types of data can help correct sampling bias by leveraging the strengths of each dataset. Probabilistic models that integrate presence-only and presence-absence data across multiple species can jointly analyze the data, adjusting for bias and improving estimation efficiency (Fithian et al., 2015).

4. **Model-Based Approaches to Sampling Bias**: Explicitly modeling the sampling process by including survey effort as a covariate can improve SDM accuracy. This approach acknowledges the non-random nature of data collection and adjusts predictions based on modeled sampling patterns (Robinson et al., 2017).

By implementing these strategies, researchers can enhance the reliability of SDMs, ensuring that predictions are driven by ecological factors rather than artifacts of data collection.
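As an illustration of the first strategy, the following is a minimal sketch of greedy geographic thinning. It is not the thinning procedure used in this project; the function name `thin_occurrences`, the 50 m threshold, and the toy coordinates are assumptions made for the example.

```python
import numpy as np

def thin_occurrences(coords, min_dist, seed=0):
    """Greedy geographic thinning: return indices of records kept so that
    no two kept records are closer than min_dist (same units as coords)."""
    coords = np.asarray(coords, dtype=float)
    rng = np.random.default_rng(seed)
    kept = []
    # Visit records in random order so the result does not depend on file order
    for idx in rng.permutation(len(coords)):
        if kept:
            dists = np.hypot(*(coords[kept] - coords[idx]).T)
            if dists.min() < min_dist:
                continue  # too close to a record that was already kept
        kept.append(int(idx))
    return sorted(kept)

# Two tight clusters far apart (illustrative coordinates in metres)
points = [(0, 0), (5, 0), (100, 100), (103, 100)]
kept_idx = thin_occurrences(points, min_dist=50)
print(kept_idx)
```

With these toy points, one record survives from each cluster. Which one is kept depends on the random order, so the seed should be fixed when reproducibility matters.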



## 2. Model Re-runs Using Bias Files
### 2.1 Load Required Libraries and Files

In [36]:
import geopandas as gpd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from scipy.stats import gaussian_kde
import rasterio
from rasterio.transform import from_origin
import os

In [37]:
# Define KDE bias layer paths for each species
bias_layers = {
    "Bufo_bufo": "C:/GIS_Course/MScThesis-MaviSantarelli/results/bias_layers/Bufo_bufo_bias_layer.tif",
    "Rana_temporaria": "C:/GIS_Course/MScThesis-MaviSantarelli/results/bias_layers/Rana_temporaria_bias_layer.tif",
    "Lissotriton_helveticus": "C:/GIS_Course/MScThesis-MaviSantarelli/results/bias_layers/Lissotriton_helveticus_bias_layer.tif"
}

# Load KDE bias layers
kde_bias_layers = {}
for species, bias_path in bias_layers.items():
    with rasterio.open(bias_path) as src:
        kde_bias_layers[species] = src.read(1)  # Load as numpy array
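Once a bias surface is available as a NumPy array, one way it could be used in the spirit of the bias-file strategy from Section 1.2 is to draw background cells with probability proportional to the bias value. The sketch below is a hedged illustration only: the function name `sample_background` and the toy array are hypothetical, and the notebook does not confirm that background points were generated this way.

```python
import numpy as np

def sample_background(bias, n_points, seed=0):
    """Draw n_points distinct grid cells with probability proportional to bias."""
    bias = np.asarray(bias, dtype=float)
    # Cells with missing or non-positive bias get zero sampling weight
    usable = np.isfinite(bias) & (bias > 0)
    weights = np.where(usable, bias, 0.0).ravel()
    if n_points > int(usable.sum()):
        raise ValueError("fewer usable cells than requested background points")
    rng = np.random.default_rng(seed)
    flat_idx = rng.choice(weights.size, size=n_points, replace=False,
                          p=weights / weights.sum())
    # Convert flat indices back to (row, col) positions in the raster
    return np.unravel_index(flat_idx, bias.shape)

toy_bias = np.array([[0.0, 1.0], [np.nan, 3.0]])  # toy surface, not real data
rows, cols = sample_background(toy_bias, n_points=2)
print(list(zip(rows.tolist(), cols.tolist())))
```

The returned row and column indices can then be turned into map coordinates with the raster's affine transform.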


In [38]:
import joblib

# Load GLM Models (Lasso and Ridge Regularization)
glm_lasso = joblib.load("C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GLM/Models/Bufo bufo_GLM_Lasso_Threshold_0.3_Model.pkl")
glm_ridge = joblib.load("C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GLM/Models/Bufo bufo_GLM_Ridge_Threshold_0.3_Model.pkl")

# Load GAM Model
gam_model = joblib.load("C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GAM/Bufo bufo_GAM_Model_CV.pkl")

# Load Random Forest Model
rf_model = joblib.load("C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/RandomForest/Bufo_bufo/RandomForest_Model.pkl")

# Load XGBoost Model
xgb_model = joblib.load("C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/XGBoost/Bufo_bufo/XGBoost_Model.pkl")

# Maxent is handled separately: .rds is R's serialization format, so the model is read through R rather than joblib
maxent_path = "C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Maxent/Maxent_Bufo_bufo.rds"


In [39]:
import rasterio
from rasterio.plot import show
import numpy as np
import os

# Define file paths to raster predictors
predictor_files = [
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/RGS_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Slope_Proj_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/SoilMoisture_32bit_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Wood_Resample_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Grass_Stand.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_median.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_StDev.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/VegHeight.tif"
]


In [40]:
import rasterio

# Check dimensions, resolution, and CRS of each raster
for file in predictor_files:
    with rasterio.open(file) as src:
        print(f"{os.path.basename(file)}")
        print(f"  Shape: {src.shape}")
        print(f"  Resolution: {src.res}")
        print(f"  CRS: {src.crs}")
        print(f"  Bounds: {src.bounds}\n")


Building_Density_Reversed.tif
  Shape: (5971, 6369)
  Resolution: (30.0, 30.0)
  CRS: EPSG:27700
  Bounds: BoundingBox(left=195957.49140268576, bottom=554940.9023270598, right=387027.49140268576, top=734070.9023270598)

DistWater_Reversed.tif
  Shape: (5971, 6369)
  Resolution: (30.0, 30.0)
  CRS: EPSG:27700
  Bounds: BoundingBox(left=195962.10163310927, bottom=554933.5372726112, right=387032.10163310927, top=734063.5372726112)

NOx_Stand_Reversed.tif
  Shape: (5971, 6369)
  Resolution: (30.0, 30.0)
  CRS: EPSG:27700
  Bounds: BoundingBox(left=195962.10163310927, bottom=554933.5372726112, right=387032.10163310927, top=734063.5372726112)

RGS_Reversed.tif
  Shape: (5971, 6369)
  Resolution: (30.0, 30.0)
  CRS: EPSG:27700
  Bounds: BoundingBox(left=195962.10163310927, bottom=554933.5372726112, right=387032.10163310927, top=734063.5372726112)

Runoff_Coefficient_Standardised_Reversed.tif
  Shape: (5970, 6369)
  Resolution: (30.0, 30.0)
  CRS: EPSG:27700
  Bounds: BoundingBox(left=195953.2
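The printed summaries already show that the grids do not all share the same extent (and one differs in shape). A small helper can flag such layers automatically instead of relying on visual inspection. This is a sketch under assumptions: `find_misaligned` is a hypothetical name, and the bounds below are rounded from the output above rather than read from the files.

```python
def find_misaligned(grids, reference_name, tol=1e-6):
    """Return names of grids whose shape or bounds differ from the reference."""
    ref = grids[reference_name]
    misaligned = []
    for name, spec in grids.items():
        same_shape = spec["shape"] == ref["shape"]
        same_bounds = all(abs(a - b) <= tol
                          for a, b in zip(spec["bounds"], ref["bounds"]))
        if not (same_shape and same_bounds):
            misaligned.append(name)
    return misaligned

# Bounds given as (left, bottom, right, top), rounded from the printed output
grids = {
    "Building_Density_Reversed.tif": {
        "shape": (5971, 6369),
        "bounds": (195957.49, 554940.90, 387027.49, 734070.90),
    },
    "DistWater_Reversed.tif": {
        "shape": (5971, 6369),
        "bounds": (195962.10, 554933.54, 387032.10, 734063.54),
    },
    "NOx_Stand_Reversed.tif": {
        "shape": (5971, 6369),
        "bounds": (195962.10, 554933.54, 387032.10, 734063.54),
    },
}
print(find_misaligned(grids, "Building_Density_Reversed.tif"))
```

In this toy input, both non-reference layers are reported because their extents are shifted by a few metres even though their shapes match the reference.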

In [41]:
from rasterio.warp import reproject, Resampling
import numpy as np

# Use 'Building_Density_Reversed.tif' as the reference raster
reference_raster_path = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif"

with rasterio.open(reference_raster_path) as ref_src:
    ref_meta = ref_src.meta.copy()
    ref_shape = ref_src.shape
    ref_transform = ref_src.transform
    ref_crs = ref_src.crs
    ref_bounds = ref_src.bounds

# Function to resample rasters to match reference raster
def resample_to_reference(raster_path, ref_meta, ref_transform, ref_crs, ref_shape):
    with rasterio.open(raster_path) as src:
        if (src.shape != ref_shape) or (src.transform != ref_transform):
            resampled_data = np.empty(ref_shape, dtype=src.dtypes[0])

            reproject(
                source=rasterio.band(src, 1),
                destination=resampled_data,
                src_transform=src.transform,
                src_crs=src.crs,
                dst_transform=ref_transform,
                dst_crs=ref_crs,
                resampling=Resampling.bilinear  # Use 'nearest' for categorical data
            )
            return resampled_data
        else:
            return src.read(1)

# Resample and align all rasters
aligned_rasters = []
for file in predictor_files:
    aligned_raster = resample_to_reference(file, ref_meta, ref_transform, ref_crs, ref_shape)
    aligned_rasters.append(aligned_raster)

# Stack aligned rasters into a 3D numpy array
predictors_array = np.stack(aligned_rasters, axis=-1)
print("Aligned predictors shape:", predictors_array.shape)


Aligned predictors shape: (5971, 6369, 13)


### 2.2 Add KDE Bias Layer to Aligned Predictors

In [42]:
# Load KDE Bias Layer
with rasterio.open("C:/GIS_Course/MScThesis-MaviSantarelli/results/bias_layers/Bufo_bufo_bias_layer.tif") as bias_src:
    if (bias_src.shape != ref_shape) or (bias_src.transform != ref_transform):
        kde_bias = np.empty(ref_shape, dtype=bias_src.dtypes[0])

        reproject(
            source=rasterio.band(bias_src, 1),
            destination=kde_bias,
            src_transform=bias_src.transform,
            src_crs=bias_src.crs,
            dst_transform=ref_transform,
            dst_crs=ref_crs,
            resampling=Resampling.bilinear
        )
    else:
        kde_bias = bias_src.read(1)

# Add KDE Bias as the last predictor
predictors_with_bias = np.dstack((predictors_array, kde_bias))
print("Predictors with KDE Bias shape:", predictors_with_bias.shape)


Predictors with KDE Bias shape: (5971, 6369, 14)
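When survey effort enters a model as a covariate, one option at prediction time is to fix that covariate at a single reference value, so that mapped differences come from the environmental predictors rather than from effort. The notebook does not state whether this is done here, so the sketch below is only an illustration; `hold_bias_constant` is a hypothetical helper and the choice of the mean as reference value is an assumption.

```python
import numpy as np

def hold_bias_constant(flat_predictors, bias_col=-1, value=None):
    """Return a copy of the predictor table with the bias column set to one value."""
    X = np.array(flat_predictors, dtype=float)  # np.array copies the input
    if value is None:
        # Assumed default: the mean of the observed bias values (NaNs ignored)
        value = np.nanmean(X[:, bias_col])
    X[:, bias_col] = value
    return X

# Toy table: two environmental columns plus the bias surface as the last column
toy = np.array([[0.1, 5.0, 0.2],
                [0.3, 6.0, 0.6],
                [0.5, 7.0, 1.0]])
fixed = hold_bias_constant(toy)
print(fixed[:, -1])
```

The environmental columns are left untouched, and the original array is not modified.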


### 2.3 Flatten for Model Predictions

In [43]:
# Flatten predictors for model input
num_rows, num_cols, num_predictors = predictors_with_bias.shape
flat_predictors = predictors_with_bias.reshape((num_rows * num_cols, num_predictors))

# Check for NaN values and create a mask for valid data
valid_mask = ~np.isnan(flat_predictors).any(axis=1)
flat_predictors_valid = flat_predictors[valid_mask]

print("Valid predictors shape:", flat_predictors_valid.shape)


Valid predictors shape: (38029299, 14)
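The number of valid rows equals 5971 × 6369 = 38,029,299, so the NaN mask removed no cells at all. One possible explanation, which would need checking against the files, is that missing cells are stored as a nodata value (available in rasterio as `src.nodata`) rather than as NaN. The sketch below shows how such values could be converted to NaN before masking; `mask_nodata` and the `-9999` value are assumptions made for illustration.

```python
import numpy as np

def mask_nodata(stack, nodata_values):
    """Replace each band's nodata value with NaN in a (rows, cols, bands) stack."""
    stack = np.asarray(stack, dtype="float32").copy()
    for band, nodata in enumerate(nodata_values):
        if nodata is not None:
            layer = stack[..., band]          # view into the copied stack
            layer[layer == nodata] = np.nan   # in-place update of that band
    return stack

# Toy stack with 2 bands; band 0 uses -9999 as nodata, band 1 declares none
toy_stack = np.array([[[1.0, 10.0], [-9999.0, 20.0]],
                      [[3.0, 30.0], [4.0, 40.0]]])
masked = mask_nodata(toy_stack, nodata_values=[-9999.0, None])
flat = masked.reshape(-1, masked.shape[-1])
valid = ~np.isnan(flat).any(axis=1)
print(int(valid.sum()), "of", flat.shape[0], "cells are valid")
```

After this step, the `~np.isnan(...).any(axis=1)` mask used above would exclude cells that are missing in any predictor.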


### 2.4 Retrain Models

In [23]:
import pandas as pd

# Define file paths for training data only (partitioned data)
partitioned_train_files = {
    "Bufo bufo": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_GLM_subsampled_train.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_GAM_subsampled_train.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_Maxent_subsampled_train.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_RF_subsampled_run{i}_train.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_XGBoost_subsampled_run{i}_train.csv" for i in range(1, 11)]
    },
    "Rana temporaria": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_GLM_subsampled_train.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_GAM_subsampled_train.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_Maxent_subsampled_train.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_RF_subsampled_run{i}_train.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_XGBoost_subsampled_run{i}_train.csv" for i in range(1, 11)]
    },
    "Lissotriton helveticus": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_GLM_subsampled_train.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_GAM_subsampled_train.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_Maxent_subsampled_train.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_RF_subsampled_run{i}_train.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_XGBoost_subsampled_run{i}_train.csv" for i in range(1, 11)]
    }
}

# Load the training data for each species and model into a dictionary
loaded_train_data = {}

for species, models in partitioned_train_files.items():
    print(f"Loading training data for {species}...")
    loaded_train_data[species] = {}
    
    for model_name, file_paths in models.items():
        print(f"  Loading training data for model: {model_name}...")
        
        # Handle single-file models (GLM, GAM, Maxent)
        if isinstance(file_paths, str):  # Single file for training
            loaded_train_data[species][model_name] = pd.read_csv(file_paths)
        else:  # Handle iterative models (RF, XGBoost)
            loaded_train_data[species][model_name] = [pd.read_csv(file_path) for file_path in file_paths]


Loading training data for Bufo bufo...
  Loading training data for model: GLM...
  Loading training data for model: GAM...
  Loading training data for model: Maxent...
  Loading training data for model: RF...
  Loading training data for model: XGBoost...
Loading training data for Rana temporaria...
  Loading training data for model: GLM...
  Loading training data for model: GAM...
  Loading training data for model: Maxent...
  Loading training data for model: RF...
  Loading training data for model: XGBoost...
Loading training data for Lissotriton helveticus...
  Loading training data for model: GLM...
  Loading training data for model: GAM...
  Loading training data for model: Maxent...
  Loading training data for model: RF...
  Loading training data for model: XGBoost...


In [24]:
import pandas as pd

# Define file paths for test data only (partitioned data)
partitioned_test_files = {
    "Bufo bufo": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_GLM_subsampled_test.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_GAM_subsampled_test.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_Maxent_subsampled_test.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_RF_subsampled_run{i}_test.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Bufo bufo_XGBoost_subsampled_run{i}_test.csv" for i in range(1, 11)]
    },
    "Rana temporaria": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_GLM_subsampled_test.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_GAM_subsampled_test.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_Maxent_subsampled_test.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_RF_subsampled_run{i}_test.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Rana temporaria_XGBoost_subsampled_run{i}_test.csv" for i in range(1, 11)]
    },
    "Lissotriton helveticus": {
        "GLM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_GLM_subsampled_test.csv",
        "GAM": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_GAM_subsampled_test.csv",
        "Maxent": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_Maxent_subsampled_test.csv",
        "RF": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_RF_subsampled_run{i}_test.csv" for i in range(1, 11)],
        "XGBoost": [f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Partitioned/Lissotriton helveticus_XGBoost_subsampled_run{i}_test.csv" for i in range(1, 11)]
    }
}

# Load the test data for each species and model into a dictionary
loaded_test_data = {}

for species, models in partitioned_test_files.items():
    print(f"Loading test data for {species}...")
    loaded_test_data[species] = {}
    
    for model_name, file_paths in models.items():
        print(f"  Loading test data for model: {model_name}...")
        
        # Handle single-file models (GLM, GAM, Maxent)
        if isinstance(file_paths, str):  # Single file for test
            loaded_test_data[species][model_name] = pd.read_csv(file_paths)
        else:  # Handle iterative models (RF, XGBoost)
            loaded_test_data[species][model_name] = [pd.read_csv(file_path) for file_path in file_paths]

Loading test data for Bufo bufo...
  Loading test data for model: GLM...
  Loading test data for model: GAM...
  Loading test data for model: Maxent...
  Loading test data for model: RF...
  Loading test data for model: XGBoost...
Loading test data for Rana temporaria...
  Loading test data for model: GLM...
  Loading test data for model: GAM...
  Loading test data for model: Maxent...
  Loading test data for model: RF...
  Loading test data for model: XGBoost...
Loading test data for Lissotriton helveticus...
  Loading test data for model: GLM...
  Loading test data for model: GAM...
  Loading test data for model: Maxent...
  Loading test data for model: RF...
  Loading test data for model: XGBoost...
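The held-out test tables can be used to compare models before and after adding the bias covariate. A common discrimination metric is the area under the ROC curve (AUC). The following is a minimal NumPy sketch of AUC, assuming the test tables carry a `label` column like the training tables; the function name `auc_score` and the toy values are illustrative only.

```python
import numpy as np

def auc_score(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores for
    label 1 against scores for label 0 (ties count as half)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one record of each class")
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

labels = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, predicted))
```

The pairwise comparison needs memory proportional to the number of presences times the number of absences, so for large test sets `sklearn.metrics.roc_auc_score` is the more practical choice.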


#### **GLM**

In [28]:
from sklearn.linear_model import LogisticRegressionCV
import joblib
import rasterio

# Define species list and file paths
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Iterate through each species to retrain the GLM model with KDE bias
for species in species_list:
    print(f"Retraining GLM model for {species} with KDE bias...")
    
    # Load the partitioned training data for GLM
    train_data_glm = loaded_train_data[species]["GLM"]
    
    # Load the corresponding KDE bias layer
    kde_bias_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/bias_layers/{species.replace(' ', '_')}_bias_layer.tif"
    
    with rasterio.open(kde_bias_path) as src:
        kde_bias = src.read(1).flatten()
    
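    # Truncate the flattened bias raster to the number of training rows.
    # Caution: this pairs rows with pixels by raster order (top-left onward),
    # not by each record's location.
    train_data_glm['kde_bias'] = kde_bias[:len(train_data_glm)]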
    
    # Define predictors and target variable
    X_train = train_data_glm.drop(columns=['label'])  # Assuming 'label' is the target column
    y_train = train_data_glm['label']
    
    # Retrain the GLM model with Lasso regularization
    glm_lasso = LogisticRegressionCV(cv=5, penalty='l1', solver='liblinear', max_iter=1000)
    glm_lasso.fit(X_train, y_train)
    
    # Save the retrained GLM model
    model_save_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GLM/Models/{species.replace(' ', '_')}_GLM_Lasso_with_KDE.pkl"
    joblib.dump(glm_lasso, model_save_path)
    
    print(f"GLM model retrained and saved for {species}: {model_save_path}\n")


Retraining GLM model for Bufo bufo with KDE bias...
GLM model retrained and saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GLM/Models/Bufo_bufo_GLM_Lasso_with_KDE.pkl

Retraining GLM model for Rana temporaria with KDE bias...
GLM model retrained and saved for Rana temporaria: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GLM/Models/Rana_temporaria_GLM_Lasso_with_KDE.pkl

Retraining GLM model for Lissotriton helveticus with KDE bias...
GLM model retrained and saved for Lissotriton helveticus: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GLM/Models/Lissotriton_helveticus_GLM_Lasso_with_KDE.pkl
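Slicing the flattened raster to the length of the training table assigns bias values by pixel order rather than by location. If the training tables contain map coordinates in the raster's CRS, the bias value at each record could instead be looked up from the grid. The sketch below assumes a north-up raster with square pixels; the function name and its parameters (`left`, `top`, `res`) are hypothetical, and the notebook does not confirm that coordinate columns exist in the partitioned CSV files.

```python
import numpy as np

def sample_grid_at_points(grid, left, top, res, xs, ys):
    """Look up grid values at map coordinates for a north-up raster.

    left/top: map coordinates of the raster's upper-left corner
    res: pixel size (square pixels assumed)
    Points falling outside the grid get NaN.
    """
    grid = np.asarray(grid, dtype=float)
    cols = np.floor((np.asarray(xs, dtype=float) - left) / res).astype(int)
    rows = np.floor((top - np.asarray(ys, dtype=float)) / res).astype(int)
    inside = ((rows >= 0) & (rows < grid.shape[0]) &
              (cols >= 0) & (cols < grid.shape[1]))
    values = np.full(cols.shape, np.nan)
    values[inside] = grid[rows[inside], cols[inside]]
    return values

# Toy 2 x 2 grid with 30 m pixels and upper-left corner at (0, 60)
toy_grid = np.array([[1.0, 2.0], [3.0, 4.0]])
vals = sample_grid_at_points(toy_grid, left=0, top=60, res=30,
                             xs=[15, 45, 100], ys=[45, 15, 100])
print(vals)
```

With rasterio, the same lookup is available through the dataset's `sample` method, which takes an iterable of `(x, y)` pairs.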



#### **GAM**

In [29]:
from pygam import LogisticGAM
import joblib
import rasterio

# Iterate through each species to retrain GAM models with KDE bias
for species in species_list:
    print(f"Retraining GAM model for {species} with KDE bias...")
    
    # Load the partitioned training data for GAM
    train_data_gam = loaded_train_data[species]["GAM"]
    
    # Load the corresponding KDE bias layer
    kde_bias_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/bias_layers/{species.replace(' ', '_')}_bias_layer.tif"
    
    with rasterio.open(kde_bias_path) as src:
        kde_bias = src.read(1).flatten()
    
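    # Truncate the flattened bias raster to the number of training rows.
    # Caution: this pairs rows with pixels by raster order (top-left onward),
    # not by each record's location.
    train_data_gam['kde_bias'] = kde_bias[:len(train_data_gam)]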
    
    # Define predictors and target variable
    X_train = train_data_gam.drop(columns=['label'])  # Assuming 'label' is the target column
    y_train = train_data_gam['label']
    
    # Retrain GAM model
    gam_model = LogisticGAM().fit(X_train, y_train)
    
    # Save the retrained GAM model
    model_save_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GAM/{species.replace(' ', '_')}_GAM_Model_with_KDE.pkl"
    joblib.dump(gam_model, model_save_path)
    
    print(f"GAM model retrained and saved for {species}: {model_save_path}\n")


Retraining GAM model for Bufo bufo with KDE bias...
GAM model retrained and saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GAM/Bufo_bufo_GAM_Model_with_KDE.pkl

Retraining GAM model for Rana temporaria with KDE bias...
GAM model retrained and saved for Rana temporaria: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GAM/Rana_temporaria_GAM_Model_with_KDE.pkl

Retraining GAM model for Lissotriton helveticus with KDE bias...
GAM model retrained and saved for Lissotriton helveticus: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GAM/Lissotriton_helveticus_GAM_Model_with_KDE.pkl



#### **RF**

In [30]:
from sklearn.ensemble import RandomForestClassifier
import joblib
import rasterio

# Retrain Random Forest models for all species
for species in species_list:
    print(f"Retraining Random Forest models for {species} with KDE bias...")
    
    rf_runs = loaded_train_data[species]["RF"]
    rf_models = []
    
    # Load the KDE bias layer
    kde_bias_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/bias_layers/{species.replace(' ', '_')}_bias_layer.tif"
    with rasterio.open(kde_bias_path) as src:
        kde_bias = src.read(1).flatten()
    
    # Retrain RF for each subsampled run
    for i, run_data in enumerate(rf_runs, start=1):
        # Caution: values are assigned by raster pixel order, not by record location
        run_data['kde_bias'] = kde_bias[:len(run_data)]
        
        X_train = run_data.drop(columns=['label'])
        y_train = run_data['label']
        
        rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
        rf_model.fit(X_train, y_train)
        
        # Save each model
        model_save_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/RandomForest/{species.replace(' ', '_')}/RandomForest_Model_run{i}_with_KDE.pkl"
        joblib.dump(rf_model, model_save_path)
        rf_models.append(rf_model)
        
        print(f"Run {i} model saved for {species}: {model_save_path}")
    
    print(f"All Random Forest models retrained for {species}.\n")


Retraining Random Forest models for Bufo bufo with KDE bias...
Run 1 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/RandomForest/Bufo_bufo/RandomForest_Model_run1_with_KDE.pkl
Run 2 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/RandomForest/Bufo_bufo/RandomForest_Model_run2_with_KDE.pkl
Run 3 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/RandomForest/Bufo_bufo/RandomForest_Model_run3_with_KDE.pkl
Run 4 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/RandomForest/Bufo_bufo/RandomForest_Model_run4_with_KDE.pkl
Run 5 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/RandomForest/Bufo_bufo/RandomForest_Model_run5_with_KDE.pkl
Run 6 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/RandomForest/Bufo_bufo/RandomForest_Model_run6_with_KDE.pkl
Run 7 model saved for Bufo bufo: C:/GIS_Course/MScThesi

In [65]:
import joblib
import os

# Define species list
species_list = ["Bufo_bufo", "Rana_temporaria", "Lissotriton_helveticus"]

# Initialize dictionary to store Random Forest models for each species
rf_models_all_species = {}

# Load Random Forest models for each species
for species in species_list:
    print(f"Loading Random Forest models for {species}...")
    
    rf_model_dir = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/RandomForest/{species}/"
    rf_models = []
    
    # Load all 10 iterations
    for i in range(1, 11):
        model_path = os.path.join(rf_model_dir, f"RandomForest_Model_run{i}_with_KDE.pkl")
        rf_models.append(joblib.load(model_path))
    
    rf_models_all_species[species] = rf_models
    print(f"{len(rf_models)} Random Forest models loaded for {species}.\n")


Loading Random Forest models for Bufo_bufo...
10 Random Forest models loaded for Bufo_bufo.

Loading Random Forest models for Rana_temporaria...
10 Random Forest models loaded for Rana_temporaria.

Loading Random Forest models for Lissotriton_helveticus...
10 Random Forest models loaded for Lissotriton_helveticus.
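With ten fitted models per species, one straightforward way to obtain a single prediction is to average the predicted presence probability across runs. The sketch below relies only on the `predict_proba` method that scikit-learn and XGBoost classifiers provide. `ConstantModel` is a hypothetical stand-in used so the example runs without the saved models, and averaging is an assumed aggregation rule, not one confirmed by the notebook.

```python
import numpy as np

def mean_presence_probability(models, X):
    """Average the class-1 probability over a list of fitted classifiers."""
    probs = [model.predict_proba(X)[:, 1] for model in models]
    return np.mean(probs, axis=0)

class ConstantModel:
    """Hypothetical stand-in that mimics a classifier's predict_proba."""
    def __init__(self, p):
        self.p = p

    def predict_proba(self, X):
        n = len(X)
        # Column 0: probability of class 0, column 1: probability of class 1
        return np.column_stack([np.full(n, 1 - self.p), np.full(n, self.p)])

X_demo = np.zeros((3, 14))  # 3 cells, 14 predictors (13 rasters + bias)
ensemble = mean_presence_probability(
    [ConstantModel(0.2), ConstantModel(0.6)], X_demo)
print(ensemble)
```

In the notebook's terms, the list of models would be `rf_models_all_species[species]` and `X` the valid rows of the flattened predictor table.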



#### **XGBoost**

In [31]:
from xgboost import XGBClassifier
import joblib
import rasterio

# Retrain XGBoost models for all species
for species in species_list:
    print(f"Retraining XGBoost models for {species} with KDE bias...")
    
    xgb_runs = loaded_train_data[species]["XGBoost"]
    xgb_models = []
    
    # Load the KDE bias layer
    kde_bias_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/bias_layers/{species.replace(' ', '_')}_bias_layer.tif"
    with rasterio.open(kde_bias_path) as src:
        kde_bias = src.read(1).flatten()
    
    # Retrain XGBoost for each subsampled run
    for i, run_data in enumerate(xgb_runs, start=1):
        # Caution: values are assigned by raster pixel order, not by record location
        run_data['kde_bias'] = kde_bias[:len(run_data)]
        
        X_train = run_data.drop(columns=['label'])
        y_train = run_data['label']
        
        xgb_model = XGBClassifier(n_estimators=100, random_state=42)
        xgb_model.fit(X_train, y_train)
        
        # Save each model
        model_save_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/XGBoost/{species.replace(' ', '_')}/XGBoost_Model_run{i}_with_KDE.pkl"
        joblib.dump(xgb_model, model_save_path)
        xgb_models.append(xgb_model)
        
        print(f"Run {i} model saved for {species}: {model_save_path}")
    
    print(f"All XGBoost models retrained for {species}.\n")


Retraining XGBoost models for Bufo bufo with KDE bias...
Run 1 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/XGBoost/Bufo_bufo/XGBoost_Model_run1_with_KDE.pkl
Run 2 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/XGBoost/Bufo_bufo/XGBoost_Model_run2_with_KDE.pkl
Run 3 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/XGBoost/Bufo_bufo/XGBoost_Model_run3_with_KDE.pkl
Run 4 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/XGBoost/Bufo_bufo/XGBoost_Model_run4_with_KDE.pkl
Run 5 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/XGBoost/Bufo_bufo/XGBoost_Model_run5_with_KDE.pkl
Run 6 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/XGBoost/Bufo_bufo/XGBoost_Model_run6_with_KDE.pkl
Run 7 model saved for Bufo bufo: C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/XGBoost/Bufo_bufo/XGBoost_Model_ru

In [66]:
# Initialize dictionary to store XGBoost models for each species
xgb_models_all_species = {}

# Load XGBoost models for each species
for species in species_list:
    print(f"Loading XGBoost models for {species}...")
    
    xgb_model_dir = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/XGBoost/{species}/"
    xgb_models = []
    
    # Load all 10 iterations
    for i in range(1, 11):
        model_path = os.path.join(xgb_model_dir, f"XGBoost_Model_run{i}_with_KDE.pkl")
        xgb_models.append(joblib.load(model_path))
    
    xgb_models_all_species[species] = xgb_models
    print(f"{len(xgb_models)} XGBoost models loaded for {species}.\n")


Loading XGBoost models for Bufo_bufo...
10 XGBoost models loaded for Bufo_bufo.

Loading XGBoost models for Rana_temporaria...
10 XGBoost models loaded for Rana_temporaria.

Loading XGBoost models for Lissotriton_helveticus...
10 XGBoost models loaded for Lissotriton_helveticus.



#### **Maxent**

In [33]:
# Import RPy2 and required Python libraries
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri, numpy2ri
import rasterio
import pandas as pd

# Activate automatic conversion between Pandas, NumPy, and R DataFrames
pandas2ri.activate()
numpy2ri.activate()

# Load required R packages
robjects.r('''
    library(dismo)
    library(rJava)
    library(raster)
    library(sp)
''')

print("R connection established and packages loaded successfully!")


R[write to console]: Loading required package: raster

R[write to console]: Loading required package: sp



R connection established and packages loaded successfully!


In [34]:
# Define the species list
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]

# Load KDE bias layers and transfer to R
for species in species_list:
    kde_bias_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/bias_layers/{species.replace(' ', '_')}_bias_layer.tif"
    
    with rasterio.open(kde_bias_path) as src:
        kde_bias = src.read(1)  # Load as NumPy array
    
    # Transfer KDE bias layer to R
    robjects.globalenv[f"{species.replace(' ', '_')}_kde_bias"] = numpy2ri.py2rpy(kde_bias)
    
    print(f"KDE bias layer for {species} successfully transferred to R!")


KDE bias layer for Bufo bufo successfully transferred to R!
KDE bias layer for Rana temporaria successfully transferred to R!
KDE bias layer for Lissotriton helveticus successfully transferred to R!


In [35]:
# Extract the training datasets from the previously loaded dictionary
bufo_train = loaded_train_data["Bufo bufo"]["Maxent"]
rana_train = loaded_train_data["Rana temporaria"]["Maxent"]
lissotriton_train = loaded_train_data["Lissotriton helveticus"]["Maxent"]

# Convert Pandas DataFrames to R DataFrames
robjects.globalenv["bufo_train"] = pandas2ri.py2rpy(bufo_train)
robjects.globalenv["rana_train"] = pandas2ri.py2rpy(rana_train)
robjects.globalenv["lissotriton_train"] = pandas2ri.py2rpy(lissotriton_train)

print("Maxent training data successfully transferred to R!")


Maxent training data successfully transferred to R!


In [54]:
robjects.r('''
    library(dismo)
    
    # Simplified Maxent run without bias file
    predictor_cols <- setdiff(colnames(bufo_train), "label")  # Exclude label by name

    maxent_model <- maxent(
        x = bufo_train[, predictor_cols],
        p = bufo_train[, "label"],
        path = "C:/GIS_Course/MScThesis-MaviSantarelli/results/Maxent_Output/Bufo_bufo",
        args = c("autofeature=true", "randomtestpoints=20")
    )

    print("Maxent model for Bufo bufo executed successfully without bias file!")
''')


[1] "Maxent model for Bufo bufo executed successfully without bias file!"


In [61]:
robjects.r('''
    library(dismo)
    
    # Simplified Maxent run for Rana temporaria without bias file
    predictor_cols <- colnames(rana_train)[-length(colnames(rana_train))]  # Exclude label

    maxent_model <- maxent(
        x = rana_train[, predictor_cols],
        p = rana_train[, "label"],
        path = "C:/GIS_Course/MScThesis-MaviSantarelli/results/Maxent_Output/Rana_temporaria",
        args = c("autofeature=true", "randomtestpoints=20")
    )

    print("Maxent model for Rana temporaria executed successfully without bias file!")
''')


[1] "Maxent model for Rana temporaria executed successfully without bias file!"


In [62]:
robjects.r('''
    library(dismo)
    
    # Simplified Maxent run for Lissotriton helveticus without bias file
    predictor_cols <- colnames(lissotriton_train)[-length(colnames(lissotriton_train))]  # Exclude label

    maxent_model <- maxent(
        x = lissotriton_train[, predictor_cols],
        p = lissotriton_train[, "label"],
        path = "C:/GIS_Course/MScThesis-MaviSantarelli/results/Maxent_Output/Lissotriton_helveticus",
        args = c("autofeature=true", "randomtestpoints=20")
    )

    print("Maxent model for Lissotriton helveticus executed successfully without bias file!")
''')


[1] "Maxent model for Lissotriton helveticus executed successfully without bias file!"


In [56]:
robjects.r('''
    library(raster)

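    # Function to resample and save KDE bias raster with square pixels
    # NOTE: bias_layer arrives from Python as a plain matrix, so raster() assigns
    # the default 0-1 extent and no CRS; the original georeferencing is not carried over.
    save_resampled_bias_raster <- function(bias_layer, species_name) {
        bias_raster <- raster(bias_layer)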
        
        # Check the resolution of the raster
        res_x <- res(bias_raster)[1]  # Horizontal resolution
        res_y <- res(bias_raster)[2]  # Vertical resolution
        
        # If the resolutions are unequal, resample to square pixels
        if (res_x != res_y) {
            print(paste("Resampling", species_name, "bias raster to square pixels."))
            
            # Use the smaller resolution to avoid loss of detail
            new_res <- min(res_x, res_y)
            
            # Resample raster
            bias_raster <- resample(bias_raster, raster(extent(bias_raster), res = new_res), method = "bilinear")
        } else {
            print(paste("No resampling needed for", species_name, "- raster already has square pixels."))
        }
        
        # Save the resampled raster as ASCII
        output_path <- paste0("C:/GIS_Course/MScThesis-MaviSantarelli/results/Maxent_Output/", species_name, "_kde_bias.asc")
        writeRaster(bias_raster, output_path, format = "ascii", overwrite = TRUE)
        
        return(output_path)
    }
    
    # Save KDE bias layers with resampling if needed
    bufo_bias_path <- save_resampled_bias_raster(Bufo_bufo_kde_bias, "Bufo_bufo")
    rana_bias_path <- save_resampled_bias_raster(Rana_temporaria_kde_bias, "Rana_temporaria")
    lissotriton_bias_path <- save_resampled_bias_raster(Lissotriton_helveticus_kde_bias, "Lissotriton_helveticus")
''')

print("KDE bias layers resampled (if needed) and saved as ASCII rasters.")


[1] "No resampling needed for Bufo_bufo - raster already has square pixels."
[1] "No resampling needed for Rana_temporaria - raster already has square pixels."
[1] "No resampling needed for Lissotriton_helveticus - raster already has square pixels."
KDE bias layers resampled (if needed) and saved as ASCII rasters.
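
The square-pixel test above compares the two resolutions with exact inequality. Resolutions derived from floating-point extents can differ by rounding error alone, which would trigger a needless resample. A minimal Python sketch of a tolerance-based check follows; the helper name `has_square_pixels` and the tolerance value are assumptions for illustration, not part of the notebook.

```python
import math

def has_square_pixels(res_x, res_y, rel_tol=1e-9):
    """Return True when the two cell sizes agree within a relative tolerance."""
    return math.isclose(abs(res_x), abs(res_y), rel_tol=rel_tol)

# Identical resolutions
square = has_square_pixels(10.0, 10.0)
# Resolutions that differ only by floating-point rounding error
nearly_square = has_square_pixels(0.1 + 0.2, 0.3)
# Genuinely rectangular pixels
not_square = has_square_pixels(10.0, 12.5)
```

The same idea could be applied in R with `isTRUE(all.equal(res_x, res_y))` in place of `res_x != res_y`.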


In [63]:
robjects.r('''
    library(dismo)

    # Function to train and save MaxEnt model with KDE bias
    train_maxent_with_bias <- function(train_data, species_name, bias_file_path) {
        predictor_cols <- colnames(train_data)[-length(colnames(train_data))]  # Exclude label
        
        # Define Maxent output directory
        output_dir <- paste0("C:/GIS_Course/MScThesis-MaviSantarelli/results/Maxent_Output/", species_name)
        dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)

        # Train MaxEnt model with bias file
        maxent_model <- maxent(
            x = train_data[, predictor_cols],
            p = train_data[, "label"],
            path = output_dir,
            args = c(paste0("biasfile=", bias_file_path), "autofeature=true", "randomtestpoints=20")
        )

        # Save the trained Maxent model
        saveRDS(maxent_model, paste0("C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Maxent_", species_name, "_with_KDE.rds"))
        
        print(paste("MaxEnt model with KDE bias trained and saved for", species_name))
    }

    # Train models with the resampled KDE bias rasters
    train_maxent_with_bias(bufo_train, "Bufo_bufo", bufo_bias_path)
    train_maxent_with_bias(rana_train, "Rana_temporaria", rana_bias_path)
    train_maxent_with_bias(lissotriton_train, "Lissotriton_helveticus", lissotriton_bias_path)
''')


[1] "MaxEnt model with KDE bias trained and saved for Bufo_bufo"
[1] "MaxEnt model with KDE bias trained and saved for Rana_temporaria"
[1] "MaxEnt model with KDE bias trained and saved for Lissotriton_helveticus"


## 3. Generate SDM Predictions
### 3.1 Load All Models for All Species
#### **GLM and GAM**

In [2]:
import joblib

# Define the species list
species_list = ["Bufo_bufo", "Rana_temporaria", "Lissotriton_helveticus"]


# Initialize dictionaries to store GLM and GAM models
glm_models_all_species = {}
gam_models_all_species = {}

# Load GLM and GAM models for all species
for species in species_list:
    print(f"Loading GLM and GAM models for {species}...")
    
    glm_model_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GLM/Models/{species}_GLM_Lasso_with_KDE.pkl"
    gam_model_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Final_GAM/{species}_GAM_Model_with_KDE.pkl"
    
    glm_models_all_species[species] = joblib.load(glm_model_path)
    gam_models_all_species[species] = joblib.load(gam_model_path)
    
    print(f"GLM and GAM models loaded for {species}.\n")


Loading GLM and GAM models for Bufo_bufo...
GLM and GAM models loaded for Bufo_bufo.

Loading GLM and GAM models for Rana_temporaria...
GLM and GAM models loaded for Rana_temporaria.

Loading GLM and GAM models for Lissotriton_helveticus...
GLM and GAM models loaded for Lissotriton_helveticus.



#### **RF and XGBoost**

In [3]:
import os
# Initialize dictionaries for RF and XGBoost models
rf_models_all_species = {}
xgb_models_all_species = {}

# Load RF and XGBoost models for all species
for species in species_list:
    print(f"Loading Random Forest and XGBoost models for {species}...")
    
    # Load Random Forest models
    rf_model_dir = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/RandomForest/{species}/"
    rf_models = [joblib.load(os.path.join(rf_model_dir, f"RandomForest_Model_run{i}_with_KDE.pkl")) for i in range(1, 11)]
    rf_models_all_species[species] = rf_models
    
    # Load XGBoost models
    xgb_model_dir = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/XGBoost/{species}/"
    xgb_models = [joblib.load(os.path.join(xgb_model_dir, f"XGBoost_Model_run{i}_with_KDE.pkl")) for i in range(1, 11)]
    xgb_models_all_species[species] = xgb_models
    
    print(f"Random Forest and XGBoost models loaded for {species}.\n")


Loading Random Forest and XGBoost models for Bufo_bufo...
Random Forest and XGBoost models loaded for Bufo_bufo.

Loading Random Forest and XGBoost models for Rana_temporaria...
Random Forest and XGBoost models loaded for Rana_temporaria.

Loading Random Forest and XGBoost models for Lissotriton_helveticus...
Random Forest and XGBoost models loaded for Lissotriton_helveticus.



#### **Maxent**

In [4]:
import rpy2.robjects as robjects

# Load Maxent models for all species in R
for species in species_list:
    maxent_model_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/results/Models/Maxent_{species}_with_KDE.rds"
    
    robjects.r(f'''
        library(dismo)
        {species}_maxent_model <- readRDS("{maxent_model_path}")
    ''')
    
    print(f"Maxent model loaded for {species} in R.")


R[write to console]: Loading required package: raster

R[write to console]: Loading required package: sp



Maxent model loaded for Bufo_bufo in R.
Maxent model loaded for Rana_temporaria in R.
Maxent model loaded for Lissotriton_helveticus in R.


### 3.2 Generate and Aggregate Predictions for All Models

#### **GLM and GAM**

In [44]:
# Flatten the predictors for model input
num_rows, num_cols, num_predictors = predictors_stack.shape
flat_predictors = predictors_stack.reshape((num_rows * num_cols, num_predictors))

# Mask out invalid (NaN) values
valid_mask = ~np.isnan(flat_predictors).any(axis=1)
flat_predictors_valid = flat_predictors[valid_mask]

# Create DataFrame with feature names
feature_names = [
    'Building_Density_Reversed', 'DistWater_Reversed', 'NOx_Stand_Reversed',
    'RGS_Reversed', 'Runoff_Coefficient_Reversed', 'Slope_Proj_Reversed',
    'SoilMoisture_Reversed', 'Traffic_Reversed', 'Wood_Resample_Reversed',
    'Grass_Stand', 'NDVI_median', 'NDVI_StDev', 'VegHeight'
]

import pandas as pd
flat_predictors_df = pd.DataFrame(flat_predictors_valid, columns=feature_names)
print("DataFrame for predictions created successfully.")


DataFrame for predictions created successfully.


In [45]:
# Rename columns to match the trained feature names
flat_predictors_df.rename(columns=feature_name_mapping, inplace=True)

# Verify the updated feature names
print("Updated feature names for prediction:", flat_predictors_df.columns)


Updated feature names for prediction: Index(['C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/RGS_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Slope_Proj_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/SoilMoisture_32bit_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Wood_Resample_Reversed.tif',


In [48]:
from rasterio.enums import Resampling

# Load a reference raster (first predictor) to match the KDE bias layer
reference_raster_path = predictor_files[0]

with rasterio.open(reference_raster_path) as ref_src:
    ref_profile = ref_src.profile
    ref_shape = ref_src.shape
    ref_transform = ref_src.transform

# Resample the KDE bias layer to match the reference raster
with rasterio.open(kde_bias_path) as src:
    if src.shape != ref_shape:
        print("Resampling KDE bias layer to match predictors...")
        kde_bias_resampled = src.read(
            out_shape=(src.count, ref_shape[0], ref_shape[1]),
            resampling=Resampling.bilinear
        )[0]  # Take the first band
    else:
        kde_bias_resampled = src.read(1)

# Flatten and mask out NaNs
kde_bias_array = kde_bias_resampled.flatten()
kde_bias_valid = kde_bias_array[valid_mask]

# Add the KDE bias to the predictors DataFrame
flat_predictors_df['kde_bias'] = kde_bias_valid


Resampling KDE bias layer to match predictors...


In [49]:
# Generate GLM predictions with the updated predictors
glm_predictions_all_species = {}

for species in species_list:
    glm_model = glm_models_all_species[species]
    glm_predictions_all_species[species] = glm_model.predict_proba(flat_predictors_df)[:, 1]

print("GLM predictions generated successfully for all species.")


GLM predictions generated successfully for all species.
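
The prediction arrays produced above are one-dimensional and cover only the pixels that passed `valid_mask`, so on their own they carry no spatial layout. The sketch below shows one way to scatter such values back onto the raster grid. It is a minimal illustration with toy data; the function name `predictions_to_grid` and the `float32` output type are assumptions.

```python
import numpy as np

def predictions_to_grid(flat_predictions, valid_mask, grid_shape, fill_value=np.nan):
    """Scatter 1-D predictions for valid pixels back onto a 2-D raster grid."""
    flat_predictions = np.asarray(flat_predictions)
    valid_mask = np.asarray(valid_mask, dtype=bool)
    if valid_mask.size != grid_shape[0] * grid_shape[1]:
        raise ValueError("valid_mask length must equal rows * cols of the grid")
    if flat_predictions.shape[0] != valid_mask.sum():
        raise ValueError("one prediction is required per valid pixel")
    # Start from a grid of fill values, then write predictions at valid positions
    grid = np.full(valid_mask.size, fill_value, dtype="float32")
    grid[valid_mask] = flat_predictions
    return grid.reshape(grid_shape)

# Toy 2 x 3 grid in which two pixels had missing predictor values
toy_mask = np.array([True, False, True, True, False, True])
toy_predictions = np.array([0.1, 0.4, 0.7, 0.9])
suitability_grid = predictions_to_grid(toy_predictions, toy_mask, (2, 3))
```

In this notebook the call would presumably look like `predictions_to_grid(glm_predictions_all_species[species], valid_mask, (num_rows, num_cols))`, provided `valid_mask` is the same mask that was used to build the prediction DataFrame.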


In [50]:
import os
import joblib

# Directory to save GLM predictions
glm_output_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions/GLM"
os.makedirs(glm_output_dir, exist_ok=True)


In [51]:
# Save GLM predictions for each species
for species in species_list:
    glm_save_path = os.path.join(glm_output_dir, f"{species}_GLM_Predictions.pkl")
    joblib.dump(glm_predictions_all_species[species], glm_save_path)
    print(f"GLM predictions saved for {species} at {glm_save_path}")


GLM predictions saved for Bufo_bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions/GLM\Bufo_bufo_GLM_Predictions.pkl
GLM predictions saved for Rana_temporaria at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions/GLM\Rana_temporaria_GLM_Predictions.pkl
GLM predictions saved for Lissotriton_helveticus at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions/GLM\Lissotriton_helveticus_GLM_Predictions.pkl


In [52]:
import pandas as pd

# Save GLM predictions as CSV
for species in species_list:
    glm_csv_path = os.path.join(glm_output_dir, f"{species}_GLM_Predictions.csv")
    pd.DataFrame(glm_predictions_all_species[species], columns=["Habitat_Suitability"]).to_csv(glm_csv_path, index=False)
    print(f"GLM predictions saved as CSV for {species} at {glm_csv_path}")


GLM predictions saved as CSV for Bufo_bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions/GLM\Bufo_bufo_GLM_Predictions.csv
GLM predictions saved as CSV for Rana_temporaria at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions/GLM\Rana_temporaria_GLM_Predictions.csv
GLM predictions saved as CSV for Lissotriton_helveticus at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions/GLM\Lissotriton_helveticus_GLM_Predictions.csv


#### **RF and XGBoost**

In [54]:
from rasterio.enums import Resampling

# Load a reference raster (first predictor) to match the KDE bias layer
reference_raster_path = predictor_files[0]

with rasterio.open(reference_raster_path) as ref_src:
    ref_shape = ref_src.shape

# Resample the KDE bias layer to match the reference raster
with rasterio.open(kde_bias_path) as src:
    kde_bias_resampled = src.read(
        out_shape=(src.count, ref_shape[0], ref_shape[1]),
        resampling=Resampling.bilinear
    )[0]

# Flatten and mask to align with predictors
kde_bias_array = kde_bias_resampled.flatten()
kde_bias_valid = kde_bias_array[valid_mask]

# Add the KDE bias to predictors
flat_predictors_df['kde_bias'] = kde_bias_valid

# (Optional) Save the resampled KDE bias for future use
joblib.dump(kde_bias_valid, "C:/GIS_Course/MScThesis-MaviSantarelli/results/bias_layers/Bufo_bufo_bias_layer_resampled.pkl")


['C:/GIS_Course/MScThesis-MaviSantarelli/results/bias_layers/Bufo_bufo_bias_layer_resampled.pkl']

In [55]:
import geopandas as gpd

# Load the study area shapefile
study_area_path = "C:/GIS_Course/MScThesis-MaviSantarelli/data/StudyArea/Study_Area.shp"
study_area_gdf = gpd.read_file(study_area_path)

# Print the CRS to confirm it matches the raster data (EPSG:27700, British National Grid, is expected)
print(f"Study Area CRS: {study_area_gdf.crs}")


Study Area CRS: EPSG:27700


In [56]:
# Map simplified names to original file paths (matching trained feature names)
feature_name_mapping = {
    'Building_Density_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif',
    'DistWater_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif',
    'NOx_Stand_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif',
    'RGS_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/RGS_Reversed.tif',
    'Runoff_Coefficient_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif',
    'Slope_Proj_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Slope_Proj_Reversed.tif',
    'SoilMoisture_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/SoilMoisture_32bit_Reversed.tif',
    'Traffic_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif',
    'Wood_Resample_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Wood_Resample_Reversed.tif',
    'Grass_Stand': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Grass_Stand.tif',
    'NDVI_median': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_median.tif',
    'NDVI_StDev': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_StDev.tif',
    'VegHeight': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/VegHeight.tif',
    'kde_bias': 'kde_bias'  # Ensure this matches the bias layer used during training
}


In [None]:
from rasterio.mask import mask
# Function to mask raster using the study area
def mask_raster_to_study_area(raster_path, study_area_gdf):
    with rasterio.open(raster_path) as src:
        # Reproject study area to match raster CRS if needed
        if study_area_gdf.crs != src.crs:
            study_area_gdf = study_area_gdf.to_crs(src.crs)
        
        # Apply mask
        masked_image, _ = mask(src, study_area_gdf.geometry, crop=True)
        return masked_image[0]  # First band

# Load and mask predictors
predictor_files = [
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/RGS_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Slope_Proj_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/SoilMoisture_32bit_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Wood_Resample_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Grass_Stand.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_median.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_StDev.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/VegHeight.tif"
]

# Mask all predictors
masked_predictor_arrays = [mask_raster_to_study_area(file, study_area_gdf) for file in predictor_files]
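
By default, `rasterio.mask.mask` fills pixels outside the shapes with the dataset's nodata value (or 0 when none is defined), not with NaN. The later `np.isnan` filter only removes those pixels if they are NaN. The sketch below shows one way to convert a nodata value to NaN before stacking; the helper name `nodata_to_nan` and the value `-9999.0` are assumptions for illustration.

```python
import numpy as np

def nodata_to_nan(band, nodata_value):
    """Return a float32 copy of `band` with `nodata_value` replaced by NaN."""
    # np.array copies its input, so the original band is left untouched
    values = np.array(band, dtype="float32")
    if nodata_value is not None and not np.isnan(nodata_value):
        values[values == nodata_value] = np.nan
    return values

# Toy band in which -9999.0 marks pixels outside the study area
toy_band = np.array([[1.5, -9999.0], [-9999.0, 3.0]])
cleaned = nodata_to_nan(toy_band, -9999.0)
valid_pixels = ~np.isnan(cleaned)
```

Inside `mask_raster_to_study_area`, the value to pass would be `src.nodata`, read while the dataset is still open.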

In [60]:
from skimage.transform import resize

# Reference shape from the first masked raster
ref_shape = masked_predictor_arrays[0].shape

# Resample all masked rasters to match the reference shape
masked_predictor_arrays_aligned = []
for array, file in zip(masked_predictor_arrays, predictor_files):
    if array.shape != ref_shape:
        print(f"Resampling {file} to match reference shape...")
        resampled_array = resize(array, ref_shape, mode='reflect', anti_aliasing=True)
        masked_predictor_arrays_aligned.append(resampled_array)
    else:
        masked_predictor_arrays_aligned.append(array)

# Now stack the aligned masked predictors
masked_predictors_stack = np.stack(masked_predictor_arrays_aligned, axis=-1)
print("Masked predictors successfully aligned and stacked.")


Resampling C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif to match reference shape...
Resampling C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_StDev.tif to match reference shape...
Masked predictors successfully aligned and stacked.


In [61]:
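# Load and mask the KDE bias layer (ensure it matches the aligned predictors)
# NOTE: only the Bufo bufo bias layer is loaded here, so the predictor table
# built below carries this layer for every species it is later used with.
kde_bias_path = "C:/GIS_Course/MScThesis-MaviSantarelli/results/bias_layers/Bufo_bufo_bias_layer.tif"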
masked_kde_bias = mask_raster_to_study_area(kde_bias_path, study_area_gdf)

# Resample the KDE bias to match the reference shape if needed
if masked_kde_bias.shape != ref_shape:
    masked_kde_bias = resize(masked_kde_bias, ref_shape, mode='reflect', anti_aliasing=True)

# Combine predictors with KDE bias
masked_predictors_with_bias = np.dstack((masked_predictors_stack, masked_kde_bias))

# Flatten and prepare for model predictions
num_rows, num_cols, num_predictors = masked_predictors_with_bias.shape
flat_masked_predictors = masked_predictors_with_bias.reshape((num_rows * num_cols, num_predictors))

# Mask out NaNs
valid_mask = ~np.isnan(flat_masked_predictors).any(axis=1)
flat_predictors_valid_masked = flat_masked_predictors[valid_mask]


In [62]:
import pandas as pd

# Define simplified feature names (including KDE bias)
feature_names = [
    'Building_Density_Reversed', 'DistWater_Reversed', 'NOx_Stand_Reversed',
    'RGS_Reversed', 'Runoff_Coefficient_Reversed', 'Slope_Proj_Reversed',
    'SoilMoisture_Reversed', 'Traffic_Reversed', 'Wood_Resample_Reversed',
    'Grass_Stand', 'NDVI_median', 'NDVI_StDev', 'VegHeight', 'kde_bias'
]

# Create DataFrame with aligned predictors
flat_predictors_df_masked = pd.DataFrame(flat_predictors_valid_masked, columns=feature_names)


In [63]:
# Mapping simplified names to original file paths (used during training)
feature_name_mapping = {
    'Building_Density_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif',
    'DistWater_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif',
    'NOx_Stand_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif',
    'RGS_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/RGS_Reversed.tif',
    'Runoff_Coefficient_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif',
    'Slope_Proj_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Slope_Proj_Reversed.tif',
    'SoilMoisture_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/SoilMoisture_32bit_Reversed.tif',
    'Traffic_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif',
    'Wood_Resample_Reversed': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Wood_Resample_Reversed.tif',
    'Grass_Stand': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Grass_Stand.tif',
    'NDVI_median': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_median.tif',
    'NDVI_StDev': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_StDev.tif',
    'VegHeight': 'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/VegHeight.tif',
    'kde_bias': 'kde_bias'  # Assuming this matches during training
}

# Rename columns to match feature names used during training
flat_predictors_df_masked.rename(columns=feature_name_mapping, inplace=True)

# Verify updated feature names
print("Updated feature names for prediction:", flat_predictors_df_masked.columns)


Updated feature names for prediction: Index(['C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/RGS_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Slope_Proj_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/SoilMoisture_32bit_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif',
       'C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Wood_Resample_Reversed.tif',
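
Renaming the columns only helps if the result matches, in name and order, what each model saw during training. If the pickled models are scikit-learn estimators fitted on a DataFrame, they expose the training names as `feature_names_in_`, which can be compared against the prediction columns. The sketch below shows such a comparison with toy names; the helper `compare_feature_names` and the example names are hypothetical.

```python
def compare_feature_names(expected_names, actual_names):
    """Return (missing, unexpected, same_order) for two sequences of column names."""
    expected = list(expected_names)
    actual = list(actual_names)
    # Names the model was trained on but that are absent at prediction time
    missing = [name for name in expected if name not in actual]
    # Names present at prediction time that the model never saw
    unexpected = [name for name in actual if name not in expected]
    same_order = expected == actual
    return missing, unexpected, same_order

# Hypothetical short names standing in for the long file-path column names
trained_on = ["Grass_Stand.tif", "NDVI_median.tif", "kde_bias"]
predicting_with = ["NDVI_median.tif", "Grass_Stand.tif", "kde"]

missing, unexpected, same_order = compare_feature_names(trained_on, predicting_with)
```

With the notebook's objects, the inputs would presumably be `model.feature_names_in_` and `flat_predictors_df_masked.columns`.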


In [64]:
rf_predictions_all_species = {}
xgb_predictions_all_species = {}

for species in species_list:
    print(f"Generating RF and XGBoost predictions for {species} using masked and renamed predictors...")

    # Random Forest predictions
    rf_models = rf_models_all_species[species]
    rf_predictions_list = [model.predict_proba(flat_predictors_df_masked)[:, 1] for model in rf_models]
    rf_predictions_all_species[species] = np.mean(rf_predictions_list, axis=0)

    # XGBoost predictions
    xgb_models = xgb_models_all_species[species]
    xgb_predictions_list = [model.predict_proba(flat_predictors_df_masked)[:, 1] for model in xgb_models]
    xgb_predictions_all_species[species] = np.mean(xgb_predictions_list, axis=0)

    print(f"Predictions generated for {species}.\n")


Generating RF and XGBoost predictions for Bufo_bufo using masked and renamed predictors...
Predictions generated for Bufo_bufo.

Generating RF and XGBoost predictions for Rana_temporaria using masked and renamed predictors...


MemoryError: Unable to allocate 580. MiB for an array with shape (38029299, 1, 2) and data type float64

In [65]:
import os
import joblib

# Directory to save *Bufo bufo* predictions
bufo_predictions_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions/Bufo_bufo"
os.makedirs(bufo_predictions_dir, exist_ok=True)


In [66]:
# Save Random Forest predictions
rf_bufo_save_path = os.path.join(bufo_predictions_dir, "Bufo_bufo_RF_Predictions_Masked.pkl")
joblib.dump(rf_predictions_all_species["Bufo_bufo"], rf_bufo_save_path)
print(f"Random Forest predictions saved for Bufo bufo at {rf_bufo_save_path}")

# Save XGBoost predictions
xgb_bufo_save_path = os.path.join(bufo_predictions_dir, "Bufo_bufo_XGBoost_Predictions_Masked.pkl")
joblib.dump(xgb_predictions_all_species["Bufo_bufo"], xgb_bufo_save_path)
print(f"XGBoost predictions saved for Bufo bufo at {xgb_bufo_save_path}")


Random Forest predictions saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions/Bufo_bufo\Bufo_bufo_RF_Predictions_Masked.pkl
XGBoost predictions saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions/Bufo_bufo\Bufo_bufo_XGBoost_Predictions_Masked.pkl


In [67]:
import pandas as pd

# Save Random Forest predictions as CSV
rf_bufo_csv_path = os.path.join(bufo_predictions_dir, "Bufo_bufo_RF_Predictions_Masked.csv")
pd.DataFrame(rf_predictions_all_species["Bufo_bufo"], columns=["Habitat_Suitability"]).to_csv(rf_bufo_csv_path, index=False)
print(f"Random Forest predictions saved as CSV for Bufo bufo at {rf_bufo_csv_path}")

# Save XGBoost predictions as CSV
xgb_bufo_csv_path = os.path.join(bufo_predictions_dir, "Bufo_bufo_XGBoost_Predictions_Masked.csv")
pd.DataFrame(xgb_predictions_all_species["Bufo_bufo"], columns=["Habitat_Suitability"]).to_csv(xgb_bufo_csv_path, index=False)
print(f"XGBoost predictions saved as CSV for Bufo bufo at {xgb_bufo_csv_path}")


Random Forest predictions saved as CSV for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions/Bufo_bufo\Bufo_bufo_RF_Predictions_Masked.csv
XGBoost predictions saved as CSV for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions/Bufo_bufo\Bufo_bufo_XGBoost_Predictions_Masked.csv


In [68]:
# Predict in row batches to limit peak memory use
def batch_predict(model, data, batch_size=500000):
    """Return positive-class probabilities for `data`, computed `batch_size` rows at a time."""
    predictions = []
    total_samples = data.shape[0]
    
    for start in range(0, total_samples, batch_size):
        end = min(start + batch_size, total_samples)
        batch = data[start:end]  # Positional row slice (works for DataFrames and arrays)
        predictions.append(model.predict_proba(batch)[:, 1])
    
    return np.concatenate(predictions)
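
Batching should change only peak memory use, not the predictions themselves, because each row is scored independently. The sketch below checks this on synthetic data. `ToyClassifier` is a hypothetical stand-in for a fitted model, and `batch_predict_sketch` repeats the logic of the function above so that the example runs on its own.

```python
import numpy as np

class ToyClassifier:
    """Hypothetical stand-in for a fitted model exposing predict_proba."""

    def predict_proba(self, features):
        features = np.asarray(features, dtype=float)
        # Logistic function of the row sums gives a positive-class probability
        positive = 1.0 / (1.0 + np.exp(-features.sum(axis=1)))
        return np.column_stack([1.0 - positive, positive])

def batch_predict_sketch(model, data, batch_size=500000):
    """Predict positive-class probabilities in row chunks of `batch_size`."""
    chunks = []
    for start in range(0, data.shape[0], batch_size):
        chunks.append(model.predict_proba(data[start:start + batch_size])[:, 1])
    return np.concatenate(chunks)

rng = np.random.default_rng(0)
toy_features = rng.normal(size=(1000, 3))
toy_model = ToyClassifier()

# Score everything at once, then in batches of 128 rows, and compare
full = toy_model.predict_proba(toy_features)[:, 1]
batched = batch_predict_sketch(toy_model, toy_features, batch_size=128)
batches_match = np.allclose(full, batched)
```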


In [69]:
rf_predictions_all_species = {}
xgb_predictions_all_species = {}

for species in species_list:
    if species == "Bufo_bufo":
        print(f"Skipping predictions for {species} (already saved).")
        continue

    print(f"Generating RF and XGBoost predictions for {species} using masked and renamed predictors...")

    # Random Forest predictions with batch processing
    rf_models = rf_models_all_species[species]
    rf_predictions_list = [batch_predict(model, flat_predictors_df_masked) for model in rf_models]
    rf_predictions_all_species[species] = np.mean(rf_predictions_list, axis=0)

    # XGBoost predictions with batch processing
    xgb_models = xgb_models_all_species[species]
    xgb_predictions_list = [batch_predict(model, flat_predictors_df_masked) for model in xgb_models]
    xgb_predictions_all_species[species] = np.mean(xgb_predictions_list, axis=0)

    print(f"Predictions generated for {species}.\n")


Skipping predictions for Bufo_bufo (already saved).
Generating RF and XGBoost predictions for Rana_temporaria using masked and renamed predictors...
Predictions generated for Rana_temporaria.

Generating RF and XGBoost predictions for Lissotriton_helveticus using masked and renamed predictors...
Predictions generated for Lissotriton_helveticus.
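
The cells above collect all ten per-run prediction arrays in a list before averaging. With the 38,029,299 rows reported in the `MemoryError`, each float64 array takes about 290 MiB, so ten of them occupy close to 3 GiB at once. One possible way to reduce this is to accumulate a running sum instead, as sketched below; the helper name `running_mean_predictions` is an assumption.

```python
import numpy as np

def running_mean_predictions(prediction_arrays):
    """Average equally shaped arrays one at a time instead of stacking them."""
    total = None
    count = 0
    for prediction in prediction_arrays:
        prediction = np.asarray(prediction, dtype=np.float64)
        if total is None:
            total = prediction.copy()
        else:
            total += prediction  # In-place add keeps only one accumulator alive
        count += 1
    if count == 0:
        raise ValueError("at least one prediction array is required")
    total /= count
    return total

# Toy example: three "runs" over four pixels, produced lazily by a generator
toy_runs = (np.full(4, value) for value in (0.2, 0.4, 0.9))
ensemble_mean = running_mean_predictions(toy_runs)
```

In the notebook, the generator could be `(batch_predict(model, flat_predictors_df_masked) for model in rf_models)`, so that only one run's predictions and the accumulator are held in memory at a time.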



In [70]:
import os
import joblib

# Directory to save predictions
predictions_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions"

# Ensure the directory exists
os.makedirs(predictions_dir, exist_ok=True)


In [71]:
# Save predictions for each species
for species in ["Rana_temporaria", "Lissotriton_helveticus"]:
    species_dir = os.path.join(predictions_dir, species)
    os.makedirs(species_dir, exist_ok=True)

    # Save Random Forest predictions
    rf_save_path = os.path.join(species_dir, f"{species}_RF_Predictions_Masked.pkl")
    joblib.dump(rf_predictions_all_species[species], rf_save_path)
    print(f"Random Forest predictions saved for {species} at {rf_save_path}")

    # Save XGBoost predictions
    xgb_save_path = os.path.join(species_dir, f"{species}_XGBoost_Predictions_Masked.pkl")
    joblib.dump(xgb_predictions_all_species[species], xgb_save_path)
    print(f"XGBoost predictions saved for {species} at {xgb_save_path}")


Random Forest predictions saved for Rana_temporaria at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions\Rana_temporaria\Rana_temporaria_RF_Predictions_Masked.pkl
XGBoost predictions saved for Rana_temporaria at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions\Rana_temporaria\Rana_temporaria_XGBoost_Predictions_Masked.pkl
Random Forest predictions saved for Lissotriton_helveticus at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions\Lissotriton_helveticus\Lissotriton_helveticus_RF_Predictions_Masked.pkl
XGBoost predictions saved for Lissotriton_helveticus at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions\Lissotriton_helveticus\Lissotriton_helveticus_XGBoost_Predictions_Masked.pkl


In [None]:
import pandas as pd

# Save predictions as CSV for each species
for species in ["Rana_temporaria", "Lissotriton_helveticus"]:
    species_dir = os.path.join(predictions_dir, species)

    # Save Random Forest predictions as CSV
    rf_csv_path = os.path.join(species_dir, f"{species}_RF_Predictions_Masked.csv")
    pd.DataFrame(rf_predictions_all_species[species], columns=["Habitat_Suitability"]).to_csv(rf_csv_path, index=False)
    print(f"Random Forest predictions saved as CSV for {species} at {rf_csv_path}")

    # Save XGBoost predictions as CSV
    xgb_csv_path = os.path.join(species_dir, f"{species}_XGBoost_Predictions_Masked.csv")
    pd.DataFrame(xgb_predictions_all_species[species], columns=["Habitat_Suitability"]).to_csv(xgb_csv_path, index=False)
    print(f"XGBoost predictions saved as CSV for {species} at {xgb_csv_path}")


Random Forest predictions saved as CSV for Rana_temporaria at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions\Rana_temporaria\Rana_temporaria_RF_Predictions_Masked.csv
XGBoost predictions saved as CSV for Rana_temporaria at C:/GIS_Course/MScThesis-MaviSantarelli/results/Model_Predictions\Rana_temporaria\Rana_temporaria_XGBoost_Predictions_Masked.csv


### References

Fithian, W., Elith, J., Hastie, T., & Keith, D. A. (2015). Bias correction in species distribution models: pooling survey and collection data for multiple species. *Methods in Ecology and Evolution*, 6(4), 424–438. https://doi.org/10.1111/2041-210X.12242

Phillips, S. J., Anderson, R. P., & Schapire, R. E. (2006). Maximum entropy modelling of species geographic distributions. *Ecological Modelling*, 190(3–4), 231–259. https://doi.org/10.1016/j.ecolmodel.2005.03.026

Inman, R., Franklin, J., Esque, T., & Nussear, K. (2021). Comparing sample bias correction methods for species distribution modeling using virtual species. *Ecosphere*, 12(3). https://doi.org/10.1002/ecs2.3422

Robinson, O. J., Ruiz-Gutierrez, V., & Fink, D. (2017). Correcting for bias in distribution modeling for rare species using citizen science data. *Diversity and Distributions*, 23(1), 1–12. https://doi.org/10.1111/ddi.12698
