# Setting Up Model Training

## Table of Contents
1. [Introduction](#Introduction)
2. [Load Occurrence and Pseudo-Absence Data](#1.-Load-Occurrence-and-Pseudo-Absence-Data)
3. [Reproject Species Data](#2.-Reproject-Species-Data)
4. [Load Predictors Data](#3.-Load-Predictors-Data)
5. [Integrate Predictors with Occurrence Data](#4.-Integrate-Predictors-with-Occurrence-Data)
6. [Validate Extracted Occurrence Data](#5.-Validate-Extracted-Occurrence-Data)
7. [Handling Missing Predictor Data Across Occurrences](#6.-Handling-Missing-Predictor-Data-Across-Occurrences)
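8. [Integrate Predictors with Pseudo-Absences](#7.-Integrate-Predictors-with-Pseudo-Absences)
9. [Validate Extracted Pseudo-Absence Data](#8.-Validate-Extracted-Pseudo-Absence-Data)
10. [Handling Missing Predictor Data Across Pseudo-Absences](#9.-Handling-Missing-Predictor-Data-Across-Pseudo-Absences)
11. [References](#References)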

## Introduction
This notebook focuses on the initial phase of species distribution modelling (SDM): setting up the model training environment. At this stage, we integrate presence, pseudo-absence, and environmental predictor data to create the datasets required for model training and evaluation. This process ensures that the models are built on robust and ecologically valid data, enabling reliable habitat suitability predictions.

## **Objectives**
- Prepare and combine datasets containing species presence, pseudo-absence, and predictor values.
- Split the data into training and testing subsets for model training and validation.
- Implement k-fold cross-validation to ensure robust model evaluation.
- Explore the data to verify its quality and suitability for modelling.

## **Relevance to the Study**
Training accurate and reliable models is essential for predicting amphibian habitat suitability. By preparing the data carefully, this stage reduces the likelihood of errors and ensures the outputs reflect ecological realities. Integrating pseudo-absence data tailored to species-specific traits and standardised predictors provides a strong foundation for model performance and interpretability.

This notebook establishes the groundwork for:
- Running individual SDMs using selected algorithms.
- Generating reliable predictions for ensemble modelling.
- Creating habitat suitability maps that support biodiversity conservation in central Scotland.

## **Key Deliverables**
1. Combined and preprocessed dataset for model training.
2. Training and testing subsets to validate model predictions.
3. Exploratory data analysis (EDA) results to ensure data quality.
4. A k-fold cross-validation pipeline for robust evaluation.

By completing this notebook, we set the stage for accurate, reproducible modelling in subsequent phases of the study.

---


## 1. Load Occurrence and Pseudo-Absence Data

In [8]:
import geopandas as gpd

# File paths
occurrence_files = {
    "Rana_temporaria": "C:/GIS_Course/MScThesis-MaviSantarelli/data/OccurrenceData/OccurrenceDataperSpecies/Rana_temporaria.shp",
    "Bufo_bufo": "C:/GIS_Course/MScThesis-MaviSantarelli/data/OccurrenceData/OccurrenceDataperSpecies/Bufo_bufo.shp",
    "Lissotriton_helveticus": "C:/GIS_Course/MScThesis-MaviSantarelli/data/OccurrenceData/OccurrenceDataperSpecies/Lissotriton_helveticus.shp"
}

pseudo_absence_files = {
    "Rana_temporaria": "C:/GIS_Course/MScThesis-MaviSantarelli/data/PseudoAbsences/Rana_temporaria_pseudo_absences.shp",
    "Bufo_bufo": "C:/GIS_Course/MScThesis-MaviSantarelli/data/PseudoAbsences/Bufo_bufo_pseudo_absences_recalculated.shp",
    "Lissotriton_helveticus": "C:/GIS_Course/MScThesis-MaviSantarelli/data/PseudoAbsences/Lissotriton_helveticus_pseudo_absences.shp"
}

# Load occurrence data
occurrence_data = {species: gpd.read_file(path) for species, path in occurrence_files.items()}

# Load pseudo-absence data
pseudo_absence_data = {species: gpd.read_file(path) for species, path in pseudo_absence_files.items()}
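
The occurrence paths above share one directory and one naming pattern, so they could be generated rather than typed out. The sketch below is one possible way to do that and to check that the files exist before loading; `DATA_DIR`, `build_paths`, and `missing_files` are hypothetical names introduced here, not part of the original notebook. The pseudo-absence files do not all follow a single pattern (the *Bufo bufo* file carries a `_recalculated` suffix), so those paths are probably best kept explicit.

```python
from pathlib import Path

# Hypothetical project root; adjust to wherever the thesis data lives.
DATA_DIR = Path("C:/GIS_Course/MScThesis-MaviSantarelli/data")

SPECIES = ["Rana_temporaria", "Bufo_bufo", "Lissotriton_helveticus"]


def build_paths(data_dir, species, subdir, suffix=""):
    """Map each species name to <data_dir>/<subdir>/<species><suffix>.shp."""
    return {name: data_dir / subdir / f"{name}{suffix}.shp" for name in species}


def missing_files(paths):
    """Return the species whose file does not exist on disk."""
    return [name for name, path in paths.items() if not Path(path).exists()]


occurrence_paths = build_paths(
    DATA_DIR, SPECIES, "OccurrenceData/OccurrenceDataperSpecies"
)
# An empty list means every shapefile was found.
print(missing_files(occurrence_paths))
```

Running the existence check first gives a clearer message than a failed `gpd.read_file` call partway through the dictionary comprehension.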


## 2. Reproject Species Data

In [4]:
# Reproject all species to match raster CRS (EPSG:27700)
for species in occurrence_data:
    occurrence_data[species] = occurrence_data[species].to_crs("EPSG:27700")

# Verify CRS after reprojection
for species, gdf in occurrence_data.items():
    print(f"{species} Reprojected CRS: {gdf.crs}")


Rana_temporaria Reprojected CRS: EPSG:27700
Bufo_bufo Reprojected CRS: EPSG:27700
Lissotriton_helveticus Reprojected CRS: EPSG:27700


In [5]:
# Reproject pseudo-absence data to match raster CRS (EPSG:27700)
for species in pseudo_absence_data:
    pseudo_absence_data[species] = pseudo_absence_data[species].to_crs("EPSG:27700")

# Verify CRS after reprojection
for species, gdf in pseudo_absence_data.items():
    print(f"{species} Pseudo-Absences Reprojected CRS: {gdf.crs}")


Rana_temporaria Pseudo-Absences Reprojected CRS: EPSG:27700
Bufo_bufo Pseudo-Absences Reprojected CRS: EPSG:27700
Lissotriton_helveticus Pseudo-Absences Reprojected CRS: EPSG:27700


## 3. Load Predictors Data

In [6]:
import rasterio
from rasterio.plot import show

# File paths for reversed predictors
reversed_predictor_files = [
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/RGS_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Slope_Proj_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/SoilMoisture_32bit_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Wood_Resample_Reversed.tif"
]

# File paths for non-reversed predictors
additional_predictor_files = [
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Grass_Stand.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_median.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/NDVI_StDev.tif",
    "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/VegHeight.tif"
]

# Load reversed predictors
reversed_predictors = [rasterio.open(file) for file in reversed_predictor_files]

# Load additional predictors
additional_predictors = [rasterio.open(file) for file in additional_predictor_files]

all_predictors = reversed_predictors + additional_predictors

## 4. Integrate Predictors with Occurrence Data

In [6]:
import geopandas as gpd
import pandas as pd
import rasterio

# Function to extract predictor values
def extract_predictor_values(occurrences, predictors):
    coords = [(x, y) for x, y in zip(occurrences.geometry.x, occurrences.geometry.y)]
    values = {raster.name: [val[0] for val in raster.sample(coords)] for raster in predictors}
    return pd.DataFrame(values)

# Extract predictor values for all species
predictor_values = {}
for species, data in occurrence_data.items():
    print(f"Extracting predictor values for {species}...")
    predictor_values[species] = extract_predictor_values(data, all_predictors)

# Save extracted predictor values to CSV
for species, df in predictor_values.items():
    output_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/{species}_predictor_values.csv"
    df.to_csv(output_path, index=False)
    print(f"Saved predictor values for {species} to {output_path}")

Extracting predictor values for Rana_temporaria...
Extracting predictor values for Bufo_bufo...
Extracting predictor values for Lissotriton_helveticus...
Saved predictor values for Rana_temporaria to C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Rana_temporaria_predictor_values.csv
Saved predictor values for Bufo_bufo to C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Bufo_bufo_predictor_values.csv
Saved predictor values for Lissotriton_helveticus to C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Lissotriton_helveticus_predictor_values.csv


---
### **What the Script Did**
This script performed the following tasks:
1. **Loaded Data**:
   - Occurrence data for three amphibian species (*Rana temporaria*, *Bufo bufo*, and *Lissotriton helveticus*).
   - Predictor raster files, including reversed and additional predictors.

2. **Extracted Predictor Values**:
   - Extracted predictor values at the spatial locations of occurrence points for each species.
   - Combined the predictor values into a tabular format, where each row represents a spatial point and each column corresponds to a predictor.

3. **Saved Extracted Values**:
   - Created one CSV file per species containing the predictor values extracted at its occurrence points. The files hold predictor columns only; point coordinates and other occurrence attributes are not written.
   - Files were saved to:
     - `Rana_temporaria_predictor_values.csv`
     - `Bufo_bufo_predictor_values.csv`
     - `Lissotriton_helveticus_predictor_values.csv`

### **Next Steps**

#### **1. Validate Extracted Data**
- Open and inspect the saved CSV files to ensure:
  - All spatial points have associated predictor values.
  - Column names align with predictor file names.
  - There are no missing or erroneous values.
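
The checks listed above can be sketched in a few lines of pandas. This is a minimal illustration on a toy table, assuming the CSV columns are labelled with full raster paths (as in the outputs shown in this notebook) and that `-9999` is the NoData flag; `short_names` and `validation_report` are hypothetical helper names.

```python
from pathlib import PureWindowsPath

import pandas as pd

NODATA = -9999  # assumed NoData flag for the predictor rasters


def short_names(df):
    """Rename full-path column labels to the raster file stem."""
    return df.rename(columns=lambda c: PureWindowsPath(c).stem)


def validation_report(df, nodata=NODATA):
    """Count NaN and NoData-flag cells per predictor column."""
    return pd.DataFrame({
        "n_nan": df.isna().sum(),
        "n_nodata": (df == nodata).sum(),
    })


# Toy table standing in for one of the saved CSV files.
example = pd.DataFrame({
    "C:/data/Predictors/Traffic_Reversed.tif": [0.2, None, -9999.0],
    "C:/data/Predictors/NDVI_median.tif": [0.5, 0.6, 0.7],
})
report = validation_report(short_names(example))
print(report)
```

Shortening the column labels also makes `describe()` output far easier to read than the full-path headers.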

#### **2. Prepare Training and Testing Datasets**
- Split the data for each species into training and testing subsets (e.g., 70% training, 30% testing).
- Save the resulting subsets for reproducibility.
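
One way the 70/30 split could look is sketched below. It assumes presences and pseudo-absences have already been combined into one table with a hypothetical `presence` column (1 for presence, 0 for pseudo-absence), and it stratifies on that column so both subsets keep the same class ratio. The fixed `seed` is what makes the split reproducible.

```python
import pandas as pd


def stratified_split(df, label_col="presence", test_fraction=0.3, seed=42):
    """Split df into train/test while keeping the class ratio in both parts.

    Assumes df has a unique index so sampled rows can be dropped by label.
    """
    test = df.groupby(label_col).sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)
    return train, test


# Toy data: 100 presences (1) and 200 pseudo-absences (0).
toy = pd.DataFrame({
    "predictor": range(300),
    "presence": [1] * 100 + [0] * 200,
})
train, test = stratified_split(toy)
print(len(train), len(test))
print(test["presence"].value_counts().to_dict())
```

Writing `train` and `test` to CSV with `to_csv(..., index=False)` afterwards would cover the reproducibility point above.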

#### **3. Perform Exploratory Data Analysis (EDA)**
- Explore the distribution of predictor values across the data.
- Check for potential issues such as:
  - Multicollinearity between predictors.
  - Outliers or irregularities in the data.
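
A simple first pass at the multicollinearity check is to list predictor pairs whose pairwise Pearson correlation is high. The sketch below uses synthetic data and a cut-off of 0.7, which is an assumed rule-of-thumb value rather than one fixed by this study; `correlated_pairs` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.7):
    """Return (col_a, col_b, r) for predictor pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Synthetic predictors: "runoff" is built to be nearly collinear with "slope".
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "slope": x,
    "runoff": 2 * x + rng.normal(scale=0.1, size=200),
    "ndvi": rng.normal(size=200),  # independent of the other two
})
flagged = correlated_pairs(toy)
print(flagged)
```

Pairwise correlation only catches two-variable relationships; variance inflation factors would be a natural follow-up if several predictors are jointly redundant.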

#### **4. Proceed to Model Training**
- Use the prepared datasets to train individual species distribution models.
- Implement cross-validation to ensure robust evaluation.
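
The cross-validation step could be organised around a fold generator like the one sketched here. It is a plain random k-fold written with NumPy only; `kfold_indices` is a hypothetical name, and `k=5` is an assumed default. Random folds ignore spatial clustering of records, so a spatially blocked scheme may be worth considering if evaluation scores look optimistic.

```python
import numpy as np


def kfold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, test_idx) arrays for k shuffled, non-overlapping folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


# 337 matches the Lissotriton helveticus record count reported in this notebook.
for fold, (train_idx, test_idx) in enumerate(kfold_indices(337, k=5)):
    print(fold, len(train_idx), len(test_idx))
```

Each record lands in exactly one test fold, so every occurrence contributes to evaluation once.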

By completing these next steps, we will transition from data preparation to modelling and evaluation.

---


## 5. Validate Extracted Occurrence Data
### Load Extracted CSV Files

In [17]:
import pandas as pd

# Load CSV files
rana_data = pd.read_csv("C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Rana_temporaria_cleaned.csv")
bufo_data = pd.read_csv("C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Bufo_bufo_cleaned.csv")
lissotriton_data = pd.read_csv("C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Lissotriton_helveticus_cleaned.csv")

### Inspect Data


1. Inspect general statistics for each predictor:

In [19]:
print(rana_data.describe())
print(bufo_data.describe())
print(lissotriton_data.describe())

       C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif  \
count                                        1110.000000                                                     
mean                                            0.942304                                                     
std                                             0.107376                                                     
min                                             0.423974                                                     
25%                                             0.930551                                                     
50%                                             1.000000                                                     
75%                                             1.000000                                                     
max                                             1.000000                                                     

       C:

2. Check for missing values:

In [20]:
# Example for Rana temporaria
missing_count = rana_data.isnull().sum()
total_points = len(rana_data)
missing_proportion = missing_count / total_points * 100
print(missing_proportion)

C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif                   0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif                          0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif                          0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/RGS_Reversed.tif                                0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif    0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Slope_Proj_Reversed.tif                         0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/SoilMoisture_32bit_Reversed.tif                 0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif                     

In [21]:
# Same check for Bufo bufo
missing_count = bufo_data.isnull().sum()
total_points = len(bufo_data)
missing_proportion = missing_count / total_points * 100
print(missing_proportion)

C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif                   0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif                          0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif                          0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/RGS_Reversed.tif                                0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif    0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Slope_Proj_Reversed.tif                         0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/SoilMoisture_32bit_Reversed.tif                 0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif                     

In [22]:
# Same check for Lissotriton helveticus
missing_count = lissotriton_data.isnull().sum()
total_points = len(lissotriton_data)
missing_proportion = missing_count / total_points * 100
print(missing_proportion)

C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif                   0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif                          0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif                          0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/RGS_Reversed.tif                                0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif    0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Slope_Proj_Reversed.tif                         0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/SoilMoisture_32bit_Reversed.tif                 0.000000
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif                     

---

## **Issue Report: Handling Missing Data in Integrated Predictors**

During the integration of species occurrence data with environmental predictors for habitat suitability modelling, an issue was identified with missing predictor values in the integrated dataset. These missing values were initially flagged as `-9999` in the raster predictors, representing NoData values in the original rasters. Masking these values during data processing introduced a small proportion of missing data for some occurrence points.

## **Observed Missing Data**
The missing values were observed across specific predictors and species datasets. The following summarises the missing data proportions for each predictor and species:

### **Rana temporaria**
- `Traffic_Reversed`: 0.54% missing.
- All other predictors: 0% missing.

### **Bufo bufo**
- `Traffic_Reversed`: 1.4% missing.
- `NDVI_median`: 0.7% missing.
- `NDVI_StDev`: 0.56% missing.
- All other predictors: 0% missing.

### **Lissotriton helveticus**
- `Traffic_Reversed`: 0.59% missing.
- `NDVI_median`: 0.3% missing.
- `NDVI_StDev`: 0.3% missing.
- All other predictors: 0% missing.

The missing values are believed to result from species occurrence points located near the edges of raster datasets or in areas where rasters do not fully overlap.
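
The per-species percentages listed above can be gathered into a single table rather than three separate printouts. The following sketch shows the idea on toy data; `missing_percent_table` is a hypothetical helper, and the toy frames merely stand in for `rana_data`, `bufo_data`, and `lissotriton_data`.

```python
import pandas as pd


def missing_percent_table(datasets):
    """Return a predictors x species table of the percentage of NaN values."""
    return pd.DataFrame({
        species: df.isna().mean() * 100 for species, df in datasets.items()
    }).round(2)


# Toy stand-ins for the three species tables.
toy = {
    "Rana temporaria": pd.DataFrame({
        "Traffic_Reversed": [0.1, None, 0.3, 0.4],
        "NDVI_median": [0.5, 0.6, 0.7, 0.8],
    }),
    "Bufo bufo": pd.DataFrame({
        "Traffic_Reversed": [0.1, 0.2],
        "NDVI_median": [None, 0.6],
    }),
}
table = missing_percent_table(toy)
print(table)
```

A table in this shape makes it easy to see which predictors are affected in more than one species.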

---

## **Options for Addressing Missing Data**

### **1. Retain Missing Data**
- **Approach:** Leave missing values (`NaN`) in the dataset and use modelling algorithms capable of handling missing data.
- **Benefits:**
  - Preserves the full set of occurrence points.
  - Suitable for machine learning algorithms such as XGBoost that natively support missing values.
- **Drawbacks:**
  - Potential for inconsistent predictor behaviour during modelling.
  - Possible bias if missing data is spatially or environmentally clustered.


### **2. Remove Rows with Missing Values**
- **Approach:** Exclude occurrence points with missing values from the dataset.
- **Benefits:**
  - Ensures a clean and complete dataset for modelling and statistical analysis.
  - Simplifies the modelling process by eliminating the need to handle missing values.
- **Drawbacks:**
  - Reduces sample size, potentially excluding valuable data.
  - Risk of introducing spatial or environmental bias if missing data is concentrated in specific areas.


### **3. Impute Missing Values**
- **Approach:** Replace missing values with derived values, such as column means, medians, or through spatial interpolation.
- **Benefits:**
  - Retains all occurrence points in the dataset.
  - Produces a complete dataset for use in modelling and analysis.
- **Drawbacks:**
  - Introduces synthetic data that may not accurately reflect real-world conditions.
  - May skew results, especially if missing values are spatially or environmentally biased.
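
To make the trade-offs concrete, the three options can be applied side by side to a small table. This is an illustrative sketch on made-up values, not output from the study data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Traffic_Reversed": [0.2, np.nan, 0.6, 1.0],
    "NDVI_median": [0.4, 0.5, np.nan, 0.9],
})

# Option 1: retain NaN and leave handling to the model.
retained = toy.copy()

# Option 2: drop every row that has at least one missing predictor.
dropped = toy.dropna()

# Option 3: impute each column with its own mean (swap in .median() if preferred).
imputed = toy.fillna(toy.mean(numeric_only=True))

for name, frame in [("retain", retained), ("remove", dropped), ("impute", imputed)]:
    print(name, "rows:", len(frame), "missing cells:", int(frame.isna().sum().sum()))
```

Note how row removal discards a record for each missing cell even though most of that record's predictors are valid, which is the sample-size cost described under option 2.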

---



## 6. Handling Missing Predictor Data Across Occurrences

### **Decision**
A consistent data processing methodology will be adopted for handling missing predictor values across all species in the study (*Bufo bufo*, *Rana temporaria*, and *Lissotriton helveticus*). Missing data will be addressed using **imputation methods**, ensuring uniformity in preprocessing steps for all species.

### **Rationale**

1. **Ensuring Comparability**:
   - A uniform approach allows for direct comparison of model outputs across species.
   - Differing methods could introduce inconsistencies, confounding interpretation of results.

2. **Data Retention**:
   - *Lissotriton helveticus* has a smaller dataset (337 occurrences), and removing records with missing values could exacerbate data scarcity.
   - Imputation preserves all occurrence points, maximising the available data for model training.

3. **Minimising Bias**:
   - Removing rows can bias the sample if missing values cluster in particular areas; imputation (e.g., mean or regression imputation) avoids that loss of records.
   - Imputation is most defensible when the proportion of missing values is small (here under 1.5% for every predictor) and the data can be assumed Missing at Random (MAR).

4. **Flexibility for Adjustment**:
   - If issues arise during model evaluation (e.g., poor performance for *Bufo bufo* due to its larger dataset), alternative approaches (e.g., row removal) can be explored post hoc and documented.

### **Supporting Evidence**
The decision to use imputation is supported by academic literature on data preprocessing for ecological modelling:
- **Nakagawa & Freckleton (2008)** discuss the importance of addressing missing data in ecological datasets and highlight imputation as a robust method for retaining data integrity.
- **Sillero & Barbosa (2021)** recommend imputation methods to avoid unnecessary data loss, particularly in species distribution modelling, where datasets are often limited.
- **Stekhoven & Bühlmann (2012)** introduce MissForest, a random forest-based imputation method for mixed-type data, and report that it compares favourably with alternatives such as k-nearest neighbours imputation.

### **Methodology**
- **Imputation Technique**: Missing values will be replaced using mean imputation or advanced methods if necessary (e.g., k-nearest neighbours or regression imputation).
- **Implementation**: The chosen method will be applied consistently to all species' predictor datasets.
- **Validation**: Predictor distributions will be checked post-imputation to ensure ecological validity.
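
The post-imputation validation could start with a comparison of each predictor's mean and standard deviation before and after filling. The sketch below uses a toy column; `imputation_shift` is a hypothetical helper. Mean imputation leaves the column mean unchanged but shrinks the standard deviation, so the size of that shrinkage is a quick indicator of how much the distribution was altered.

```python
import numpy as np
import pandas as pd


def imputation_shift(before, after):
    """Compare per-column mean and standard deviation before and after imputation."""
    return pd.DataFrame({
        "mean_before": before.mean(),
        "mean_after": after.mean(),
        "std_before": before.std(),
        "std_after": after.std(),
        "n_imputed": before.isna().sum(),
    })


before = pd.DataFrame({"Traffic_Reversed": [0.2, 0.4, np.nan, 0.8, 1.0, np.nan]})
after = before.fillna(before.mean())
shift = imputation_shift(before, after)
print(shift)
```

With only a small share of values imputed, as reported for the occurrence data, the change in spread should be minor; a large drop would be a reason to revisit the method.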

### **Future Considerations**
- **Evaluation of Model Outputs**: If models for species with larger datasets (*Bufo bufo*, *Rana temporaria*) indicate poor performance, alternative preprocessing (e.g., removing records with missing values) will be evaluated and documented.
- **Transparency**: All preprocessing steps, including imputation, will be clearly reported to ensure reproducibility and scientific integrity.


In [26]:
# Import necessary libraries
import pandas as pd
import os

# Paths to the occurrence datasets
bufo_file = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Bufo_bufo_cleaned.csv"
rana_file = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Rana_temporaria_cleaned.csv"
lissotriton_file = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Lissotriton_helveticus_cleaned.csv"

# Load datasets into a dictionary
datasets = {
    "Bufo bufo": pd.read_csv(bufo_file),
    "Rana temporaria": pd.read_csv(rana_file),
    "Lissotriton helveticus": pd.read_csv(lissotriton_file),
}

# Directory for saving imputed files
output_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Imputed"
os.makedirs(output_dir, exist_ok=True)

# Impute missing values and save results
for species, data in datasets.items():
    # Impute missing values with the mean of each numeric column
    imputed_data = data.fillna(data.mean(numeric_only=True))
    
    # Save the imputed dataset to a CSV file
    output_path = os.path.join(output_dir, f"imputed_{species.replace(' ', '_')}.csv")
    imputed_data.to_csv(output_path, index=False)
    
    print(f"Imputed data saved for {species} to: {output_path}")


Imputed data saved for Bufo bufo to: C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Imputed\imputed_Bufo_bufo.csv
Imputed data saved for Rana temporaria to: C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Imputed\imputed_Rana_temporaria.csv
Imputed data saved for Lissotriton helveticus to: C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Imputed\imputed_Lissotriton_helveticus.csv


### Summary of Data Processing

The provided Python script addressed missing data in species occurrence datasets by performing the following steps:

1. **Imputation of Missing Values**: 
   - Missing values (`NaN`) in the predictor variables were replaced with the mean of the respective columns. This method is commonly used in ecological and statistical modelling when missing data is assumed to be missing at random (MAR).

2. **Standardisation Across Datasets**: 
   - The same imputation method was applied to all three species datasets (*Bufo bufo*, *Rana temporaria*, and *Lissotriton helveticus*) to ensure consistency in data processing.

3. **Output Generation**: 
   - The imputed datasets were saved as separate CSV files in a designated directory (`C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Imputed`) for further analysis.

This approach ensures that all datasets are complete and ready for use in species distribution modelling (SDM), while maintaining transparency and reproducibility in data processing. The decision to impute missing values with the mean is supported by its simplicity and ability to retain data structure, but it may introduce bias if missing values are not random.

### Verify Imputed Data


In [27]:
import pandas as pd

# File paths of the imputed CSVs
imputed_files = {
    "Bufo bufo": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Imputed/imputed_Bufo_bufo.csv",
    "Rana temporaria": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Imputed/imputed_Rana_temporaria.csv",
    "Lissotriton helveticus": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Imputed/imputed_Lissotriton_helveticus.csv"
}

# Function to verify imputed data
def verify_imputed_data(file_path, species_name):
    print(f"Verifying data for {species_name}")
    data = pd.read_csv(file_path)
    
    # Check for missing values
    missing_summary = data.isnull().sum()
    print("\nMissing Values Summary:")
    print(missing_summary[missing_summary > 0])  # Print only columns with missing values
    
    # Display data statistics
    print("\nSummary Statistics:")
    print(data.describe())  # Descriptive statistics for numeric columns
    
    # Check for unexpected ranges in numeric columns
    print("\nChecking for unexpected values...")
    for column in data.columns:
        if pd.api.types.is_numeric_dtype(data[column]):
            print(f"Column '{column}' - Min: {data[column].min()}, Max: {data[column].max()}")

    # Check structure of the data (info() prints directly and returns None)
    print("\nData structure:")
    data.info()

# Loop through the files and verify each
for species, file_path in imputed_files.items():
    print(f"\n{'-'*50}\n")
    verify_imputed_data(file_path, species)



--------------------------------------------------

Verifying data for Bufo bufo

Missing Values Summary:
Series([], dtype: int64)

Summary Statistics:
       C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif  \
count                                         716.000000                                                     
mean                                            0.984106                                                     
std                                             0.048305                                                     
min                                             0.498254                                                     
25%                                             0.996873                                                     
50%                                             1.000000                                                     
75%                                             1.000000                     

---

### Observations from the Verified Data

Based on the verification results for the imputed occurrence datasets for *Bufo bufo*, *Rana temporaria*, and *Lissotriton helveticus*, the following conclusions can be made:

#### No Missing Values:
- All datasets have no missing values after imputation, confirming the preprocessing was successful.

#### Data Structure:
- Each dataset contains 13 columns corresponding to the environmental predictors.
- The data types are consistent (`float64` for all columns), and the structure is as expected.

#### Predictor Ranges:
- The predictor values fall within reasonable ranges:
  - Most predictors range between `0.0` and `1.0` as they have been standardised.
  - Some predictors, like `Traffic_Reversed.tif`, have lower minimum values (e.g., 0.052 for *Rana temporaria* and 0.243 for *Lissotriton helveticus*), which align with expected variations in traffic intensity or other environmental factors.

#### Occurrence Counts:
- *Bufo bufo*: 716 records
- *Rana temporaria*: 1110 records
- *Lissotriton helveticus*: 337 records

These counts confirm the data size differences across species, which could influence modelling strategies.


## 7. Integrate Predictors with Pseudo-Absences

In [9]:
import geopandas as gpd
import pandas as pd
import rasterio

# Function to extract predictor values
def extract_predictor_values(points, predictors):
    coords = [(x, y) for x, y in zip(points.geometry.x, points.geometry.y)]
    values = {raster.name: [val[0] for val in raster.sample(coords)] for raster in predictors}
    return pd.DataFrame(values)

# Extract predictor values for all pseudo-absence data
pseudo_absence_predictor_values = {}
for species, data in pseudo_absence_data.items():
    print(f"Extracting predictor values for pseudo-absences of {species}...")
    pseudo_absence_predictor_values[species] = extract_predictor_values(data, all_predictors)

# Save extracted predictor values to CSV
for species, df in pseudo_absence_predictor_values.items():
    output_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/{species}_pseudo_absence_predictor_values.csv"
    df.to_csv(output_path, index=False)
    print(f"Saved pseudo-absence predictor values for {species} to {output_path}")


Extracting predictor values for pseudo-absences of Rana_temporaria...
Extracting predictor values for pseudo-absences of Bufo_bufo...
Extracting predictor values for pseudo-absences of Lissotriton_helveticus...
Saved pseudo-absence predictor values for Rana_temporaria to C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Rana_temporaria_pseudo_absence_predictor_values.csv
Saved pseudo-absence predictor values for Bufo_bufo to C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Bufo_bufo_pseudo_absence_predictor_values.csv
Saved pseudo-absence predictor values for Lissotriton_helveticus to C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Lissotriton_helveticus_pseudo_absence_predictor_values.csv


## 8. Validate Extracted Pseudo-Absence Data

In [11]:
# File paths to saved pseudo-absence predictor CSVs
pseudo_absence_csvs = {
    "Rana_temporaria": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Rana_temporaria_pseudo_absence_predictor_values.csv",
    "Bufo_bufo": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Bufo_bufo_pseudo_absence_predictor_values.csv",
    "Lissotriton_helveticus": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Lissotriton_helveticus_pseudo_absence_predictor_values.csv"
}

# Validate extracted pseudo-absence predictor values
for species, filepath in pseudo_absence_csvs.items():
    print(f"Validating pseudo-absence predictor values for {species}...")
    # Load the CSV
    df = pd.read_csv(filepath)
    # Display basic statistics and checks
    print(df.describe())  # Summary statistics
    print("Missing values per column:")
    print(df.isna().sum())  # Check for missing values
    print("-" * 40)


Validating pseudo-absence predictor values for Rana_temporaria...
       C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif  \
count                                       11903.000000                                                     
mean                                           -9.925178                                                     
std                                           330.311930                                                     
min                                         -9999.000000                                                     
25%                                             1.000000                                                     
50%                                             1.000000                                                     
75%                                             1.000000                                                     
max                                             1.0000

In [12]:
# Check for -9999 values in the pseudo-absence predictor data
for species, filepath in pseudo_absence_csvs.items():
    print(f"Checking for -9999 values in {species}...")
    df = pd.read_csv(filepath)
    print((df == -9999).sum())  # Count occurrences of -9999 in each column
    print("-" * 40)


Checking for -9999 values in Rana_temporaria...
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Building_Density_Reversed.tif                   13
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/DistWater_Reversed.tif                           5
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/NOx_Stand_Reversed.tif                           6
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/RGS_Reversed.tif                                 6
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Runoff_Coefficient_Standardised_Reversed.tif    18
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Slope_Proj_Reversed.tif                          5
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/SoilMoisture_32bit_Reversed.tif                 20
C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Input/Reversed/Traffic_Reversed.tif               

## 9. Handling Missing Predictor Data Across Pseudo-Absences
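
The notebook does not show this step, so the following is only a sketch of how it might proceed if the approach from section 6 were carried over. It assumes the `-9999` entries found in section 8 are NoData flags, converts them to `NaN`, and then fills each column with its mean; `mask_and_impute` is a hypothetical helper and the values are made up.

```python
import numpy as np
import pandas as pd

NODATA = -9999  # flag seen in the pseudo-absence summaries in section 8


def mask_and_impute(df, nodata=NODATA):
    """Convert NoData flags to NaN, then fill NaN with each column's mean."""
    masked = df.replace(nodata, np.nan)
    return masked.fillna(masked.mean(numeric_only=True))


toy = pd.DataFrame({
    "Building_Density_Reversed": [1.0, -9999.0, 0.8, 1.0],
    "DistWater_Reversed": [0.3, 0.5, 0.7, -9999.0],
})
cleaned = mask_and_impute(toy)
print(cleaned)
```

Masking must come before the mean is computed; otherwise the `-9999` flags drag the column mean far below the valid range, as the negative mean in the section 8 summary illustrates.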

## References

1. Nakagawa, S., & Freckleton, R. P. (2008). Missing inaction: The dangers of ignoring missing data. *Trends in Ecology & Evolution, 23*(11), 592-596. https://doi.org/10.1016/j.tree.2008.06.014  
2. Sillero, N., & Barbosa, A. M. (2021). Common mistakes in ecological niche models. *International Journal of Geographical Information Science, 35*(2), 213-226. https://doi.org/10.1080/13658816.2020.1798968  
3. Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. *Bioinformatics, 28*(1), 112-118. https://doi.org/10.1093/bioinformatics/btr597  
