# Setting Up Modelling Environment: Overview

## 1. Generating Pseudo-Absence Data

To complement presence-only data, pseudo-absences were generated based on species-specific traits and study area constraints. The methodology is intended to keep the pseudo-absences ecologically plausible and in line with common practice for species distribution models (SDMs).

### **Methodology**
- Exclusion buffers were applied around presence points so that pseudo-absences were not placed close to known occurrences:
  - *Rana temporaria*: 2000 m buffer
  - *Bufo bufo*: 5000 m buffer initially, reduced to 3000 m because the larger buffer limited pseudo-absence generation
  - *Lissotriton helveticus*: 1000 m buffer
- Species-specific absence-to-presence ratios were used:
  - *Rana temporaria*: 15:1
  - *Bufo bufo*: 10:1
  - *Lissotriton helveticus*: 10:1
- The generation process was implemented in the notebook [1_Data_SpeciesOccurrence.ipynb](https://github.com/Mavisan6/MScThesis-MaviSantarelli/blob/main/notebooks/1_Datasets/1_Data_SpeciesOccurrance.ipynb).
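
The buffer and ratio rules above can be sketched as simple rejection sampling. This is a minimal illustration only, not the notebook's implementation: the function name `generate_pseudo_absences`, the rectangular study area, and the example coordinates are all assumptions made for the sketch.

```python
import math
import random


def generate_pseudo_absences(presences, bounds, buffer_m, ratio,
                             seed=0, max_tries=100_000):
    """Rejection-sample pseudo-absences outside every presence buffer.

    presences: list of (x, y) coordinates in metres.
    bounds: (xmin, ymin, xmax, ymax); the study area is assumed rectangular.
    ratio: pseudo-absences per presence point (e.g. 10 for 10:1).
    """
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = bounds
    target = ratio * len(presences)
    points = []
    tries = 0
    while len(points) < target and tries < max_tries:
        tries += 1
        x = rng.uniform(xmin, xmax)
        y = rng.uniform(ymin, ymax)
        # Keep the candidate only if it lies outside every presence buffer.
        if all(math.hypot(x - px, y - py) >= buffer_m for px, py in presences):
            points.append((x, y))
    return points


# Hypothetical presence coordinates and extent, for illustration only.
presences = [(320000.0, 670000.0), (325000.0, 673000.0), (331000.0, 668000.0)]
bounds = (300000.0, 650000.0, 350000.0, 700000.0)
pseudo_absences = generate_pseudo_absences(presences, bounds,
                                           buffer_m=1000, ratio=10)
```

If the buffers cover most of the study area, the loop stops at `max_tries` and returns fewer points than requested, which is one way a large buffer can limit pseudo-absence generation.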

### **Validation Steps**
1. **Buffer Checks:** Ensured spatial separation between presence and pseudo-absence points to reduce spatial autocorrelation.
2. **Kernel Density Estimation (KDE):** Assessed uniformity of presence data distribution. KDE identified overrepresentation around Edinburgh for *Rana temporaria* and *Bufo bufo* due to survey bias.
3. **DBSCAN Cluster Analysis:** Validated the spatial distribution of pseudo-absence points, ensuring they were well-dispersed and ecologically meaningful.
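
As an illustration of the buffer check in step 1, the sketch below computes the smallest distance between any pseudo-absence and any presence and compares it with the buffer. The function names and coordinates are hypothetical, and the brute-force pairwise loop is an assumption chosen for clarity rather than speed.

```python
import math


def min_separation(presences, pseudo_absences):
    """Smallest presence-to-pseudo-absence distance, in coordinate units."""
    return min(
        math.hypot(ax - px, ay - py)
        for ax, ay in pseudo_absences
        for px, py in presences
    )


def passes_buffer_check(presences, pseudo_absences, buffer_m):
    """True if no pseudo-absence falls inside any presence buffer."""
    return min_separation(presences, pseudo_absences) >= buffer_m


# Hypothetical coordinates in metres, for illustration only.
presences = [(0.0, 0.0), (1000.0, 0.0)]
pseudo_absences = [(3000.0, 0.0), (0.0, 4000.0)]
closest = min_separation(presences, pseudo_absences)
```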

### **Challenges and Adjustments**
- Large dispersal buffers (e.g., *Bufo bufo*) limited pseudo-absence generation. Buffers were reduced to maintain ecological relevance and data usability.
- Potential survey bias around Edinburgh will be further evaluated during model validation.

---

## 2. Preparing Predictor Data

### **Methodology**
- Environmental predictors were prepared and standardised to ensure compatibility and ecological relevance. Key steps included:
  1. **Standardisation:** Predictor layers were normalised to a 0 to 1 scale, where:
     - **0** represents the lowest intensity of the variable.
     - **1** represents the highest intensity of the variable.
  2. **Consistency in Extent and Resolution:** All rasters were clipped to the same study area extent and resampled to a resolution of 30 m to ensure alignment and comparability.
  3. **Projection:** Rasters were reprojected to the British National Grid (EPSG:27700) for consistent spatial referencing.
  4. **Handling NoData Values:** Missing data across all rasters were unified with a value of -9999 to avoid inconsistencies and ensure compatibility with SDM tools.

- Reverse transformations were applied where necessary so that higher values consistently represent greater suitability.
- The detailed methodology is documented in [2_Predictors_Methods.ipynb](https://github.com/Mavisan6/MScThesis-MaviSantarelli/blob/main/notebooks/1_Datasets/2_Predictors_Methods.ipynb), and the full preprocessing workflow is available in [3_Predictors_Processing.ipynb](https://github.com/Mavisan6/MScThesis-MaviSantarelli/blob/main/notebooks/1_Datasets/3_Predictors_Processing.ipynb).
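
The standardisation, NoData handling, and reverse transformation described above can be sketched on a flat list of cell values. This is a simplified illustration under stated assumptions: min-max scaling is assumed as the normalisation, `invert=True` is assumed to mean `1 - scaled`, and the function name and sample values are hypothetical. The actual workflow operates on rasters in the linked notebooks.

```python
NODATA = -9999.0


def normalise(values, nodata=NODATA, invert=False):
    """Min-max scale valid cells to 0-1, leaving NoData cells unchanged.

    With invert=True the scale is flipped so that the original lowest
    value maps to 1 and the original highest value maps to 0.
    """
    valid = [v for v in values if v != nodata]
    if not valid:
        return list(values)
    lo, hi = min(valid), max(valid)
    span = hi - lo
    out = []
    for v in values:
        if v == nodata:
            out.append(nodata)
            continue
        # A constant layer has no range, so map it to 0 rather than divide by 0.
        scaled = 0.0 if span == 0 else (v - lo) / span
        out.append(1.0 - scaled if invert else scaled)
    return out


# Hypothetical cell values for one predictor layer.
layer = [0.0, 250.0, NODATA, 1000.0, 500.0]
scaled = normalise(layer)
inverted = normalise(layer, invert=True)
```

Excluding NoData cells before computing the minimum matters here: otherwise -9999 would become the layer minimum and compress all real values towards 1.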

### **Why These Steps Matter**
- Ensuring standardisation, alignment, and consistent handling of data minimises errors and enhances the ecological validity of model predictions.
- Misaligned or inconsistent predictors could compromise model accuracy and reliability.

---

## 3. Identifying Modelling Methodology

### **Overview**
- A review of modelling approaches identified algorithms suitable for ensemble SDM:
  - **GLM/GAM:** Use full pseudo-absence datasets for balanced representation.
  - **Random Forest/XGBoost:** Employ subsets of 100–500 pseudo-absences per run to enhance model stability.
  - **MaxEnt:** Stratified environmental sampling ensures realistic environmental gradients.

- Detailed methodology is available in [SupplementaryMethodsMaterial.ipynb](https://github.com/Mavisan6/MScThesis-MaviSantarelli/blob/main/notebooks/Supplementary%20Material/SupplementaryMethodsMaterial.ipynb).
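
For the Random Forest/XGBoost strategy, repeated subsetting might look like the sketch below. The function name, the number of runs (10), and the integer stand-ins for pseudo-absence records are assumptions for illustration. Only the 100–500 subset size comes from the text above.

```python
import random


def draw_pseudo_absence_subsets(pseudo_absences, n_runs, subset_size, seed=0):
    """Draw one random subset (without replacement) per model run."""
    rng = random.Random(seed)
    # Never ask for more points than are available.
    size = min(subset_size, len(pseudo_absences))
    return [rng.sample(pseudo_absences, size) for _ in range(n_runs)]


# Integer IDs stand in for pseudo-absence records (hypothetical data).
all_pseudo_absences = list(range(2000))
subsets = draw_pseudo_absence_subsets(all_pseudo_absences,
                                      n_runs=10, subset_size=500)
```

Each run would then be trained on all presences plus one subset, and the per-run predictions averaged.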

### **Next Steps**
- Train ensemble models using prepared datasets.
- Evaluate model performance and refine datasets iteratively.

---

## Future Work in Modelling Notebooks

### **4. Model Training Setup**
- Integrate the prepared presence and pseudo-absence datasets with standardised predictors.
- Split datasets into training and testing subsets (e.g., 70% training, 30% testing).
- Use k-fold cross-validation to evaluate model performance and ensure robustness.
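
A minimal sketch of the split and cross-validation steps, working on record indices only. The function names, the choice of `k=5`, and the sample size of 100 are assumptions. A purely random split is assumed here for simplicity.

```python
import random


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle indices 0..n-1 and split them into (train, test) lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for each of k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


train_idx, test_idx = train_test_split_indices(100)
folds = list(k_fold_indices(train_idx, k=5))
```

The held-out 30% is left untouched during cross-validation, so it can serve as a final check after the folds have been used for model comparison.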

### **5. Run Models**
- Train individual models using predefined strategies:
  - **GLM/GAM:** Utilise the full pseudo-absence dataset.
  - **RF/XGBoost:** Perform multiple runs with smaller subsets of pseudo-absences.
  - **MaxEnt:** Emphasise stratified sampling to ensure realistic habitat representation.
- Evaluate model outputs using metrics like AUC, TSS, and sensitivity.
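
The metrics named above can be computed directly from labels and predicted scores. The sketch below uses the standard definitions (TSS = sensitivity + specificity − 1, and AUC as the probability that a presence scores higher than an absence). The function names, the 0.5 threshold, and the toy data are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 1/0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def tss(tp, tn, fp, fn):
    """True Skill Statistic: sensitivity + specificity - 1."""
    return sensitivity(tp, fn) + specificity(tn, fp) - 1


def auc(y_true, scores):
    """AUC as the share of presence/absence pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a presence and an absence score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (1 = presence, 0 = pseudo-absence) and model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```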

### **6. Ensemble Modelling**
- Combine predictions from individual models using:
  - Weighted averages based on evaluation metrics (e.g., AUC, TSS).
  - Unweighted averages if all models contribute equally.
- Evaluate ensemble model performance against individual models to ensure robustness and accuracy.
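
Both averaging options can be expressed with one function, as sketched below. The function name, the per-cell prediction lists, and the TSS values used as weights are hypothetical. Normalising by the sum of the weights is an assumption that keeps the ensemble output on the same 0–1 scale as the inputs.

```python
def ensemble_average(predictions, weights=None):
    """Combine per-model predictions cell by cell.

    predictions: dict of model name -> list of suitability values.
    weights: dict of model name -> weight; equal weights if omitted.
    """
    names = list(predictions)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    n_cells = len(predictions[names[0]])
    return [
        sum(weights[name] * predictions[name][i] for name in names) / total
        for i in range(n_cells)
    ]


# Hypothetical predictions for two cells and hypothetical TSS weights.
predictions = {"glm": [0.2, 0.8], "rf": [0.4, 0.6], "maxent": [0.3, 0.9]}
tss_scores = {"glm": 0.5, "rf": 0.75, "maxent": 0.25}
weighted = ensemble_average(predictions, tss_scores)
unweighted = ensemble_average(predictions)
```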

### **7. Habitat Suitability Mapping**
- Apply the ensemble model to predict habitat suitability across the study area.
- Generate continuous suitability maps (values 0–1) and classify into binary maps (suitable/unsuitable) using optimal thresholds (e.g., maximising TSS).
- Visualise and validate maps to ensure alignment with ecological expectations and independent data, if available.
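
Choosing a threshold that maximises TSS and applying it to a continuous map could be sketched as follows. The function names and toy values are assumptions, and the search simply tries each observed score as a candidate threshold. The evaluation labels must contain both presences and absences, otherwise sensitivity or specificity is undefined.

```python
def tss_at_threshold(y_true, scores, threshold):
    """TSS when cells with score >= threshold are classed as suitable."""
    tp = tn = fp = fn = 0
    for t, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if t == 1 and predicted == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn) + tn / (tn + fp) - 1


def max_tss_threshold(y_true, scores):
    """Return the observed score that gives the highest TSS."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda th: tss_at_threshold(y_true, scores, th))


def binarise(suitability, threshold):
    """Convert continuous suitability (0-1) to 1 = suitable, 0 = unsuitable."""
    return [1 if s >= threshold else 0 for s in suitability]


# Toy evaluation data: 1 = presence, 0 = pseudo-absence.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
best = max_tss_threshold(y_true, scores)
binary_map = binarise([0.2, 0.65, 0.95], best)
```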

### **8. Model Validation and Refinement**
- Validate predictions against known ecological ranges and independent datasets where possible.
- Use response curves to interpret how predicted suitability varies with each predictor, and check that these relationships are ecologically plausible.
- Iteratively refine models based on validation outcomes to improve ecological realism.

By following these steps, the subsequent modelling notebooks will systematically build on the prepared data and methodology to produce robust species distribution models, ensemble predictions, and actionable habitat suitability maps.