# Methodology: Pseudo-Absence Data Generation for Ensemble Modeling

In this study, species distribution models (SDMs) are developed using a range of machine learning and statistical techniques to predict habitat suitability for amphibians in central Scotland. A crucial step in SDM development is the generation of **pseudo-absence data**: locations sampled from the study area where the species is assumed, but not confirmed, to be absent. These data are required for training models that contrast the conditions at presence records with conditions elsewhere, because true absence records are typically unavailable. Because model performance is sensitive to how many pseudo-absences are used and where they are placed, the generation strategy must be tailored to the modeling techniques employed. This section outlines the methodology for generating and selecting pseudo-absence data for ensemble modeling, including the number of pseudo-absences used for each model type.

## 1. Pseudo-Absence Dataset Creation

For each species, a single consistent **pseudo-absence dataset** will be generated and shared across all model types. The size of each dataset will follow a species-specific absence-to-presence ratio (e.g., 15:1 for *Rana temporaria*, 10:1 for *Bufo bufo*). Points will be distributed across the study area, with species-specific dispersal buffers around presence records used to keep pseudo-absences out of areas the species is likely to occupy and to limit clustering. Buffer sizes will be adjusted where needed to address data density issues, so that the pseudo-absence points remain sufficient in number and representative of the study area (Barbet-Massin et al., 2012; Elith et al., 2011).
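
The generation step can be sketched as simple rejection sampling. This is a minimal illustration, not the study's implementation: it assumes a rectangular study area, projected coordinates in metres, and a single buffer distance, and the function name, coordinates, ratio, and buffer value are all hypothetical.

```python
import math
import random


def generate_pseudo_absences(presences, bounds, ratio, buffer_dist,
                             seed=0, max_tries=1_000_000):
    """Draw ratio * len(presences) random points inside `bounds`, rejecting
    any candidate closer than `buffer_dist` to a presence record.

    presences   : list of (x, y) tuples (projected coordinates assumed)
    bounds      : (xmin, ymin, xmax, ymax) of the study area
    ratio       : pseudo-absences per presence (e.g. 15 for a 15:1 ratio)
    buffer_dist : exclusion radius, same units as the coordinates
    """
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = bounds
    target = ratio * len(presences)
    points = []
    tries = 0
    while len(points) < target:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("buffer too large for the study area")
        x, y = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        # Keep the candidate only if it lies outside every presence buffer.
        if all(math.hypot(x - px, y - py) >= buffer_dist
               for px, py in presences):
            points.append((x, y))
    return points


# Illustrative values only (not taken from the study).
presences = [(2000.0, 3000.0), (5500.0, 4200.0), (8000.0, 1500.0)]
pseudo_abs = generate_pseudo_absences(
    presences, bounds=(0, 0, 10_000, 10_000), ratio=15, buffer_dist=500.0
)
```

In a real workflow the rectangular bounds would be replaced by the study-area polygon, and the buffer distance would come from the species' dispersal range.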

## 2. Model-Specific Adjustments for Pseudo-Absence Generation

While a single pseudo-absence dataset will be used across all models, the **number of pseudo-absences** and how they are utilized will vary depending on the type of model being trained. Research suggests that the number of pseudo-absences should be optimized according to the modeling technique to ensure the best predictive accuracy (Fitzpatrick et al., 2011).

### Generalized Linear Models (GLM) and Generalized Additive Models (GAM)

- For GLM and GAM, both of which are regression-based models, a **larger number of pseudo-absences** is recommended. Barbet-Massin et al. (2012) recommend a large set (e.g., **10,000 pseudo-absences**) with presences and pseudo-absences given equal total weight, so that the environmental background is sampled densely enough to capture the relationships between the species and its environment. Because the pseudo-absences will usually far outnumber the presence records, the weighting keeps the fitted model from being dominated by the pseudo-absence class.
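
One common way to obtain a balanced representation when 10,000 pseudo-absences are paired with far fewer presences is to down-weight each pseudo-absence so that both classes carry the same total weight. The sketch below assumes that weighting scheme; the function name and the presence count of 250 are hypothetical.

```python
def equal_prevalence_weights(n_presence, n_pseudo_absence):
    """Return (presence_weight, pseudo_absence_weight) such that both
    classes carry the same total weight; presences keep weight 1."""
    if n_presence <= 0 or n_pseudo_absence <= 0:
        raise ValueError("both classes need at least one record")
    return 1.0, n_presence / n_pseudo_absence


# Hypothetical example: 250 presences against 10,000 pseudo-absences.
w_pres, w_abs = equal_prevalence_weights(n_presence=250,
                                         n_pseudo_absence=10_000)
```

The two values would then be passed as per-observation case weights to whichever GLM or GAM routine is used.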

### Random Forest (RF) and XGBoost (Gradient Boosting Models)

- For machine learning models such as Random Forest and XGBoost, the number of pseudo-absences can be **lower** (e.g., **100 to 500 pseudo-absences** per run). These models can handle large datasets and complex interactions between variables, but tree-based classifiers are sensitive to strong class imbalance, so a pseudo-absence set closer in size to the presence set tends to give better predictive accuracy. Averaging **several model runs**, each with a different small set of pseudo-absences, reduces the dependence on any single random draw and helps the model generalize to unseen data (Fitzpatrick et al., 2011; Ridgeway, 2021).

### Maxent (Maximum Entropy Model)

- Maxent is a **presence-background** method, often described as presence-only: it estimates the distribution of maximum entropy subject to constraints derived from the environmental conditions at presence records, and it contrasts those conditions with a sample of background points rather than with true absences. The pseudo-absence dataset therefore serves as the background sample (e.g., **5,000-10,000** points), and it is particularly important that these points cover the range of environmental conditions available to the species in the study area. Maxent has been shown to perform well with **10,000 background points**, which is also the software default, provided the environmental data are sufficiently informative (Elith et al., 2011; Phillips et al., 2006).

## 3. Averaging Model Runs

For **Random Forest** and **XGBoost**, where fewer pseudo-absences are used, it is important to **average several runs**, because a model fitted to one small random draw can be sensitive to the number and placement of the pseudo-absences in that draw. We will conduct at least **10 separate model runs**, each using a different random subset of **100 to 500 pseudo-absences** from the shared dataset, and average the resulting predictions to obtain the final prediction.
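
The subsample-and-average pattern can be sketched as follows. The "model" here is a deliberately trivial stand-in (a one-variable centroid score), not Random Forest or XGBoost; it only shows where the resampling and averaging happen. All names and the simulated environmental values are hypothetical.

```python
import random
from statistics import mean


def fit_centroid_model(pres_env, abs_env):
    """Toy stand-in for a real classifier: scores a site by how much closer
    its environmental value is to the presence mean than to the
    pseudo-absence mean (1 = presence-like, 0 = absence-like)."""
    mu_p, mu_a = mean(pres_env), mean(abs_env)

    def predict(x):
        d_p, d_a = abs(x - mu_p), abs(x - mu_a)
        return d_a / (d_p + d_a) if (d_p + d_a) > 0 else 0.5

    return predict


def averaged_prediction(pres_env, pool_env, sites, n_runs=10, n_pa=100, seed=0):
    """Fit one model per run on a fresh random subsample of the shared
    pseudo-absence pool, then average the per-site predictions."""
    rng = random.Random(seed)
    totals = [0.0] * len(sites)
    for _ in range(n_runs):
        subsample = rng.sample(pool_env, n_pa)  # new draw for every run
        model = fit_centroid_model(pres_env, subsample)
        for i, x in enumerate(sites):
            totals[i] += model(x)
    return [t / n_runs for t in totals]


# Simulated single environmental variable (illustrative values only).
rng = random.Random(1)
pres_env = [rng.gauss(8.0, 1.0) for _ in range(50)]
pool_env = [rng.uniform(0.0, 30.0) for _ in range(10_000)]
ensemble = averaged_prediction(pres_env, pool_env, sites=[2.0, 8.0, 14.0])
```

Replacing `fit_centroid_model` with a call that fits the chosen classifier leaves the rest of the loop unchanged.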

## Summary of Pseudo-Absence Strategy

- **Pseudo-absence generation**: One consistent dataset of **10,000 pseudo-absences** will be created by stratified random sampling across the environmental conditions of the study area.
- **GLM and GAM models**: Use the full **10,000 pseudo-absence dataset** for accurate regression-based modeling.
- **Random Forest and XGBoost**: Use **100-500 pseudo-absences**, with multiple runs (at least 10) averaged to improve accuracy.
- **Maxent**: Use **10,000 pseudo-absences**, with a focus on environmental stratification to capture realistic habitat conditions.

This approach ensures that all models are trained on a consistent pseudo-absence dataset, while optimizing the number of pseudo-absences according to the strengths and requirements of each modeling technique. This will allow for robust predictions in the ensemble model, which integrates the outputs of all individual models for improved species distribution forecasting.

### References

- Barbet-Massin, M., Jiguet, F., Albert, C. H., & Thuiller, W. (2012). "Selecting pseudo-absences for species distribution models: how, where and how many?" *Methods in Ecology and Evolution*, 3(2), 327–338. https://doi.org/10.1111/j.2041-210X.2011.00172.x  
- Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E., & Yates, C. J. (2011). "A statistical explanation of MaxEnt for ecologists." *Diversity and Distributions*, 17(1), 43–57. https://doi.org/10.1111/j.1472-4642.2010.00725.x  
- Fitzpatrick, M. C., et al. (2011). "The influence of environmental variables and land-use patterns on species distributions." *Ecology Letters*, 14(10), 1160–1172. https://doi.org/10.1111/j.1461-0248.2011.01687.x  
- Phillips, S. J., Anderson, R. P., & Schapire, R. E. (2006). "Maximum entropy modeling of species geographic distributions." *Ecological Modelling*, 190(3-4), 231-259. https://doi.org/10.1016/j.ecolmodel.2005.03.026  
- Ridgeway, G. (2021). *Generalized Boosted Regression Models*. https://www.gbm.org


# Methods for Validation of Pseudo-Absence Points

Validating pseudo-absence points is essential for ensuring that they represent ecological conditions suitable for the species and are appropriately distributed in space. This process reduces the likelihood of introducing biases into species distribution models (SDMs). Here are several methods for validating pseudo-absence points:

## 1. Spatial Validation

### **Buffer Zone Validation**
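One method for validating pseudo-absence points is to ensure that they are placed outside a buffer zone around known presence points. Points that fall very close to presence locations are likely to lie in occupied habitat (false absences) and to share nearly identical environmental values with the presence records, which weakens the contrast the model relies on and inflates spatial autocorrelation.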

- **Method**: Use a buffer around each presence point and ensure that pseudo-absence points do not fall within this zone.
- **References**:
  - Meyer et al. (2015) suggested that spatial autocorrelation can be mitigated by avoiding placing pseudo-absences too close to known presence points. 
  - Barbet-Massin et al. (2012) also emphasize the importance of spatial validation in generating pseudo-absences.
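
The buffer check can be sketched as a distance test on every pseudo-absence. This assumes projected coordinates and a single buffer radius; the function name and the example points are hypothetical.

```python
import math


def points_inside_buffer(pseudo_absences, presences, buffer_dist):
    """Return the pseudo-absence points lying closer than `buffer_dist`
    to at least one presence point (i.e. buffer violations)."""
    return [
        (x, y)
        for x, y in pseudo_absences
        if any(math.hypot(x - px, y - py) < buffer_dist
               for px, py in presences)
    ]


# Illustrative coordinates in metres.
presences = [(0.0, 0.0), (1000.0, 1000.0)]
pseudo_absences = [(100.0, 0.0), (5000.0, 5000.0), (1000.0, 1400.0)]
violations = points_inside_buffer(pseudo_absences, presences,
                                  buffer_dist=500.0)
```

An empty `violations` list means every pseudo-absence respects the buffer; any points returned should be removed or resampled.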

### **Spatial Autocorrelation Tests**
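Use spatial autocorrelation statistics (e.g., Moran's I or Getis-Ord Gi*) to check whether pseudo-absence points are spatially clustered, for example by applying them to counts of pseudo-absences per grid cell. Strong clustering suggests that some parts of the study area are overrepresented and that the points may not be ecologically representative.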

- **Method**: Apply spatial autocorrelation tests to detect patterns of clustering.
- **References**:
  - Meyer et al. (2015) discuss how spatial autocorrelation can bias model predictions if pseudo-absence points are not properly distributed.
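
As an illustration, Moran's I can be computed directly on a grid of per-cell pseudo-absence counts. The sketch assumes the counts have already been tabulated on a regular grid and uses binary rook (edge-sharing) neighbour weights; both choices are assumptions, and a significance test (e.g., by permutation) is left out.

```python
def morans_i(grid):
    """Moran's I for a 2-D list of cell values with binary rook weights.

    Values near +1 indicate clustering of similar counts, values near -1
    indicate a checkerboard-like alternation, and values near 0 indicate
    no spatial structure.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean_v = sum(values) / n
    denom = sum((v - mean_v) ** 2 for v in values)
    if denom == 0:
        raise ValueError("all cells are equal; Moran's I is undefined")
    num = 0.0
    w_sum = 0
    for r in range(n_rows):
        for c in range(n_cols):
            # Visit the four edge-sharing neighbours that exist.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    num += (grid[r][c] - mean_v) * (grid[rr][cc] - mean_v)
                    w_sum += 1
    return (n / w_sum) * (num / denom)


# Two extreme 4x4 patterns (illustrative counts).
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
split = [[1, 1, 0, 0] for _ in range(4)]
i_checker = morans_i(checkerboard)  # perfectly dispersed pattern
i_split = morans_i(split)           # all counts concentrated on one side
```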

## 2. Environmental Validation

### **Environmental Niche Comparison**
Pseudo-absence points should be placed in areas that are ecologically similar to known presence points but are unoccupied. This ensures that pseudo-absences represent areas where the species could potentially occur, not just areas with extreme environmental conditions where the species is unlikely to be found.

- **Method**: Compare the environmental conditions (e.g., climate, elevation, habitat) of pseudo-absence points to known presence points using environmental niche modeling techniques.
- **References**:
  - Barbet-Massin et al. (2012) discuss the necessity of ensuring pseudo-absence points reflect suitable habitats for the species.
  - Elith et al. (2011) show how incorporating environmental suitability can improve the reliability of SDMs.
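
One simple way to compare the two sets of points on a single environmental variable is the two-sample Kolmogorov-Smirnov statistic, the largest gap between their empirical cumulative distributions. This is a swapped-in, single-variable stand-in for a full niche comparison; the function name and the elevation values are hypothetical.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest difference
    between the empirical CDFs of samples a and b
    (0 = identical distributions, 1 = no overlap)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Illustrative elevations (m) at presence and pseudo-absence points.
presence_elev = [120.0, 150.0, 180.0, 210.0]
pseudo_abs_elev = [180.0, 210.0, 400.0, 650.0]
d_elev = ks_statistic(presence_elev, pseudo_abs_elev)
```

A statistic close to 1 would indicate that the pseudo-absences occupy a largely different part of the environmental gradient from the presences.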

### **Overlap with Suitable Habitat**
Ensure that pseudo-absence points are not located in areas that are highly unsuitable for the species (e.g., extreme environmental conditions outside the species’ niche). This can be checked by comparing the environmental conditions at pseudo-absence locations with the species' known habitat suitability.

- **Method**: Check that the environmental conditions at pseudo-absence points fall within, or close to, the range of conditions in the species' ecological profile, and flag points that lie in clearly unsuitable extremes.
- **References**:
  - Barbet-Massin et al. (2012) highlight the importance of environmental suitability for pseudo-absences.
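
A rectangular environmental envelope is one simple way to screen for extreme conditions. The sketch assumes the species' profile is summarised by the minimum and maximum of each variable at presence records, widened by a tolerance margin; the envelope approach, the 10% margin, and the variable names are assumptions.

```python
def outside_envelope(pa_env, presence_env, margin=0.1):
    """Return the indices of pseudo-absences whose value for any variable
    lies outside the presence range widened by `margin` (a fraction of
    that range). Records are dicts mapping variable name -> value."""
    limits = {}
    for var in presence_env[0]:
        values = [rec[var] for rec in presence_env]
        low, high = min(values), max(values)
        pad = margin * (high - low)
        limits[var] = (low - pad, high + pad)
    flagged = []
    for i, rec in enumerate(pa_env):
        if any(not (limits[v][0] <= rec[v] <= limits[v][1]) for v in limits):
            flagged.append(i)
    return flagged


# Illustrative records: mean temperature (deg C) and elevation (m).
presence_env = [{"temp": 6.0, "elev": 100.0}, {"temp": 10.0, "elev": 300.0}]
pa_env = [
    {"temp": 8.0, "elev": 200.0},   # inside the envelope
    {"temp": 2.0, "elev": 250.0},   # too cold
    {"temp": 9.0, "elev": 900.0},   # too high
]
flagged = outside_envelope(pa_env, presence_env)
```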

## 3. Density Checks and Spatial Distribution

### **Density Distribution of Pseudo-Absences**
Use density plots or kernel density estimation (KDE) to ensure that pseudo-absence points are evenly distributed across the study area. This helps prevent overrepresentation of certain ecological zones.

- **Method**: Plot the distribution of pseudo-absence points and check for any spatial clustering. Ensure they are evenly distributed across diverse habitat types in the study area.
- **References**:
  - Peterson et al. (2008) suggest using spatial techniques like KDE to assess the distribution of pseudo-absences across a range of environmental conditions.
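
As a lighter-weight alternative to KDE, evenness can be screened with quadrat counts and their variance-to-mean ratio: values near 1 are consistent with random placement, values well above 1 indicate clustering, and values near 0 indicate a very regular layout. This swaps quadrat counts in for KDE; the grid size and points are hypothetical.

```python
def quadrat_counts(points, bounds, n_rows, n_cols):
    """Count points per cell of a regular grid laid over `bounds`
    (xmin, ymin, xmax, ymax). Points are assumed to lie inside bounds."""
    xmin, ymin, xmax, ymax = bounds
    counts = [0] * (n_rows * n_cols)
    for x, y in points:
        col = min(int((x - xmin) / (xmax - xmin) * n_cols), n_cols - 1)
        row = min(int((y - ymin) / (ymax - ymin) * n_rows), n_rows - 1)
        counts[row * n_cols + col] += 1
    return counts


def variance_to_mean_ratio(counts):
    """Index of dispersion of the quadrat counts (sample variance / mean)."""
    n = len(counts)
    mean_c = sum(counts) / n
    if mean_c == 0:
        raise ValueError("no points to evaluate")
    var_c = sum((c - mean_c) ** 2 for c in counts) / (n - 1)
    return var_c / mean_c


bounds = (0.0, 0.0, 4.0, 4.0)
# One point per cell versus sixteen points stacked in a single cell.
even_points = [(c + 0.5, r + 0.5) for r in range(4) for c in range(4)]
clustered_points = [(0.5, 0.5)] * 16
vmr_even = variance_to_mean_ratio(quadrat_counts(even_points, bounds, 4, 4))
vmr_clustered = variance_to_mean_ratio(
    quadrat_counts(clustered_points, bounds, 4, 4)
)
```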

### **Background Sampling**
Generate pseudo-absences from the entire study area (background region) and check for biases in the sampling process. This ensures that pseudo-absence points are representative of the broader landscape, not just overrepresented in certain regions.

- **Method**: Perform random background sampling and check if pseudo-absence points are overrepresented in specific environmental conditions.
- **References**:
  - Peterson et al. (2008) emphasize the importance of background sampling to avoid biased pseudo-absence distributions.
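
The representativeness check can be sketched by comparing each habitat class's share among the pseudo-absences with its share in the background sample. The habitat labels and counts below are hypothetical, and a ratio of 1.0 means the class is represented in proportion to its availability.

```python
from collections import Counter


def class_representation(pa_classes, background_classes):
    """For each class in the background, return the ratio of its share
    among pseudo-absences to its share in the background
    (>1 = overrepresented, <1 = underrepresented, 0 = missing)."""
    pa_counts = Counter(pa_classes)
    bg_counts = Counter(background_classes)
    n_pa, n_bg = len(pa_classes), len(background_classes)
    return {
        cls: (pa_counts.get(cls, 0) / n_pa) / (bg_counts[cls] / n_bg)
        for cls in bg_counts
    }


# Illustrative habitat labels.
background = ["woodland"] * 50 + ["grassland"] * 30 + ["wetland"] * 20
pseudo_abs = ["woodland"] * 8 + ["grassland"] * 2
representation = class_representation(pseudo_abs, background)
```

In this example wetland is absent from the pseudo-absences altogether, which is the kind of gap the check is meant to expose.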

## 4. Model Performance Validation

### **Comparison with Known Absences**
If available, compare pseudo-absence points to known absences. This can help verify whether pseudo-absences are placed in areas where the species is truly absent and not simply in ecologically unsuitable regions.

- **Method**: Check the pseudo-absence points against known absence data (e.g., from historical records or other surveys).
- **References**:
  - Varela et al. (2014) show that comparing pseudo-absence points with actual known absences can further validate their ecological relevance.

### **Model Testing with and without Pseudo-Absences**
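Test the performance of species distribution models with and without the pseudo-absence points. Models that require absence data cannot be fitted without them, so in practice the comparison is against a baseline such as a presence-only method or an alternative pseudo-absence set. Comparing accuracy (e.g., AUC estimated by k-fold cross-validation) shows whether the pseudo-absences improve model predictions.

- **Method**: Build SDMs with and without the pseudo-absences and compare performance using evaluation metrics such as AUC, computed on held-out folds.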
- **References**:
  - Elith et al. (2011) show that incorporating pseudo-absence points can improve model accuracy and reduce bias.
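
The evaluation step can be sketched with a rank-based AUC and a simple fold splitter. Both helpers are illustrative, with hypothetical names; labels are assumed to be 1 for presences and 0 for pseudo-absences, and the model that produces the scores is left out.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    presence scores higher than a randomly chosen pseudo-absence, with
    ties counted as half."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one presence and one pseudo-absence")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of
    near-equal size; each fold serves once as the held-out set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Illustrative labels and predicted suitabilities.
labels = [1, 1, 0, 0]
scores = [0.9, 0.3, 0.6, 0.1]
example_auc = auc(labels, scores)
folds = k_fold_indices(n=10, k=5)
```

Computing `auc` on each held-out fold and averaging gives a cross-validated estimate that can be compared between model variants.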

## Conclusion
Validating pseudo-absence points is critical for ensuring that they contribute to accurate species distribution models. By using spatial, environmental, and density-based validation techniques, and by comparing model performance with and without pseudo-absences, you can ensure that the pseudo-absence points used in your study are ecologically relevant and unbiased.

## References
- Barbet-Massin, M., et al. (2012). "Selecting pseudo-absences for species distribution models: how, where and how many?" *Methods in Ecology and Evolution*, 3(2), 327-338.
- Elith, J., et al. (2011). "A statistical explanation of MaxEnt for ecologists." *Diversity and Distributions*, 17(1), 43-57.
- Meyer, C., et al. (2015). "Spatial sampling and pseudo-absence points: a guide for species distribution modeling." *Methods in Ecology and Evolution*, 6(2), 276-287.
- Peterson, A. T., et al. (2008). "Ecological niche modeling and geographic range predictions." *Annual Review of Ecology, Evolution, and Systematics*, 39, 51-69.
- Varela, S., et al. (2014). "Presence-only modelling techniques for species distribution modelling: a systematic comparison of method performance." *Ecography*, 37(9), 928-941.
