# Setting Up Model Training: Data Partitioning

## Table of Contents
1. [Load Datasets](#1.-Load-Datasets)
2. [Combine Datasets](#2.-Combine-Datasets)
3. [Subsample Pseudo-Absences for Each Model](#3.-Subsample-Pseudo-Absences-for-Each-Model)
4. [Validate the Data](#4.-Validate-the-Data)

## 1. Load Datasets

In [1]:
import pandas as pd

# File paths for occurrences
imputed_occurrence_files = {
    "Bufo bufo": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Imputed/imputed_Bufo_bufo.csv",
    "Rana temporaria": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Imputed/imputed_Rana_temporaria.csv",
    "Lissotriton helveticus": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Imputed/imputed_Lissotriton_helveticus.csv"
}

# File paths for pseudo-absences
imputed_pseudo_absence_files = {
    "Bufo bufo": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Bufo_bufo_pseudo_absence_predictor_values_imputed.csv",
    "Rana temporaria": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Rana_temporaria_pseudo_absence_predictor_values_imputed.csv",
    "Lissotriton helveticus": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Lissotriton_helveticus_pseudo_absence_predictor_values_imputed.csv"
}

# Load datasets into dictionaries
occurrence_data = {species: pd.read_csv(filepath) for species, filepath in imputed_occurrence_files.items()}
pseudo_absence_data = {species: pd.read_csv(filepath) for species, filepath in imputed_pseudo_absence_files.items()}


## 2. Combine Datasets

In [2]:
# Combine occurrence and pseudo-absence data for each species
combined_data = {}
for species in occurrence_data.keys():
    print(f"Combining data for {species}...")

    # Add a label column: 1 for occurrences, 0 for pseudo-absences
    occurrence_data[species]['label'] = 1
    pseudo_absence_data[species]['label'] = 0

    # Combine the datasets without subsampling
    combined_data[species] = pd.concat(
        [occurrence_data[species], pseudo_absence_data[species]],
        ignore_index=True
    )

    # Save the full combined dataset
    output_path = f"C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/{species}_combined_full.csv"
    combined_data[species].to_csv(output_path, index=False)
    print(f"Full combined dataset saved for {species} at {output_path}")


Combining data for Bufo bufo...
Full combined dataset saved for Bufo bufo at C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Bufo bufo_combined_full.csv
Combining data for Rana temporaria...
Full combined dataset saved for Rana temporaria at C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Rana temporaria_combined_full.csv
Combining data for Lissotriton helveticus...
Full combined dataset saved for Lissotriton helveticus at C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Lissotriton helveticus_combined_full.csv
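
`pd.concat` aligns columns by name and fills any column that is missing from one frame with `NaN`, so a mismatch between the occurrence and pseudo-absence tables would silently introduce missing values into the combined file. A minimal sketch of a check that could be run before combining is shown here. The helper name `column_mismatch` and the toy column names are hypothetical and not part of the original notebook.

```python
import pandas as pd


def column_mismatch(occurrences: pd.DataFrame, pseudo_absences: pd.DataFrame) -> dict:
    """Return the columns that appear in only one of the two frames."""
    occ_cols = set(occurrences.columns)
    pa_cols = set(pseudo_absences.columns)
    return {
        "only_in_occurrences": sorted(occ_cols - pa_cols),
        "only_in_pseudo_absences": sorted(pa_cols - occ_cols),
    }


# Toy frames with made-up predictor names, for illustration only
occ = pd.DataFrame({"x": [1.0], "y": [2.0], "elevation": [120.0]})
pa = pd.DataFrame({"x": [3.0], "y": [4.0], "slope": [5.0]})

mismatch = column_mismatch(occ, pa)
print(mismatch)

# Concatenating mismatched frames introduces NaN in the non-shared columns
combined = pd.concat([occ, pa], ignore_index=True)
print(combined.isna().sum().to_dict())
```

If both lists come back empty, the two tables share the same columns and the concatenation adds no missing values of its own.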


## 3. Subsample Pseudo-Absences for Each Model

### Overview
This step prepares the datasets by subsampling pseudo-absence points according to the requirements of each modeling technique. Species distribution models (SDMs) need pseudo-absence data to represent areas where the species is presumed absent, which lets the model contrast the environmental conditions at presence sites with those elsewhere. The number of pseudo-absences, and how they are used, depends on the modeling approach.

This study adopts a **compromise methodology** that combines literature-based recommendations with adjustments required due to challenges encountered in generating ecologically plausible pseudo-absence points. Specifically, some species had significantly more presence points than others, and large ecological dispersal buffers limited pseudo-absence generation to avoid clustering. As a result, while the methodology adheres to established guidelines, certain deviations were necessary to maintain ecological validity.

### Methodology
Each model type has specific requirements for pseudo-absence data, and the subsampling process reflects both best practices from the literature and adjustments based on the study's findings:

#### Generalized Linear Models (GLM) and Generalized Additive Models (GAM):
  - **Pseudo-Absence Strategy:** Use all available pseudo-absence points, even where the ratio falls below the 10:1 target because of ecological constraints during pseudo-absence generation. For example, *Bufo bufo* has 5600 pseudo-absences for 716 presences (about 7.8:1), fewer than the target, due to the reduced buffer distance required to avoid clustering.
  - **Rationale:** Larger pseudo-absence datasets remain beneficial for regression-based models, stabilizing predictions and improving model accuracy, even if slightly below the ideal ratio.
  - **Supporting Literature:** Barbet-Massin et al. (2012) recommend generating pseudo-absence datasets that are significantly larger than presence data for regression models.
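
As a quick worked check of how far each species sits from the 10:1 target, the ratios can be computed from the presence and pseudo-absence counts reported in the validation section of this notebook. The variable names below are illustrative only.

```python
# Presence and pseudo-absence counts as reported in the validation section
counts = {
    "Bufo bufo": (716, 5600),
    "Rana temporaria": (1110, 11903),
    "Lissotriton helveticus": (337, 3158),
}

TARGET_RATIO = 10  # pseudo-absences per presence

ratios = {}
shortfalls = {}
for species, (presences, pseudo_absences) in counts.items():
    ratios[species] = pseudo_absences / presences
    # Number of additional pseudo-absences that would be needed to reach the target
    shortfalls[species] = max(0, TARGET_RATIO * presences - pseudo_absences)
    print(f"{species}: {ratios[species]:.2f}:1 (shortfall vs target: {shortfalls[species]})")
```

The shortfall is zero for any species whose pseudo-absence count already meets or exceeds ten times its presence count.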

#### Random Forest (RF) and Gradient Boosting Machines (e.g., XGBoost):
  - **Pseudo-Absence Strategy:** Maintain a 1:1 ratio of pseudo-absences to presences per model run by subsampling pseudo-absences.
  - **Iterative Averaging:** Perform at least 10 iterations, subsampling 1:1 ratios each time, and average the results to enhance model stability and accuracy.
  - **Rationale:** Machine learning models effectively handle imbalanced datasets but benefit from balanced pseudo-absence data to avoid overfitting (Fitzpatrick et al., 2011). Averaging multiple runs further reduces variability.
  - **Supporting Literature:** Fitzpatrick et al. (2011) and Ridgeway (2021) highlight the need for balanced datasets and iterative modeling in machine learning approaches.
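
The iterative averaging step can be sketched as follows. This is an illustration under stated assumptions rather than the training code used in this study: `average_over_runs` and `centroid_score` are hypothetical helpers, the toy centroid scorer merely stands in for a real RF or XGBoost model, and the data are synthetic.

```python
import numpy as np
import pandas as pd


def centroid_score(train: pd.DataFrame, new_points: np.ndarray) -> np.ndarray:
    """Toy stand-in for a fitted model: score in [0, 1] from distances to class centroids."""
    features = train.drop(columns="label").to_numpy()
    labels = train["label"].to_numpy()
    presence_centroid = features[labels == 1].mean(axis=0)
    absence_centroid = features[labels == 0].mean(axis=0)
    d_presence = np.linalg.norm(new_points - presence_centroid, axis=1)
    d_absence = np.linalg.norm(new_points - absence_centroid, axis=1)
    # Closer to the presence centroid gives a score nearer 1
    return d_absence / (d_presence + d_absence + 1e-12)


def average_over_runs(data, new_points, fit_predict, n_runs=10):
    """Subsample pseudo-absences 1:1 per run, score new points, and average the scores."""
    presences = data[data["label"] == 1]
    pseudo_absences = data[data["label"] == 0]
    run_scores = []
    for run in range(n_runs):
        # The run index is used as the seed, so each run is reproducible
        sampled = pseudo_absences.sample(n=len(presences), random_state=run)
        balanced = pd.concat([presences, sampled], ignore_index=True)
        run_scores.append(fit_predict(balanced, new_points))
    return np.mean(run_scores, axis=0)


# Synthetic data: 20 presences around (2, 2), 200 pseudo-absences around (0, 0)
rng = np.random.default_rng(0)
presence_xy = rng.normal(loc=2.0, scale=0.5, size=(20, 2))
absence_xy = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
data = pd.DataFrame(np.vstack([presence_xy, absence_xy]), columns=["var1", "var2"])
data["label"] = [1] * 20 + [0] * 200

new_points = np.array([[2.0, 2.0], [0.0, 0.0]])
mean_scores = average_over_runs(data, new_points, centroid_score)
print(mean_scores)
```

Passing the model as a `fit_predict` callable keeps the subsampling and averaging logic separate from the choice of learner, so the same loop could wrap any classifier that returns a score per point.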

#### Maxent (Maximum Entropy Model):
  - **Pseudo-Absence Strategy:** Use all available pseudo-absence points to ensure environmental diversity, even if the ratio slightly deviates from the 10:1 target.
  - **Rationale:** Maxent relies on background points rather than true absences, and a larger number of points ensures comprehensive environmental sampling for robust predictions.
  - **Supporting Literature:** Phillips et al. (2006) and Elith et al. (2011) describe Maxent as contrasting presence locations with background points sampled from the study area, so the background sample should capture the study area's environmental variability.

### Compromise with Literature-Based Methodology
While the pseudo-absence subsampling strategy broadly aligns with the methodology outlined in the project's supplementary methods document (`SupplementaryMethodsMaterial.ipynb`), deviations were necessary due to the following challenges:
- **Species-Specific Disparities:** For instance, *Rana temporaria* had significantly more presence points compared to *Lissotriton helveticus*, resulting in different absolute numbers of pseudo-absences generated.
- **Buffer Constraints:** Ecological dispersal distances were incorporated to avoid clustering of pseudo-absence points near presence points, leading to slightly fewer pseudo-absences than the target ratio for some species.
- **Ecological Validity:** The adjustments ensure that pseudo-absence points remain ecologically meaningful, avoiding overrepresentation of clustered absences.

This adjusted methodology remains a robust approach for SDMs while addressing the unique constraints of this study. Future studies without such constraints can follow the original guidelines in `SupplementaryMethodsMaterial.ipynb`.

### Expected Outcome
This tailored approach ensures:
- Robust pseudo-absence representation for regression models, improving accuracy even with slight deviations from ideal ratios.
- Balanced datasets for machine learning models, preventing overfitting and enhancing generalization.
- Comprehensive environmental sampling for Maxent, enabling reliable presence-only predictions.

By balancing literature-based guidelines with practical adjustments, this methodology ensures ecological relevance and predictive robustness while addressing the challenges encountered during pseudo-absence generation.


In [4]:
import pandas as pd
import os

# File paths for combined datasets
combined_files = {
    "Bufo bufo": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Bufo bufo_combined_full.csv",
    "Rana temporaria": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Rana temporaria_combined_full.csv",
    "Lissotriton helveticus": "C:/GIS_Course/MScThesis-MaviSantarelli/data/Processed/Lissotriton helveticus_combined_full.csv"
}

# Output directory for subsampled data
output_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled"
os.makedirs(output_dir, exist_ok=True)

# Subsampling configurations
subsampling_config = {
    "GLM": "all",  # Use all pseudo-absences
    "GAM": "all",  # Use all pseudo-absences
    "Maxent": "all",  # Use all pseudo-absences
    "RF": "1:1",  # 1:1 ratio with presences
    "XGBoost": "1:1"  # 1:1 ratio with presences
}

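# Function to subsample pseudo-absences
def subsample_pseudo_absences(data, config, iteration=None):
    if config == "all":
        return data  # Use the entire dataset

    if config == "1:1":
        occurrences = data[data["label"] == 1]
        pseudo_absences = data[data["label"] == 0]
        if len(pseudo_absences) < len(occurrences):
            raise ValueError("Fewer pseudo-absences than occurrences; cannot subsample 1:1")
        # Subsample pseudo-absences to match the number of occurrences;
        # the iteration index doubles as the random seed so each run is reproducible
        subsampled_pseudo_absences = pseudo_absences.sample(
            n=len(occurrences), random_state=(42 if iteration is None else iteration)
        )
        return pd.concat([occurrences, subsampled_pseudo_absences], ignore_index=True)

    # Fail loudly instead of silently returning None for an unknown configuration
    raise ValueError(f"Unknown subsampling configuration: {config!r}")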

# Process each species
for species, filepath in combined_files.items():
    print(f"Processing {species}...")
    data = pd.read_csv(filepath)
    
    for model, config in subsampling_config.items():
        print(f"  Subsampling for {model}...")

        # For RF and XGBoost, perform multiple iterations for averaging
        if model in ["RF", "XGBoost"] and config == "1:1":
            for i in range(10):  # Perform 10 iterations
                subsampled_data = subsample_pseudo_absences(data, config, iteration=i)
                output_path = f"{output_dir}/{species}_{model}_subsampled_run{i + 1}.csv"
                subsampled_data.to_csv(output_path, index=False)
                print(f"    Iteration {i + 1}: Subsampled data saved at {output_path}")
        else:
            # For GLM, GAM, and Maxent, save the full dataset once (no subsampling is applied)
            subsampled_data = subsample_pseudo_absences(data, config)
            output_path = f"{output_dir}/{species}_{model}_subsampled.csv"
            subsampled_data.to_csv(output_path, index=False)
            print(f"    Subsampled data saved at {output_path}")


Processing Bufo bufo...
  Subsampling for GLM...
    Subsampled data saved at C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled/Bufo bufo_GLM_subsampled.csv
  Subsampling for GAM...
    Subsampled data saved at C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled/Bufo bufo_GAM_subsampled.csv
  Subsampling for Maxent...
    Subsampled data saved at C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled/Bufo bufo_Maxent_subsampled.csv
  Subsampling for RF...
    Iteration 1: Subsampled data saved at C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled/Bufo bufo_RF_subsampled_run1.csv
    Iteration 2: Subsampled data saved at C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled/Bufo bufo_RF_subsampled_run2.csv
    Iteration 3: Subsampled data saved at C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled/Bufo bufo_RF_subsampled_run3.csv
    Iteration 4: Subsampled data saved at C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled/Bufo bufo_RF_subsampled_run4.csv
    Iteratio

---

### What the Code Did:
#### 1. **Input Data**:
   - Occurrence and pseudo-absence datasets were combined for each species (*Bufo bufo*, *Rana temporaria*, and *Lissotriton helveticus*).
   - Pseudo-absence points were generated previously, considering ecological dispersal buffers to avoid clustering near presence points.

#### 2. **Subsampling Configuration**:
   - **GLM, GAM, and Maxent Models**:
     - Utilised all available pseudo-absence data to ensure sufficient representation of ecological variability.
   - **Random Forest (RF) and XGBoost Models**:
     - Applied a 1:1 ratio of pseudo-absences to presence points.
     - For each species, 10 subsampled datasets were created to facilitate iterative model runs and averaging, improving model stability and accuracy.

#### 3. **Iteration for RF and XGBoost**:
   - Each of the 10 subsamples per species uses the iteration index (0 to 9) as its random seed, so the draws differ between runs but are reproducible.
   - This iterative subsampling supports model averaging during the training phase, which reduces the variability introduced by any single random draw.
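
Because the iteration index is passed as `random_state`, each run draws a different subset of pseudo-absences, while repeating the same iteration reproduces the same subset. A small sketch on a toy table illustrates this. The data, the `point_id` column, and the helper name `draw` are made up for illustration.

```python
import pandas as pd

# Toy pseudo-absence table with 100 rows; "point_id" is a hypothetical identifier column
pseudo_absences = pd.DataFrame({"point_id": list(range(100)), "label": [0] * 100})


def draw(iteration: int, n: int = 10) -> list:
    """Sample n pseudo-absences, seeding the draw with the iteration index."""
    sample = pseudo_absences.sample(n=n, random_state=iteration)
    return sorted(sample["point_id"].tolist())


run0_first = draw(0)
run0_again = draw(0)
run1 = draw(1)

print(run0_first == run0_again)  # same seed: identical subset
print(run0_first == run1)        # different seed: a different draw
```

Subsets from different runs can still overlap, since every run samples from the same pool of pseudo-absences.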

#### 4. **Output Files**:
   - Subsampled datasets were saved in a structured directory with filenames indicating the species, model type, and iteration (if applicable).
   - Example filenames:
     - `Bufo bufo_RF_subsampled_run1.csv`
     - `Rana temporaria_GLM_subsampled.csv`

### Rationale
- The pseudo-absence datasets for *GLM*, *GAM*, and *Maxent* required a higher ratio of pseudo-absences relative to occurrences to ensure model stability and accuracy. This approach aligns with best practices in regression-based and presence-only modeling.
- For *RF* and *XGBoost*, iterative subsampling was necessary to achieve balanced datasets and facilitate model averaging, addressing the models’ sensitivity to pseudo-absence distributions.

### Expected Outcomes
This step ensures:
- Robust and ecologically valid datasets for each modeling technique.
- Balanced datasets for machine learning models, improving generalization to unseen data.
- Comprehensive environmental representation for presence-only models like Maxent.

The resulting datasets are now ready for the model training phase, where the tailored subsampling strategy is intended to support predictive performance and reliability.


## 4. Validate the Data

In [5]:
import pandas as pd
import os

# Path to the output directory
output_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled"

# Species and models
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]
models = ["GLM", "GAM", "Maxent", "RF", "XGBoost"]

# Validation
for species in species_list:
    print(f"Validating files for {species}...")
    for model in models:
        if model in ["RF", "XGBoost"]:
            # Check 10 iterations
            for i in range(1, 11):
                filepath = f"{output_dir}/{species}_{model}_subsampled_run{i}.csv"
                if os.path.exists(filepath):
                    data = pd.read_csv(filepath)
                    presence_count = data[data['label'] == 1].shape[0]
                    absence_count = data[data['label'] == 0].shape[0]
                    print(f"  Iteration {i} - {model}: {presence_count} presences, {absence_count} pseudo-absences")
                else:
                    print(f"  File missing: {filepath}")
        else:
            # Check single subsampling for GLM, GAM, Maxent
            filepath = f"{output_dir}/{species}_{model}_subsampled.csv"
            if os.path.exists(filepath):
                data = pd.read_csv(filepath)
                presence_count = data[data['label'] == 1].shape[0]
                absence_count = data[data['label'] == 0].shape[0]
                print(f"  {model}: {presence_count} presences, {absence_count} pseudo-absences")
            else:
                print(f"  File missing: {filepath}")

print("\nValidation complete.")


Validating files for Bufo bufo...
  GLM: 716 presences, 5600 pseudo-absences
  GAM: 716 presences, 5600 pseudo-absences
  Maxent: 716 presences, 5600 pseudo-absences
  Iteration 1 - RF: 716 presences, 716 pseudo-absences
  Iteration 2 - RF: 716 presences, 716 pseudo-absences
  Iteration 3 - RF: 716 presences, 716 pseudo-absences
  Iteration 4 - RF: 716 presences, 716 pseudo-absences
  Iteration 5 - RF: 716 presences, 716 pseudo-absences
  Iteration 6 - RF: 716 presences, 716 pseudo-absences
  Iteration 7 - RF: 716 presences, 716 pseudo-absences
  Iteration 8 - RF: 716 presences, 716 pseudo-absences
  Iteration 9 - RF: 716 presences, 716 pseudo-absences
  Iteration 10 - RF: 716 presences, 716 pseudo-absences
  Iteration 1 - XGBoost: 716 presences, 716 pseudo-absences
  Iteration 2 - XGBoost: 716 presences, 716 pseudo-absences
  Iteration 3 - XGBoost: 716 presences, 716 pseudo-absences
  Iteration 4 - XGBoost: 716 presences, 716 pseudo-absences
  Iteration 5 - XGBoost: 716 presences, 716

### **Summary of Validation Results**

#### GLM, GAM, and Maxent
- The number of pseudo-absences matches the total available for each species, adhering to the `all` configuration.
- **Counts**:
  - **Bufo bufo**: 716 presences, 5600 pseudo-absences
  - **Rana temporaria**: 1110 presences, 11903 pseudo-absences
  - **Lissotriton helveticus**: 337 presences, 3158 pseudo-absences

#### RF and XGBoost
- Iterative subsampling (10 runs) was successfully performed for a 1:1 ratio of presences to pseudo-absences.
- **Counts for each iteration**:
  - **Bufo bufo**: 716 presences, 716 pseudo-absences
  - **Rana temporaria**: 1110 presences, 1110 pseudo-absences
  - **Lissotriton helveticus**: 337 presences, 337 pseudo-absences

#### File Generation
- All files for GLM, GAM, Maxent, and iterations of RF and XGBoost are present in the designated directory.
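
The printed counts can also be checked programmatically, so that a mismatch raises an error instead of relying on visual inspection. This is a minimal sketch, assuming the `label` convention used in this notebook (1 for presence, 0 for pseudo-absence). `label_counts` and `assert_balanced` are hypothetical helpers, and the toy frames stand in for files read with `pd.read_csv`.

```python
import pandas as pd


def label_counts(df: pd.DataFrame) -> tuple:
    """Return (presences, pseudo_absences) based on the label column."""
    presences = int((df["label"] == 1).sum())
    pseudo_absences = int((df["label"] == 0).sum())
    return presences, pseudo_absences


def assert_balanced(df: pd.DataFrame) -> None:
    """Raise AssertionError unless presences and pseudo-absences are equal in number."""
    presences, pseudo_absences = label_counts(df)
    assert presences == pseudo_absences, (
        f"Expected 1:1, got {presences} presences and {pseudo_absences} pseudo-absences"
    )


# Toy balanced dataset standing in for one RF/XGBoost run file
balanced = pd.DataFrame({"label": [1, 1, 1, 0, 0, 0]})
assert_balanced(balanced)
print(label_counts(balanced))

# Toy unbalanced dataset standing in for a GLM/GAM/Maxent file
full = pd.DataFrame({"label": [1, 1, 0, 0, 0, 0, 0]})
print(label_counts(full))
```

Calling `assert_balanced` only on the RF and XGBoost files mirrors the configuration above, where the other models keep all pseudo-absences.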


---

## Next Steps

### Quality Check
**Open a few CSV files and verify that:**
   - All required columns are present (e.g., environmental predictors, `label` column).
   - RF and XGBoost datasets are properly subsampled (e.g., balanced counts in sampled datasets).
   - There are no missing or erroneous data (e.g., `NaN` or invalid values).


In [6]:
import pandas as pd
import os

# Define paths to the output directory and generated files
output_dir = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled"
species_list = ["Bufo bufo", "Rana temporaria", "Lissotriton helveticus"]
models = ["GLM", "GAM", "Maxent", "RF", "XGBoost"]

# Function to perform quality checks on a CSV file
def quality_check(filepath, model, iteration=None):
    try:
        # Load the dataset
        df = pd.read_csv(filepath)

        # Check for required columns
        required_columns = ["label"]  # Add more columns if needed
        missing_columns = [col for col in required_columns if col not in df.columns]
        if missing_columns:
            print(f"Missing columns in {filepath}: {missing_columns}")
            return False

        # Check for proper subsampling
        if model in ["RF", "XGBoost"]:
            counts = df["label"].value_counts()
            # .get avoids a KeyError when one class is absent from the file
            if counts.get(1, 0) != counts.get(0, 0):
                print(f"Imbalanced dataset in {filepath}: {counts.to_dict()}")
                return False

        # Check for missing values (NaN); other invalid values are not tested here
        if df.isnull().any().any():
            print(f"Missing values found in {filepath}")
            return False

        print(f"Quality check passed for {filepath}")
        return True

    except Exception as e:
        print(f"Error checking {filepath}: {e}")
        return False

# Perform quality checks
for species in species_list:
    for model in models:
        if model in ["RF", "XGBoost"]:
            for i in range(1, 11):  # Iterate through the 10 runs
                filepath = os.path.join(output_dir, f"{species}_{model}_subsampled_run{i}.csv")
                quality_check(filepath, model, iteration=i)
        else:
            filepath = os.path.join(output_dir, f"{species}_{model}_subsampled.csv")
            quality_check(filepath, model)


Quality check passed for C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled\Bufo bufo_GLM_subsampled.csv
Quality check passed for C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled\Bufo bufo_GAM_subsampled.csv
Quality check passed for C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled\Bufo bufo_Maxent_subsampled.csv
Quality check passed for C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled\Bufo bufo_RF_subsampled_run1.csv
Quality check passed for C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled\Bufo bufo_RF_subsampled_run2.csv
Quality check passed for C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled\Bufo bufo_RF_subsampled_run3.csv
Quality check passed for C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled\Bufo bufo_RF_subsampled_run4.csv
Quality check passed for C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled\Bufo bufo_RF_subsampled_run5.csv
Quality check passed for C:/GIS_Course/MScThesis-MaviSantarelli/data/Subsampled\Bufo bufo_RF_subsampled_r

### **What the Code Does**
#### 1. Loads the Generated Files:
   - Iterates through all species and models, including multiple iterations for RF and XGBoost.

#### 2. Checks Required Columns:
   - Ensures essential columns like `label` are present in each dataset.

#### 3. Validates Subsampling:
   - Confirms a 1:1 ratio of presences to pseudo-absences for RF and XGBoost datasets.

#### 4. Identifies Missing Values:
   - Flags any missing (`NaN`) values in the datasets. Other kinds of invalid values (for example, infinite or out-of-range entries) are not checked.

#### 5. Reports Results:
   - Outputs whether each file passes or fails the quality check, along with any specific issues detected.
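
The missing-value check could be extended to catch other kinds of invalid entries along the following lines. This is a sketch under assumptions: `find_invalid_values` is a hypothetical helper, the predictor column name `bio1` is invented, and what counts as invalid here (infinite values, labels other than 0 or 1) is an illustrative choice rather than a rule taken from this study.

```python
import numpy as np
import pandas as pd


def find_invalid_values(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems found in the dataset."""
    problems = []

    # Missing values in any column
    n_missing = int(df.isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing value(s)")

    # Infinite values in numeric columns (isna() does not flag these by default)
    numeric = df.select_dtypes(include="number")
    n_infinite = int(np.isinf(numeric.to_numpy(dtype=float)).sum())
    if n_infinite:
        problems.append(f"{n_infinite} infinite value(s)")

    # Labels are expected to be 0 or 1
    if "label" in df.columns:
        observed = set(df["label"].dropna().unique().tolist())
        bad_labels = sorted(observed - {0, 1})
        if bad_labels:
            problems.append(f"unexpected label value(s): {bad_labels}")

    return problems


# Toy frames for illustration only
clean = pd.DataFrame({"bio1": [1.0, 2.0], "label": [1, 0]})
dirty = pd.DataFrame({"bio1": [np.inf, np.nan, 3.0], "label": [1, 0, 2]})

clean_problems = find_invalid_values(clean)
dirty_problems = find_invalid_values(dirty)
print(clean_problems)
print(dirty_problems)
```

Returning a list of problems rather than a single boolean makes it possible to report every issue in a file at once.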

## Next Steps
With all datasets validated, the next step is model training and evaluation using these prepared datasets.

## References

- Barbet-Massin, M., Jiguet, F., Albert, C. H., & Thuiller, W. (2012). Selecting pseudo-absences for species distribution models: How, where and how many? *Methods in Ecology and Evolution*, *3*(2), 327-338. https://doi.org/10.1111/j.2041-210X.2011.00172.x

- Phillips, S. J., Anderson, R. P., & Schapire, R. E. (2006). Maximum entropy modeling of species geographic distributions. *Ecological Modelling*, *190*(3-4), 231-259. https://doi.org/10.1016/j.ecolmodel.2005.03.026

- Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E., & Yates, C. J. (2011). A statistical explanation of MaxEnt for ecologists. *Diversity and Distributions*, *17*(1), 43-57. https://doi.org/10.1111/j.1472-4642.2010.00725.x

- Fitzpatrick, M. C., Gotelli, N. J., & Ellison, A. M. (2011). MaxEnt versus MaxLike: Empirical comparisons with ant species distributions. *Ecosphere*, *2*(5), 1-15. https://doi.org/10.1890/ES11-00075.1

- Ridgeway, G. (2021). Generalized Boosted Regression Models. *Comprehensive R Archive Network (CRAN)*. https://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf
