This repository hosts datasets and code for the paper "Cross-validation for Geospatial Data: A Framework for Estimating Generalization Performance in Geostatistical Problems". We compared the performance of five cross-validation (CV) methods - standard K-Fold CV (KFCV), BLocking CV (BLCV), BuFfered CV (BFCV), Importance-Weighted CV (IWCV) and our proposed Importance-weighted Buffered CV (IBCV) - in various geospatial scenarios.
We provide six simulation datasets and 15 real datasets.
The following abbreviations serve as [dataset name] on the command line.
- Simulation: sim_sd, sim_si, sim_sdcs, sim_sics, sim_sirs, sim_sipcs
- HEWA1800: hewa1800_sd, hewa1800_si, hewa1800_sdcs, hewa1800_sics
- HEWA1000: hewa1000_sd, hewa1000_si, hewa1000_sdcs, hewa1000_sics
- WETA1800: weta1800_sd, weta1800_si, weta1800_sdcs, weta1800_sics
- Alaska: alaska
- Housing: house_bay, house_latitude
To run the code, install the dependencies listed in requirements.txt:
pip install -r requirements.txt
To compute the model errors and the error estimates of the five CV methods on a specific dataset:
python run.py --dataset [dataset name]
Take the Simulation Scenario SD (sim_sd) dataset for example:
python run.py --dataset sim_sd
The results are saved to a CSV file automatically.
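To run every dataset in one go, the command above can be wrapped in a small driver. The following is only a sketch and is not part of the repository: DATASETS, build_command and run_all are hypothetical helper names, and the script assumes it is started from the directory that contains run.py. The dataset names are copied from the list above.

```python
import subprocess
import sys

# Dataset names as listed in this README, grouped the same way.
DATASETS = {
    "Simulation": ["sim_sd", "sim_si", "sim_sdcs", "sim_sics", "sim_sirs", "sim_sipcs"],
    "HEWA1800": ["hewa1800_sd", "hewa1800_si", "hewa1800_sdcs", "hewa1800_sics"],
    "HEWA1000": ["hewa1000_sd", "hewa1000_si", "hewa1000_sdcs", "hewa1000_sics"],
    "WETA1800": ["weta1800_sd", "weta1800_si", "weta1800_sdcs", "weta1800_sics"],
    "Alaska": ["alaska"],
    "Housing": ["house_bay", "house_latitude"],
}


def build_command(dataset):
    """Return the argument list for one run.py call (hypothetical helper)."""
    known = {name for names in DATASETS.values() for name in names}
    if dataset not in known:
        raise ValueError(f"unknown dataset name: {dataset}")
    return [sys.executable, "run.py", "--dataset", dataset]


def run_all(groups=None):
    """Call run.py once per dataset; `groups` optionally restricts the sweep."""
    for group, names in DATASETS.items():
        if groups is not None and group not in groups:
            continue
        for name in names:
            # check=True stops the sweep as soon as one run fails.
            subprocess.run(build_command(name), check=True)
```

Calling run_all(groups=["Simulation"]) would run only the six simulation datasets; run_all() with no argument would run all 21.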
To run any of the following scripts, please install the dependencies in requirements_extra first.
- gen_sim: It produces the simulation datasets. Users can generate as many simulations as they want by sim, and can also change the number of sampling points and the sampling strategy.
- bcv: It splits the training set into blocks based on their geocoordinates, and then assigns the blocks to folds for cross-validation. Users can tune the number of folds by k and the block size by bs.
- cramer: It performs the statistical test on the training and test features and reports the test statistic and p-value. Users can set the significance level by alpha.
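The blocking step described for bcv can be illustrated with a short sketch. This is not the repository's implementation: the function name block_cv_folds is hypothetical, the parameters k and bs only mirror the option names mentioned above, and the square grid and random round-robin assignment of blocks to folds are assumptions made for illustration.

```python
import numpy as np


def block_cv_folds(coords, k=5, bs=1.0, seed=0):
    """Return a fold index per point; points in the same square block share a fold.

    coords: (n, 2) array of x/y coordinates.
    k: number of folds.
    bs: block side length, in the same units as coords.
    """
    coords = np.asarray(coords, dtype=float)
    # Grid cell (column, row) that contains each point.
    cells = np.floor((coords - coords.min(axis=0)) / bs).astype(int)
    # Give every occupied cell a consecutive block id.
    _, block_ids = np.unique(cells, axis=0, return_inverse=True)
    block_ids = block_ids.ravel()
    n_blocks = int(block_ids.max()) + 1
    if n_blocks < k:
        raise ValueError("fewer occupied blocks than folds; reduce bs or k")
    # Shuffle the blocks, then deal them to folds round-robin (assumed strategy).
    rng = np.random.default_rng(seed)
    block_to_fold = np.empty(n_blocks, dtype=int)
    block_to_fold[rng.permutation(n_blocks)] = np.arange(n_blocks) % k
    return block_to_fold[block_ids]


# Usage on synthetic coordinates (not one of the datasets in this repository).
coords = np.random.default_rng(1).uniform(0, 10, size=(200, 2))
folds = block_cv_folds(coords, k=5, bs=2.0)
for f in range(5):
    test_idx = np.flatnonzero(folds == f)   # points held out in fold f
    train_idx = np.flatnonzero(folds != f)  # points used for fitting
```

Because whole blocks are held out together, nearby points tend to end up on the same side of each split, which is the point of blocking for spatially correlated data.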