Refactor Catalogs (#921)
Co-authored-by: Tjalling-dejong <93266159+Tjalling-dejong@users.noreply.github.com>
Co-authored-by: Tjalling-dejong <tjalling.dejong@deltares.nl>
3 people committed May 13, 2024
1 parent 4b4012d commit 39cc2c5
Showing 47 changed files with 5,868 additions and 1,309 deletions.
2 changes: 2 additions & 0 deletions .gitattributes
@@ -1,2 +1,4 @@
* text eol=lf

# GitHub syntax highlighting
pixi.lock linguist-language=YAML
15 changes: 11 additions & 4 deletions .github/pull_request_template.md
@@ -4,13 +4,20 @@ Fixes #<issue number>
## Explanation
Explain how you addressed the bug/feature request, what choices you made and why.

## Checklist
## General Checklist
- [ ] Updated tests or added new tests
- [ ] Branch is up to date with `main`
- [ ] Tests & pre-commit hooks pass
- [ ] Updated documentation if needed
- [ ] Updated changelog.rst if needed
- [ ] For predefined catalogs: update the catalog version in the file itself, the references in data/predefined_catalogs.yml, and the changelog in data/changelog.rst
- [ ] Updated documentation
- [ ] Updated changelog.rst

## Data/Catalog checklist
- [ ] `data/catalogs/predefined_catalogs.yml` has not been modified.
- [ ] None of the old `data_catalog.yml` files have been changed
- [ ] `data/changelog.rst` has been updated
- [ ] New file uses `LF` line endings (done automatically if you used `update_versions.py`)
- [ ] New file has been tested locally
- [ ] Tests have been added using the new file in the test suite

## Additional Notes (optional)
Add any additional notes or information that may be helpful.
2 changes: 1 addition & 1 deletion .github/workflows/docs.yml
@@ -27,7 +27,7 @@ jobs:
test-docs:
defaults:
run:
shell: bash -l {0}
shell: bash -e -l {0}
timeout-minutes: 30
runs-on: ubuntu-latest
steps:
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -32,7 +32,7 @@ jobs:
build:
defaults:
run:
shell: bash -l {0}
shell: bash -e -l {0}
strategy:
fail-fast: false
matrix:
54 changes: 35 additions & 19 deletions data/catalogs/README.rst
@@ -11,25 +11,24 @@ deltares_data

The HydroMT Deltares Data is managed by the HydroMT team.
For adding new data to the deltares_data.yml please follow the conventions given hereinafter.
The data is currently only stored on the deltares server: p:/wflow_global/hydromt
The data is currently only stored on the deltares network drive: p:/wflow_global/hydromt

preferred data formats to download
Preferred data formats to download
-----------------------------------
vector data: flatgeobuf (because they contain a spatial index and are therefore much faster)
raster data (2D): cloud optimized geotiff
raster data (3D): zarr

data storage (p:/wflow_global/hydromt)
--------------------------------------
Data storage in the Deltares network drive (p:/wflow_global/hydromt)
--------------------------------------------------------------------

data used by the geoserver:
DO NOT CHAGE WITHOUT CONSULTATION
DO NOT CHANGE WITHOUT CONSULTATION
- alosdem
- copdem

writing convention:
- lower case
- with underscores
- snake_case (i.e. lower case with underscores)

folder structure:
- no subcategories
@@ -62,24 +61,21 @@ folder structure:
deltares_data.yml
------------------
writing convention:
- lower case
- with underscore
- snake_case (i.e. lower case with underscores)
- the key "data_type" follows this convention but the data type itself is written in CamelCase (RasterDataset/GeoDataFrame/GeoDataset)
- two spaces for indentation

data versioning:
- data always refers to a specific version
- version is indicated within the name of the alias
- short name refers to that version
- convention: [data_name]_v[version_number]
- e.g. eobs_v22.0e
- convention: `[data_name]_v[version_number]` (e.g. `eobs_v22.0e`)

structure per data set:
- use placeholders where possible
- order the data sets alphabetically
- order the components of each data set alphabetically
- for adding meta data use the following optional keys:

```yaml
category:
notes:
paper_doi:
@@ -89,12 +85,32 @@ source_license:
source_url:
source_version:
unit:
```

updates
Updates
-------

- create new branch on github
- make changes and bump the version in the global meta section using `calendar versioning <https://calver.org/>`
- test your yml file (Can the added/changed data sources be read through HydroMT?)
- create pull request
- add new version to hydromt\data\predefined_catalogs.yml
To preserve reproducibility for older versions of the catalogs, we DO NOT modify old catalogs in any way. Even changing whitespace in a file is problematic because the catalogs are retrieved and verified by their hashes. Instead, we use a versioning system for the catalog files themselves. If you want to update one of the catalogs, follow these steps:

1. Create a new branch on GitHub.
2. Make a new folder with the name of the version you are going to create.
3. Add the new version of the data catalog to this new folder, and make sure it is called `data_catalog.yml`. You can also copy the latest data catalog into the new folder and simply edit the copied version.
4. Bump the version in the global meta section using semantic versioning.
5. Run `update_versions.py`. This creates a registry file with the versions and SHA256 hashes of the data catalogs. The files must use Linux-style line endings (LF) rather than Windows-style line endings (CRLF) so that the hashes stay consistent; otherwise pooch will not be able to find the catalogs. If you are updating from Windows, the CRLF -> LF conversion is done automatically for you.
6. Test your yml file (for more information on testing, see the Testing section).
7. Create a pull request targeting the `main` branch.
8. Once the pull request gets merged into the `main` branch, the new catalog version should be available to all HydroMT users.
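
The line-ending and hashing requirement in step 5 can be illustrated with a minimal sketch. This is not the actual `update_versions.py`; the function names `write_lf_and_hash` and `registry_line` and the version `v0.1.0` are hypothetical, and the only assumption is that each registry entry has the form `<version>/data_catalog.yml <sha256>`, as in the `registry.txt` files added in this commit.

```python
import hashlib
import tempfile
from pathlib import Path


def write_lf_and_hash(path: Path) -> str:
    """Rewrite `path` with LF line endings and return its SHA256 hex digest."""
    data = path.read_bytes().replace(b"\r\n", b"\n")
    path.write_bytes(data)
    return hashlib.sha256(data).hexdigest()


def registry_line(catalog_root: Path, catalog_file: Path) -> str:
    """Format one entry as '<version>/data_catalog.yml <sha256>'."""
    relative = catalog_file.relative_to(catalog_root).as_posix()
    return f"{relative} {write_lf_and_hash(catalog_file)}"


# Demo on a throwaway directory; "v0.1.0" is a made-up version.
root = Path(tempfile.mkdtemp())
catalog = root / "v0.1.0" / "data_catalog.yml"
catalog.parent.mkdir()
catalog.write_bytes(b"meta:\r\n  version: v0.1.0\r\n")  # CRLF on purpose
print(registry_line(root, catalog))
```

Because the file is rewritten with LF endings before hashing, the same catalog produces the same digest whether it was edited on Windows or Linux.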

Testing
-------
If you want to make catalogs for HydroMT testing purposes, or to update a catalog with new data, we must first ensure that HydroMT can properly read and use your data. To do this, we need to test the data with HydroMT in our CI environment on github. DO NOT modify the `data/catalogs/predefined_catalogs.yml` file to do this. The use of this file is deprecated and it is maintained for backwards compatibility, but should no longer be used.

Testing a new data catalog should be fairly straightforward once it is created. The level of testing we require to add new catalogs can vary depending on the size, importance, and popularity of the set. In order of importance, the tests that should be done are:

1. Instantiate a catalog, and retrieve the dataset from it using the appropriate `get_*` function (e.g. `get_rasterdataset` for raster data)
2. The dataset should slice properly in whatever ways are appropriate. (e.g. requesting a dataset with only certain variables should return only that data if the driver supports it)
3. If the dataset requires special logic to merge several parts, please add tests to demonstrate that this works correctly as well.
4. Units should be accounted for properly using properties such as `unit`, `unit_add`, and `unit_mult` where appropriate.
5. Perform whatever domain-specific quality assurance is appropriate, for example to avoid rounding errors near the boundaries.

At a minimum, point 1 must be verified locally by the author of the PR, and preferably a test should be added for it in our CI as well. Depending on context, we may ask you to verify other points on this list as well.
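
As an illustration of point 4, the following is a minimal pytest-style sketch of a unit check. It does not call HydroMT; `convert_units` is a hypothetical stand-in, and it assumes that `unit_mult` is applied before `unit_add` (value * unit_mult + unit_add). In a real test the values would come from the dataset returned by the appropriate `get_*` function.

```python
def convert_units(values, unit_mult=1.0, unit_add=0.0):
    """Hypothetical stand-in for a unit conversion: multiply first, then add."""
    return [value * unit_mult + unit_add for value in values]


def test_kelvin_to_celsius():
    # A catalog entry with unit_add = -273.15 should turn Kelvin into Celsius.
    converted = convert_units([273.15, 300.0], unit_add=-273.15)
    assert abs(converted[0] - 0.0) < 1e-9
    assert abs(converted[1] - 26.85) < 1e-9


def test_metres_to_millimetres():
    # A catalog entry with unit_mult = 1000 should turn m into mm.
    assert convert_units([0.002], unit_mult=1000.0) == [2.0]


test_kelvin_to_celsius()
test_metres_to_millimetres()
```

Comparing against a few hand-computed values like this catches a swapped sign or a mistyped property name before the catalog is released.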
4 changes: 4 additions & 0 deletions data/catalogs/artifact_data/registry.txt
@@ -0,0 +1,4 @@
v0.0.6/data_catalog.yml 5d9e47158185f1afbf793db68c887f6e6b119d7ffd3edfbc198e5ae3a9d760f3
v0.0.7/data_catalog.yml 8c4aa8e5bc28fd9a6d25b93d73dc091dd9aa6beec3de48dc2a56b57aafe415ee
v0.0.8/data_catalog.yml c1cf2229eeb93607ea881fdbced45ebf43471271f0b7d878db8183734274cd88
v0.0.9/data_catalog.yml a074379f3ef244f3a860ec40165163538b6d690d8a3cbc8c6e883e02a8936258
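
A registry like the one above can be checked against a downloaded file with a short sketch such as the following. The helper names `parse_registry` and `matches_registry` are hypothetical; this only mirrors the idea of comparing a file's SHA256 digest with the registry entry and is not how pooch or HydroMT implement it.

```python
import hashlib

# Two entries copied from the registry above.
REGISTRY_TEXT = """\
v0.0.8/data_catalog.yml c1cf2229eeb93607ea881fdbced45ebf43471271f0b7d878db8183734274cd88
v0.0.9/data_catalog.yml a074379f3ef244f3a860ec40165163538b6d690d8a3cbc8c6e883e02a8936258
"""


def parse_registry(text: str) -> dict[str, str]:
    """Map each '<path> <sha256>' line to {path: sha256}, skipping blank lines."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, digest = line.split()
        entries[name] = digest
    return entries


def matches_registry(registry: dict[str, str], name: str, content: bytes) -> bool:
    """Return True if `content` hashes to the digest registered under `name`."""
    expected = registry.get(name)
    return expected is not None and hashlib.sha256(content).hexdigest() == expected


registry = parse_registry(REGISTRY_TEXT)
print(sorted(registry))
```

Any change to a released catalog, including a change in line endings, alters the digest and makes `matches_registry` return `False`, which is why old catalog files must stay untouched.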