Fix issue with CORDEX datasets requiring different dataset tags for downloads and fixes #2066

ljoakim · 2023-05-25T13:42:54Z

Description

This PR fixes the inconsistency in expected dataset tag for CORDEX datasets.

dataset is expected to contain rcm_name only, instead of {institute}-{rcm_name} as was previously required for fixes to be loaded/applied.
files in cmor/_fixes/cordex/ has been renamed to reflect the previous point.
tests have been updated.
the config-developer.yml has the following changes for CORDEX entries:
- for spec, BADC, input_file, {dataset} has been replaced by {institute}-{dataset}, and for output_file, {dataset} has been replaced by {institute}_{dataset}
- entries for DKRZ and SYNDA has been added.
- ESGF still uses {dataset} (instead of {institute}-{dataset}) in order for automatic download and file lookup to work together.

Note that this change will likely break recipes that has {institute}-{rcm_name} in dataset tag in CORDEX datasets, and it is incompatible with previous custom config-developer-yml users may have (see comment). See instructions below.

Closes #2032

Instructions to get recipes working after change:

The dataset facet must be changed not to include institute, and institute facet must be given, e.g.:

- {project: CORDEX, driver: ICHEC-EC-EARTH, dataset: SMHI-RCA4, ensemble: r1, mip: mon}

must be changed to:

- {project: CORDEX, driver: ICHEC-EC-EARTH, dataset: RCA4, ensemble: r1, mip: mon, institute: SMHI}

In a custom config-developer.yml, the drs templates in the CORDEX section need to be updated to reflect the change, e.g.:

spec: '{domain}/{institute}/{driver}/{exp}/{ensemble}/{dataset}/{rcm_version}/{mip}/{short_name}'

should be updated to

spec: '{domain}/{institute}/{driver}/{exp}/{ensemble}/{institute}-{dataset}/{rcm_version}/{mip}/{short_name}'

The same goes for the input_file entry:

input_file: '{short_name}_{domain}_{driver}_{exp}_{ensemble}_{dataset}_{rcm_version}_{mip}*.nc'

should be changed to

input_file: '{short_name}_{domain}_{driver}_{exp}_{ensemble}_{institute}-{dataset}_{rcm_version}_{mip}*.nc'

Note that this does not apply to the ESGF entry, which should be kept as-is.

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

🧪 The new functionality is relevant and scientifically sound
🛠 This pull request has a descriptive title and labels
🛠 Code is written according to the code quality guidelines
🧪 and 🛠 Documentation is available
🛠 Unit tests have been added
~~🛠 Changes are backward compatible~~ No
🛠 The list of authors is up to date
🛠 All checks below this pull request were successful

To help with the number pull requests:

🙏 We kindly ask you to review two other open pull requests in this repository

codecov · 2023-05-25T13:49:56Z

Codecov Report

Merging #2066 (77aad28) into main (be1b790) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #2066   +/-   ##
=======================================
  Coverage   93.28%   93.29%           
=======================================
  Files         238      238           
  Lines       12843    12848    +5     
=======================================
+ Hits        11981    11986    +5     
  Misses        862      862

Files	Coverage Δ
...or/_fixes/cordex/cnrm_cerfacs_cnrm_cm5/aladin63.py	`100.00% <ø> (ø)`
...xes/cordex/cnrm_cerfacs_cnrm_cm5/hadrem3_ga7_05.py	`100.00% <ø> (ø)`
...re/cmor/_fixes/cordex/ichec_ec_earth/cclm4_8_17.py	`100.00% <ø> (ø)`
...mor/_fixes/cordex/ichec_ec_earth/hadrem3_ga7_05.py	`100.00% <ø> (ø)`
...core/cmor/_fixes/cordex/ichec_ec_earth/racmo22e.py	`100.00% <ø> (ø)`
...mvalcore/cmor/_fixes/cordex/ichec_ec_earth/rca4.py	`100.00% <ø> (ø)`
...core/cmor/_fixes/cordex/ichec_ec_earth/remo2015.py	`100.00% <ø> (ø)`
...core/cmor/_fixes/cordex/miroc_miroc5/cclm4_8_17.py	`100.00% <ø> (ø)`
...alcore/cmor/_fixes/cordex/miroc_miroc5/remo2015.py	`100.00% <ø> (ø)`
...valcore/cmor/_fixes/cordex/miroc_miroc5/wrf361h.py	`100.00% <ø> (ø)`
... and 12 more

sloosvel

Thanks for the work @ljoakim ! I was able to load the data in DKRZ by just modifying the datasets name to just contain the so called rcm_name.

However there is one small issue when using wildcards in the recipe. For instance, a dataset defined like this:

  - {
     project: CORDEX,
     dataset: '*',
     institute: ICTP,
     exp: '*',
     ensemble: '*',
     mip: 'mon',
     short_name: tas,
     rcm_version: '*',
     driver: '*',
     domain: 'EUR-11',
     timerange: '*/P1Y'
   }

will try to find the following files, in which the institute is duplicated in the dataset name:

2023-05-29 10:01:12,621 UTC [2545495] ERROR   Looked for files matching
/pool/data/CORDEX/data/cordex/output/EUR-11/ICTP/ICTP/CNRM-CERFACS-CNRM-CM5/historical/ICTP-ICTP-RegCM4-6/v2/mon/tas/*/tas_EUR-11_ICTP_CNRM-CERFACS-CNRM-CM5_historical_ICTP-ICTP-RegCM4-6_v2_mon*.nc
/work/ik1017/C3SCORDEX/data/c3s-cordex/output/EUR-11/ICTP/ICTP/CNRM-CERFACS-CNRM-CM5/historical/ICTP-ICTP-RegCM4-6/v2/mon/tas/*/tas_EUR-11_ICTP_CNRM-CERFACS-CNRM-CM5_historical_ICTP-ICTP-RegCM4-6_v2_mon*.nc

It looks like this is because of the substitution in:

ESMValCore/esmvalcore/local.py

Line 525 in 2e7e384

def _path2facets(path: Path, drs: str) -> dict[str, str]:

Which is substituting the institute facet by ICTP in this case, and the dataset facet by ICTP-RegCM4-6. I guess this function should be modified to take that into account so that the right files are found.

ljoakim · 2023-05-29T13:04:31Z

Tough one. It seems _path2facets(...) expects all facets to be separated by a path separator / when building the facet map from the path.

The path parts cannot be split on -, since it will occur multiple times in the path. I was thinking of using the drs string to figure out how to split {institute}-{dataset}, but - may occur in both {institute} and {dataset}, so there is generally no unique way to make that split.

This seems to prohibit the use of any other character than / between facets in the config-developer.yml entries. I may be missing something obvious, but I think I need some help to go forward on this, do you have any suggestions? Including @bouweandela and @zklaus.

sloosvel · 2023-05-29T15:21:50Z

I think the easiest would be to treat this separately, as you say, trying to get rid of characters just because they are separated by - is not acurate because many datasets have dashes in their name. Maybe doing something like this in find_files, after updating the facets

    for filename in filenames:
        file = LocalFile(filename)
        file.facets.update(_path2facets(file, drs))
        if facets['project'] == 'CORDEX'
            institute = file.facets.get('institute')
            file.facets['dataset'] = file.facets['dataset'].replace(f'{institute}-', '')
        if file.facets.get('version') == 'latest':
            filter_latest = True
        files.append(file)

Also in this way, if facets['dataset'] is already correct nothing is going to be replaced. Or you could use regex to find if the institute is in the pattern of dataset.

bouweandela · 2023-05-30T08:18:24Z

My recommendation would be to create a new facet, e.g. called institute_and_dataset, that contains the institute-dataset value and use that in the BADC path template in config-developer.yml. The values can then be added in esmvalcore/config/extra_facets/cmip3-institutes.yml, similar to how the institutes are defined there, so they will be automatically added when parsing the recipe. That way there is no need to add special code to handle this particular project, which has the advantage that we keep the code as generic as possible.

ljoakim · 2023-05-31T13:07:08Z

Thanks for the suggestions!

I'm doing some experimentation, and thought I'd report my progress so far. I tried the first approach (the fix around _path2facets(...)), which works with a minor addition. I'm now implementing the other approach, and I've pushed a commit with changes related to that.

I've added a new file extra_facets/cordex-institute_and_datasets.yml (not yet complete) and switched to using a institute_and_dataset facet in config-developer.yml.

I'm using the problematic dataset specification given by @sloosvel above. When setting dataset: 'RegCM4-6', it works, but when keeping the wildcard, dataset: '*', I get the following error:

main_log_debug_20230531_123427.txt

The extra facet institute_and_dataset fails to be added, because '*' does not match any of the dataset facets in the new extra facets file. I assume a wildcard should be supported?

I can get passed this by e.g. adding institute_and_dataset: '*' or institute_and_dataset: 'ICTP-RegCM4-6' to the recipe dataset specification, but then I get another error:

main_log_debug_20230531_123459.txt

which seems to occur because dataset facet is not extracted from the path template specified in config-developer.yml. I get the sense I would need a mapping dataset_and_institute to dataset here, i.e. the other way around from the extra facets file. This is a bit unknown territory for me, so any suggestions would be appreciated. I'll continue my exploration meanwhile. Thanks.

ljoakim · 2023-06-08T07:30:26Z

I got a bit stuck on this last week, and had some time to look into it again. I didn't find any solutions to get past the errors I got (see previous post), which may very well be due to my limited experience :). I have now instead tried an approach similar to what was first suggested by @sloosvel with some additional logic to _path2facets(...). I know this is still only required by the CORDEX project, but at least the code is a bit more generic. The implementation uses regular expressions to extract the additional facet from drs parts which are joined with a '-' and where one facet is known. I haven't pushed any commits, I wanted your input first:

def _path2facets(path: Path, drs: str) -> dict[str, str]:
    """Extract facets from a path using a DRS like '{facet1}/{facet2}'."""
    keys = []
    for key in re.findall(r"{(.*?)}[^-]", f"{drs} "):
        key = key.split('.')[0]  # Remove trailing .lower and .upper
        keys.append(key)
    start, end = -len(keys) - 1, -1
    values = path.parts[start:end]
    facets = {
        key: values[idx] for idx, key in enumerate(keys) if "{" not in key
    }

    if len(facets) != len(keys):
        # Extract '-'-separated facet: {facet1}-{facet2}, where
        # either facet1 or facet2 is already known.
        re_facets = facets.copy()
        for idx, key in enumerate(keys):
            for facet in re.findall(r"{(.*?)}", f"{{{key}}}"):
                if facet not in re_facets:
                    re_facets[facet] = rf"(?P<{facet}>.*)"
            facet_found = (
                re.search(f"{{{key}}}".format(**re_facets), values[idx])
            )
            if facet_found is not None:
                facets.update(facet_found.groupdict())

    return facets

sloosvel · 2023-06-08T07:52:20Z

Sorry @ljoakim I have not had the chance to look at this, let me see if there is any way to go through the facets issue and if I cannot see any solution either, maybe we can go for your latest approach?

bouweandela · 2023-06-08T11:40:19Z

That approach looks fine to me, I really like that is not project specific. Would it be possible to combine that with the feature proposed here #1943 without too much trouble?

In general, I would say that if someone chooses to organize their data in a way that makes it impossible to figure out what the facets are, it is their own problem that wildcards do not work, but in this specific case, it would be nice to add support for it because it is ESGF that is deviating from the standard set by the project.

ljoakim · 2023-06-08T12:57:29Z

@bouweandela I will look into it!

sloosvel · 2023-06-15T11:03:01Z

Thanks @ljoakim , with the latest changes there are no more issues when using wildcards. Maybe you could add a test in https://github.com/ESMValGroup/ESMValCore/blob/main/tests/unit/local/test_facets.py , to cover the changes in path2facets more explicitly. Unless @bouweandela has further comments, I would say this is almost ready.

ljoakim · 2023-06-15T11:18:19Z

I'll look into adding a test @sloosvel!

Some comments regarding #1943:

I simplified the extraction here, because I don't want the fix to be more general than necessary at the moment, in order to have less to reproduce for #1943. I don't see any issues with reproducing this for a path drs, although the approach will be slightly different (given that the implementation of the facet extraction is different). However, there is an additional complication with the input filename. It will not be possible to extract the institute and dataset from the filename alone, since neither the institute nor the dataset occur by themselves in the filename. So for cordex, this will require disabling extraction from filename, or some extra logic to skip extraction of institute and dataset.

sloosvel · 2023-09-19T14:18:28Z

Works fine for what I have tried, the branch needs updating but other than that I am happy to approve this.

ljoakim · 2023-09-21T06:38:21Z

Good to hear! For my part I think the config-developer.yml is the file that may require an extra pair of eyes, since I've really only run this using the new SYNDA drs. I was also waiting for a comment on #1943 from @bouweandela, since that issue is overlapping with this one regarding the facet-from-path extraction, and we may need to synchronize this issue with that one (?).

I'm currently a bit short on time to work on this. I'm using esmvalcore for fixing cordex files, but since I don't need the recipe/diagnostics pipeline, I only use the cordex fixes from the esmvalcore package, i.e. I'm not running or testing the full pipeline at the moment.

Apart from the facet extraction code this fix is mostly renamings, so if there are any pressing cordex fixes waiting to be merged, this MR should not be in the way.

bouweandela · 2023-10-09T10:48:17Z

I was also waiting for a comment on #1943 from @bouweandela, since that issue is overlapping with this one regarding the facet-from-path extraction, and we may need to synchronize this issue with that one (?).

My apologies for not responding earlier, I have very limited time to work on ESMValtool at the moment. Please go ahead with this pull request and we'll see about getting it to work with #1943 later.

…rcm name.

…heet.

bouweandela · 2023-10-09T13:46:26Z

Thanks @ljoakim! Before merging, could you please update the description of the pull request and add some instructions on how to upgrade recipes and custom config-developer.yml files for users who are affected by the backward-incompatible changes here?

ljoakim · 2023-10-09T14:29:29Z

I've updated the description with instructions, I hope they are clear.

ljoakim self-assigned this May 29, 2023

ljoakim added the backwards incompatible change label May 29, 2023

ljoakim requested a review from sloosvel May 29, 2023 07:20

ljoakim marked this pull request as ready for review May 29, 2023 07:22

sloosvel reviewed May 29, 2023

View reviewed changes

ljoakim mentioned this pull request Jun 22, 2023

Add a feature to read facets from filenames #1943

Draft

9 tasks

ljoakim force-pushed the CORDEX_dataset_tag_is_rcm_name branch from 1249815 to cf15988 Compare October 9, 2023 10:46

ljoakim force-pushed the CORDEX_dataset_tag_is_rcm_name branch from cf15988 to 388c46d Compare October 9, 2023 10:56

zklaus added this to the v2.10.0 milestone Oct 9, 2023

zklaus approved these changes Oct 9, 2023

View reviewed changes

ljoakim added 6 commits October 9, 2023 15:27

Renaming CORDEX fixes files to rcm name only (exclude institute).

74c6dc6

Updating comments in CORDEX fixes files.

2f4355f

Updating config-developer.yml for CORDEX dataset tag to only contain …

d06825b

…rcm name.

Updating CORDEX fixes tests.

df000ff

Adding extra facets institute_and_dataset to CORDEX

edcf669

Updating CORDEX extra facets file based on ESGF CORDEX registration s…

0d21075

…heet.

ljoakim added 3 commits October 9, 2023 15:27

Handle CORDEX drs with hyphen between facets

6697834

Added test for facet extraction of cordex-like drs

b072eb1

Added test for ambiguous case with multiple hyphens

77aad28

ljoakim force-pushed the CORDEX_dataset_tag_is_rcm_name branch from 388c46d to 77aad28 Compare October 9, 2023 13:27

bouweandela merged commit 0cea725 into main Oct 9, 2023
4 checks passed

bouweandela deleted the CORDEX_dataset_tag_is_rcm_name branch October 9, 2023 14:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issue with CORDEX datasets requiring different dataset tags for downloads and fixes #2066

Fix issue with CORDEX datasets requiring different dataset tags for downloads and fixes #2066

ljoakim commented May 25, 2023 •

edited

codecov bot commented May 25, 2023 •

edited

sloosvel left a comment

ljoakim commented May 29, 2023

sloosvel commented May 29, 2023

bouweandela commented May 30, 2023

ljoakim commented May 31, 2023

ljoakim commented Jun 8, 2023

sloosvel commented Jun 8, 2023 •

edited

bouweandela commented Jun 8, 2023

ljoakim commented Jun 8, 2023

sloosvel commented Jun 15, 2023

ljoakim commented Jun 15, 2023 •

edited

sloosvel commented Sep 19, 2023

ljoakim commented Sep 21, 2023

bouweandela commented Oct 9, 2023

bouweandela commented Oct 9, 2023

ljoakim commented Oct 9, 2023

Fix issue with CORDEX datasets requiring different dataset tags for downloads and fixes #2066

Fix issue with CORDEX datasets requiring different dataset tags for downloads and fixes #2066

Conversation

ljoakim commented May 25, 2023 • edited

Description

Instructions to get recipes working after change:

Before you get started

Checklist

codecov bot commented May 25, 2023 • edited

Codecov Report

sloosvel left a comment

Choose a reason for hiding this comment

ljoakim commented May 29, 2023

sloosvel commented May 29, 2023

bouweandela commented May 30, 2023

ljoakim commented May 31, 2023

ljoakim commented Jun 8, 2023

sloosvel commented Jun 8, 2023 • edited

bouweandela commented Jun 8, 2023

ljoakim commented Jun 8, 2023

sloosvel commented Jun 15, 2023

ljoakim commented Jun 15, 2023 • edited

sloosvel commented Sep 19, 2023

ljoakim commented Sep 21, 2023

bouweandela commented Oct 9, 2023

bouweandela commented Oct 9, 2023

ljoakim commented Oct 9, 2023

ljoakim commented May 25, 2023 •

edited

codecov bot commented May 25, 2023 •

edited

sloosvel commented Jun 8, 2023 •

edited

ljoakim commented Jun 15, 2023 •

edited