Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with CORDEX datasets requiring different dataset tags for downloads and fixes #2066

Merged
merged 9 commits into from Oct 9, 2023

Conversation

ljoakim
Copy link
Contributor

@ljoakim ljoakim commented May 25, 2023

Description

This PR fixes the inconsistency in expected dataset tag for CORDEX datasets.

  • dataset is expected to contain rcm_name only, instead of {institute}-{rcm_name} as was previously required for fixes to be loaded/applied.
  • files in cmor/_fixes/cordex/ has been renamed to reflect the previous point.
  • tests have been updated.
  • the config-developer.yml has the following changes for CORDEX entries:
    • for spec, BADC, input_file, {dataset} has been replaced by {institute}-{dataset}, and for output_file, {dataset} has been replaced by {institute}_{dataset}
    • entries for DKRZ and SYNDA has been added.
    • ESGF still uses {dataset} (instead of {institute}-{dataset}) in order for automatic download and file lookup to work together.

Note that this change will likely break recipes that has {institute}-{rcm_name} in dataset tag in CORDEX datasets, and it is incompatible with previous custom config-developer-yml users may have (see comment). See instructions below.

Closes #2032


Instructions to get recipes working after change:

The dataset facet must be changed not to include institute, and institute facet must be given, e.g.:

- {project: CORDEX, driver: ICHEC-EC-EARTH, dataset: SMHI-RCA4, ensemble: r1, mip: mon}

must be changed to:

- {project: CORDEX, driver: ICHEC-EC-EARTH, dataset: RCA4, ensemble: r1, mip: mon, institute: SMHI}

In a custom config-developer.yml, the drs templates in the CORDEX section need to be updated to reflect the change, e.g.:

spec: '{domain}/{institute}/{driver}/{exp}/{ensemble}/{dataset}/{rcm_version}/{mip}/{short_name}'

should be updated to

spec: '{domain}/{institute}/{driver}/{exp}/{ensemble}/{institute}-{dataset}/{rcm_version}/{mip}/{short_name}'

The same goes for the input_file entry:

input_file: '{short_name}_{domain}_{driver}_{exp}_{ensemble}_{dataset}_{rcm_version}_{mip}*.nc'

should be changed to

input_file: '{short_name}_{domain}_{driver}_{exp}_{ensemble}_{institute}-{dataset}_{rcm_version}_{mip}*.nc'

Note that this does not apply to the ESGF entry, which should be kept as-is.


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the πŸ›  Technical or πŸ§ͺ Scientific review.


To help with the number pull requests:

@codecov
Copy link

codecov bot commented May 25, 2023

Codecov Report

Merging #2066 (77aad28) into main (be1b790) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #2066   +/-   ##
=======================================
  Coverage   93.28%   93.29%           
=======================================
  Files         238      238           
  Lines       12843    12848    +5     
=======================================
+ Hits        11981    11986    +5     
  Misses        862      862           
Files Coverage Ξ”
...or/_fixes/cordex/cnrm_cerfacs_cnrm_cm5/aladin63.py 100.00% <ΓΈ> (ΓΈ)
...xes/cordex/cnrm_cerfacs_cnrm_cm5/hadrem3_ga7_05.py 100.00% <ΓΈ> (ΓΈ)
...re/cmor/_fixes/cordex/ichec_ec_earth/cclm4_8_17.py 100.00% <ΓΈ> (ΓΈ)
...mor/_fixes/cordex/ichec_ec_earth/hadrem3_ga7_05.py 100.00% <ΓΈ> (ΓΈ)
...core/cmor/_fixes/cordex/ichec_ec_earth/racmo22e.py 100.00% <ΓΈ> (ΓΈ)
...mvalcore/cmor/_fixes/cordex/ichec_ec_earth/rca4.py 100.00% <ΓΈ> (ΓΈ)
...core/cmor/_fixes/cordex/ichec_ec_earth/remo2015.py 100.00% <ΓΈ> (ΓΈ)
...core/cmor/_fixes/cordex/miroc_miroc5/cclm4_8_17.py 100.00% <ΓΈ> (ΓΈ)
...alcore/cmor/_fixes/cordex/miroc_miroc5/remo2015.py 100.00% <ΓΈ> (ΓΈ)
...valcore/cmor/_fixes/cordex/miroc_miroc5/wrf361h.py 100.00% <ΓΈ> (ΓΈ)
... and 12 more

@ljoakim ljoakim self-assigned this May 29, 2023
@ljoakim ljoakim requested a review from sloosvel May 29, 2023 07:20
@ljoakim ljoakim marked this pull request as ready for review May 29, 2023 07:22
Copy link
Contributor

@sloosvel sloosvel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work @ljoakim ! I was able to load the data in DKRZ by just modifying the datasets name to just contain the so called rcm_name.

However there is one small issue when using wildcards in the recipe. For instance, a dataset defined like this:

  - {
     project: CORDEX,
     dataset: '*',
     institute: ICTP,
     exp: '*',
     ensemble: '*',
     mip: 'mon',
     short_name: tas,
     rcm_version: '*',
     driver: '*',
     domain: 'EUR-11',
     timerange: '*/P1Y'
   }

will try to find the following files, in which the institute is duplicated in the dataset name:

2023-05-29 10:01:12,621 UTC [2545495] ERROR   Looked for files matching
/pool/data/CORDEX/data/cordex/output/EUR-11/ICTP/ICTP/CNRM-CERFACS-CNRM-CM5/historical/ICTP-ICTP-RegCM4-6/v2/mon/tas/*/tas_EUR-11_ICTP_CNRM-CERFACS-CNRM-CM5_historical_ICTP-ICTP-RegCM4-6_v2_mon*.nc
/work/ik1017/C3SCORDEX/data/c3s-cordex/output/EUR-11/ICTP/ICTP/CNRM-CERFACS-CNRM-CM5/historical/ICTP-ICTP-RegCM4-6/v2/mon/tas/*/tas_EUR-11_ICTP_CNRM-CERFACS-CNRM-CM5_historical_ICTP-ICTP-RegCM4-6_v2_mon*.nc

It looks like this is because of the substitution in:

def _path2facets(path: Path, drs: str) -> dict[str, str]:

Which is substituting the institute facet by ICTP in this case, and the dataset facet by ICTP-RegCM4-6. I guess this function should be modified to take that into account so that the right files are found.

@ljoakim
Copy link
Contributor Author

ljoakim commented May 29, 2023

Tough one. It seems _path2facets(...) expects all facets to be separated by a path separator / when building the facet map from the path.

The path parts cannot be split on -, since it will occur multiple times in the path. I was thinking of using the drs string to figure out how to split {institute}-{dataset}, but - may occur in both {institute} and {dataset}, so there is generally no unique way to make that split.

This seems to prohibit the use of any other character than / between facets in the config-developer.yml entries. I may be missing something obvious, but I think I need some help to go forward on this, do you have any suggestions? Including @bouweandela and @zklaus.

@sloosvel
Copy link
Contributor

I think the easiest would be to treat this separately, as you say, trying to get rid of characters just because they are separated by - is not acurate because many datasets have dashes in their name. Maybe doing something like this in find_files, after updating the facets

    for filename in filenames:
        file = LocalFile(filename)
        file.facets.update(_path2facets(file, drs))
        if facets['project'] == 'CORDEX'
            institute = file.facets.get('institute')
            file.facets['dataset'] = file.facets['dataset'].replace(f'{institute}-', '')
        if file.facets.get('version') == 'latest':
            filter_latest = True
        files.append(file)

Also in this way, if facets['dataset'] is already correct nothing is going to be replaced. Or you could use regex to find if the institute is in the pattern of dataset.

@bouweandela
Copy link
Member

My recommendation would be to create a new facet, e.g. called institute_and_dataset, that contains the institute-dataset value and use that in the BADC path template in config-developer.yml. The values can then be added in esmvalcore/config/extra_facets/cmip3-institutes.yml, similar to how the institutes are defined there, so they will be automatically added when parsing the recipe. That way there is no need to add special code to handle this particular project, which has the advantage that we keep the code as generic as possible.

@ljoakim
Copy link
Contributor Author

ljoakim commented May 31, 2023

Thanks for the suggestions!

I'm doing some experimentation, and thought I'd report my progress so far. I tried the first approach (the fix around _path2facets(...)), which works with a minor addition. I'm now implementing the other approach, and I've pushed a commit with changes related to that.

I've added a new file extra_facets/cordex-institute_and_datasets.yml (not yet complete) and switched to using a institute_and_dataset facet in config-developer.yml.

I'm using the problematic dataset specification given by @sloosvel above. When setting dataset: 'RegCM4-6', it works, but when keeping the wildcard, dataset: '*', I get the following error:

main_log_debug_20230531_123427.txt

The extra facet institute_and_dataset fails to be added, because '*' does not match any of the dataset facets in the new extra facets file. I assume a wildcard should be supported?

I can get passed this by e.g. adding institute_and_dataset: '*' or institute_and_dataset: 'ICTP-RegCM4-6' to the recipe dataset specification, but then I get another error:

main_log_debug_20230531_123459.txt

which seems to occur because dataset facet is not extracted from the path template specified in config-developer.yml. I get the sense I would need a mapping dataset_and_institute to dataset here, i.e. the other way around from the extra facets file. This is a bit unknown territory for me, so any suggestions would be appreciated. I'll continue my exploration meanwhile. Thanks.

@ljoakim
Copy link
Contributor Author

ljoakim commented Jun 8, 2023

I got a bit stuck on this last week, and had some time to look into it again. I didn't find any solutions to get past the errors I got (see previous post), which may very well be due to my limited experience :). I have now instead tried an approach similar to what was first suggested by @sloosvel with some additional logic to _path2facets(...). I know this is still only required by the CORDEX project, but at least the code is a bit more generic. The implementation uses regular expressions to extract the additional facet from drs parts which are joined with a '-' and where one facet is known. I haven't pushed any commits, I wanted your input first:

def _path2facets(path: Path, drs: str) -> dict[str, str]:
    """Extract facets from a path using a DRS like '{facet1}/{facet2}'."""
    keys = []
    for key in re.findall(r"{(.*?)}[^-]", f"{drs} "):
        key = key.split('.')[0]  # Remove trailing .lower and .upper
        keys.append(key)
    start, end = -len(keys) - 1, -1
    values = path.parts[start:end]
    facets = {
        key: values[idx] for idx, key in enumerate(keys) if "{" not in key
    }

    if len(facets) != len(keys):
        # Extract '-'-separated facet: {facet1}-{facet2}, where
        # either facet1 or facet2 is already known.
        re_facets = facets.copy()
        for idx, key in enumerate(keys):
            for facet in re.findall(r"{(.*?)}", f"{{{key}}}"):
                if facet not in re_facets:
                    re_facets[facet] = rf"(?P<{facet}>.*)"
            facet_found = (
                re.search(f"{{{key}}}".format(**re_facets), values[idx])
            )
            if facet_found is not None:
                facets.update(facet_found.groupdict())

    return facets

@sloosvel
Copy link
Contributor

sloosvel commented Jun 8, 2023

Sorry @ljoakim I have not had the chance to look at this, let me see if there is any way to go through the facets issue and if I cannot see any solution either, maybe we can go for your latest approach?

@bouweandela
Copy link
Member

That approach looks fine to me, I really like that is not project specific. Would it be possible to combine that with the feature proposed here #1943 without too much trouble?

In general, I would say that if someone chooses to organize their data in a way that makes it impossible to figure out what the facets are, it is their own problem that wildcards do not work, but in this specific case, it would be nice to add support for it because it is ESGF that is deviating from the standard set by the project.

@ljoakim
Copy link
Contributor Author

ljoakim commented Jun 8, 2023

@bouweandela I will look into it!

@sloosvel
Copy link
Contributor

Thanks @ljoakim , with the latest changes there are no more issues when using wildcards. Maybe you could add a test in https://github.com/ESMValGroup/ESMValCore/blob/main/tests/unit/local/test_facets.py , to cover the changes in path2facets more explicitly. Unless @bouweandela has further comments, I would say this is almost ready.

@ljoakim
Copy link
Contributor Author

ljoakim commented Jun 15, 2023

I'll look into adding a test @sloosvel!

Some comments regarding #1943:

I simplified the extraction here, because I don't want the fix to be more general than necessary at the moment, in order to have less to reproduce for #1943. I don't see any issues with reproducing this for a path drs, although the approach will be slightly different (given that the implementation of the facet extraction is different). However, there is an additional complication with the input filename. It will not be possible to extract the institute and dataset from the filename alone, since neither the institute nor the dataset occur by themselves in the filename. So for cordex, this will require disabling extraction from filename, or some extra logic to skip extraction of institute and dataset.

@sloosvel
Copy link
Contributor

Works fine for what I have tried, the branch needs updating but other than that I am happy to approve this.

@ljoakim
Copy link
Contributor Author

ljoakim commented Sep 21, 2023

Good to hear! For my part I think the config-developer.yml is the file that may require an extra pair of eyes, since I've really only run this using the new SYNDA drs. I was also waiting for a comment on #1943 from @bouweandela, since that issue is overlapping with this one regarding the facet-from-path extraction, and we may need to synchronize this issue with that one (?).

I'm currently a bit short on time to work on this. I'm using esmvalcore for fixing cordex files, but since I don't need the recipe/diagnostics pipeline, I only use the cordex fixes from the esmvalcore package, i.e. I'm not running or testing the full pipeline at the moment.

Apart from the facet extraction code this fix is mostly renamings, so if there are any pressing cordex fixes waiting to be merged, this MR should not be in the way.

@ljoakim ljoakim force-pushed the CORDEX_dataset_tag_is_rcm_name branch from 1249815 to cf15988 Compare October 9, 2023 10:46
@bouweandela
Copy link
Member

I was also waiting for a comment on #1943 from @bouweandela, since that issue is overlapping with this one regarding the facet-from-path extraction, and we may need to synchronize this issue with that one (?).

My apologies for not responding earlier, I have very limited time to work on ESMValtool at the moment. Please go ahead with this pull request and we'll see about getting it to work with #1943 later.

@ljoakim ljoakim force-pushed the CORDEX_dataset_tag_is_rcm_name branch from cf15988 to 388c46d Compare October 9, 2023 10:56
@zklaus zklaus added this to the v2.10.0 milestone Oct 9, 2023
@ljoakim ljoakim force-pushed the CORDEX_dataset_tag_is_rcm_name branch from 388c46d to 77aad28 Compare October 9, 2023 13:27
@bouweandela
Copy link
Member

Thanks @ljoakim! Before merging, could you please update the description of the pull request and add some instructions on how to upgrade recipes and custom config-developer.yml files for users who are affected by the backward-incompatible changes here?

@ljoakim
Copy link
Contributor Author

ljoakim commented Oct 9, 2023

I've updated the description with instructions, I hope they are clear.

@bouweandela bouweandela merged commit 0cea725 into main Oct 9, 2023
4 checks passed
@bouweandela bouweandela deleted the CORDEX_dataset_tag_is_rcm_name branch October 9, 2023 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistency in dataset naming for CORDEX: download vs fixes
4 participants