Fix issue with CORDEX datasets requiring different dataset tags for downloads and fixes #2066
Thanks for the work @ljoakim! I was able to load the data at DKRZ just by modifying the dataset name so that it contains only the so-called rcm_name.
However there is one small issue when using wildcards in the recipe. For instance, a dataset defined like this:
- {
    project: CORDEX,
    dataset: '*',
    institute: ICTP,
    exp: '*',
    ensemble: '*',
    mip: 'mon',
    short_name: tas,
    rcm_version: '*',
    driver: '*',
    domain: 'EUR-11',
    timerange: '*/P1Y'
  }
will try to find the following files, in which the institute is duplicated in the dataset name:
2023-05-29 10:01:12,621 UTC [2545495] ERROR Looked for files matching
/pool/data/CORDEX/data/cordex/output/EUR-11/ICTP/ICTP/CNRM-CERFACS-CNRM-CM5/historical/ICTP-ICTP-RegCM4-6/v2/mon/tas/*/tas_EUR-11_ICTP_CNRM-CERFACS-CNRM-CM5_historical_ICTP-ICTP-RegCM4-6_v2_mon*.nc
/work/ik1017/C3SCORDEX/data/c3s-cordex/output/EUR-11/ICTP/ICTP/CNRM-CERFACS-CNRM-CM5/historical/ICTP-ICTP-RegCM4-6/v2/mon/tas/*/tas_EUR-11_ICTP_CNRM-CERFACS-CNRM-CM5_historical_ICTP-ICTP-RegCM4-6_v2_mon*.nc
It looks like this is because of the substitution in:
ESMValCore/esmvalcore/local.py
Line 525 in 2e7e384
def _path2facets(path: Path, drs: str) -> dict[str, str]:
This substitutes the institute facet with ICTP in this case, and the dataset facet with ICTP-RegCM4-6. I guess this function should be modified to take that into account so that the right files are found.
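The duplication can be reproduced in isolation. The following is a minimal sketch, not ESMValCore code: it assumes the directory template contains an {institute}-{dataset} component (as the log output suggests) and that the dataset facet recovered from the path still carries the institute prefix. The names DIR_COMPONENT and render are hypothetical.

```python
# Hypothetical template fragment; assumed from the '{institute}-{dataset}'
# directory level visible in the log output above.
DIR_COMPONENT = "{institute}-{dataset}"


def render(facets: dict) -> str:
    """Fill the template fragment with the given facets."""
    return DIR_COMPONENT.format(**facets)


# Facets as they come back from the path when the directory name
# 'ICTP-RegCM4-6' is taken verbatim as the dataset.
from_path = {"institute": "ICTP", "dataset": "ICTP-RegCM4-6"}
# Facets as intended after this PR: dataset holds only the rcm_name.
intended = {"institute": "ICTP", "dataset": "RegCM4-6"}

print(render(from_path))  # the institute appears twice
print(render(intended))
```

Only the second set of facets renders to a directory name that exists on disk, which is why the lookup with wildcards comes back empty.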
Tough one. It seems The path parts cannot be split on This seems to prohibit the use of any other character than |
I think the easiest would be to treat this separately, as you say, trying to get rid of characters just because they are separated by

for filename in filenames:
    file = LocalFile(filename)
    file.facets.update(_path2facets(file, drs))
    if facets['project'] == 'CORDEX':
        institute = file.facets.get('institute')
        file.facets['dataset'] = file.facets['dataset'].replace(f'{institute}-', '')
    if file.facets.get('version') == 'latest':
        filter_latest = True
    files.append(file)

Also in this way, if facets['dataset'] is already correct, nothing is going to be replaced. Or you could use regex to find if the
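One detail of the snippet above is that str.replace removes every occurrence of the substring, not only a leading one. A prefix-only variant could look like the sketch below. The helper name strip_institute_prefix is hypothetical and is not part of ESMValCore.

```python
def strip_institute_prefix(dataset: str, institute: str) -> str:
    """Drop a leading '<institute>-' from dataset, if present."""
    prefix = f"{institute}-"
    if institute and dataset.startswith(prefix):
        return dataset[len(prefix):]
    return dataset


# Name taken from the example in this thread.
print(strip_institute_prefix("ICTP-RegCM4-6", "ICTP"))
# An already-correct name is returned unchanged.
print(strip_institute_prefix("RegCM4-6", "ICTP"))
# Contrived name, for illustration only: replace() would also remove the
# inner 'ICTP-', while the prefix check leaves the name alone.
print("A-ICTP-B".replace("ICTP-", ""))
print(strip_institute_prefix("A-ICTP-B", "ICTP"))
```

For the dataset names discussed here both variants give the same result; the difference only matters for names that contain the institute somewhere other than at the start.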
My recommendation would be to create a new facet, e.g. called |
Thanks for the suggestions! I'm doing some experimentation, and thought I'd report my progress so far. I tried the first approach (the fix around I've added a new file I'm using the problematic dataset specification given by @sloosvel above. When setting main_log_debug_20230531_123427.txt The extra facet I can get past this by e.g. adding main_log_debug_20230531_123459.txt which seems to occur because
I got a bit stuck on this last week, and had some time to look into it again. I didn't find any solutions to get past the errors I got (see previous post), which may very well be due to my limited experience :). I have now instead tried an approach similar to what was first suggested by @sloosvel with some additional logic to

def _path2facets(path: Path, drs: str) -> dict[str, str]:
    """Extract facets from a path using a DRS like '{facet1}/{facet2}'."""
    keys = []
    for key in re.findall(r"{(.*?)}[^-]", f"{drs} "):
        key = key.split('.')[0]  # Remove trailing .lower and .upper
        keys.append(key)
    start, end = -len(keys) - 1, -1
    values = path.parts[start:end]
    facets = {
        key: values[idx] for idx, key in enumerate(keys) if "{" not in key
    }
    if len(facets) != len(keys):
        # Extract '-'-separated facet: {facet1}-{facet2}, where
        # either facet1 or facet2 is already known.
        re_facets = facets.copy()
        for idx, key in enumerate(keys):
            for facet in re.findall(r"{(.*?)}", f"{{{key}}}"):
                if facet not in re_facets:
                    re_facets[facet] = rf"(?P<{facet}>.*)"
            facet_found = (
                re.search(f"{{{key}}}".format(**re_facets), values[idx])
            )
            if facet_found is not None:
                facets.update(facet_found.groupdict())
    return facets
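To see what the function above does with a combined {institute}-{dataset} directory level, it can be run on its own. The sketch below copies the function from the comment and adds the two imports it needs. The DRS string, the ensemble member r1i1p1, the version directory and the file name are assumptions made up for the demo, modeled on the paths in the log output earlier in this thread.

```python
import re
from pathlib import Path


def _path2facets(path: Path, drs: str) -> dict[str, str]:
    """Extract facets from a path using a DRS like '{facet1}/{facet2}'."""
    keys = []
    # '{a}-{b}' is kept together as one key because of the [^-] lookahead.
    for key in re.findall(r"{(.*?)}[^-]", f"{drs} "):
        key = key.split('.')[0]  # Remove trailing .lower and .upper
        keys.append(key)
    start, end = -len(keys) - 1, -1
    values = path.parts[start:end]
    facets = {
        key: values[idx] for idx, key in enumerate(keys) if "{" not in key
    }
    if len(facets) != len(keys):
        # Resolve combined keys by turning the unknown facet into a
        # named regex group and filling in the already known one.
        re_facets = facets.copy()
        for idx, key in enumerate(keys):
            for facet in re.findall(r"{(.*?)}", f"{{{key}}}"):
                if facet not in re_facets:
                    re_facets[facet] = rf"(?P<{facet}>.*)"
            facet_found = (
                re.search(f"{{{key}}}".format(**re_facets), values[idx])
            )
            if facet_found is not None:
                facets.update(facet_found.groupdict())
    return facets


# Assumed DRS and path, for illustration only.
drs = ("{domain}/{institute}/{driver}/{exp}/{ensemble}/"
       "{institute}-{dataset}/{rcm_version}/{mip}/{short_name}/{version}")
path = Path(
    "/data/cordex/output/EUR-11/ICTP/CNRM-CERFACS-CNRM-CM5/historical/"
    "r1i1p1/ICTP-RegCM4-6/v2/mon/tas/v20190502/example.nc"
)
facets = _path2facets(path, drs)
print(facets["institute"], facets["dataset"])
```

Because institute is already known from its own directory level, the combined level only has to supply dataset, so the institute prefix is no longer carried over into the dataset facet.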
Sorry @ljoakim I have not had the chance to look at this, let me see if there is any way to go through the facets issue and if I cannot see any solution either, maybe we can go for your latest approach? |
That approach looks fine to me, I really like that it is not project specific. Would it be possible to combine that with the feature proposed in #1943 without too much trouble? In general, I would say that if someone chooses to organize their data in a way that makes it impossible to figure out what the facets are, it is their own problem that wildcards do not work, but in this specific case, it would be nice to add support for it because it is ESGF that is deviating from the standard set by the project.
@bouweandela I will look into it! |
Thanks @ljoakim, with the latest changes there are no more issues when using wildcards. Maybe you could add a test in https://github.com/ESMValGroup/ESMValCore/blob/main/tests/unit/local/test_facets.py to cover the changes in path2facets more explicitly. Unless @bouweandela has further comments, I would say this is almost ready.
I'll look into adding a test @sloosvel! Some comments regarding #1943: I simplified the extraction here, because I don't want the fix to be more general than necessary at the moment, in order to have less to reproduce for #1943. I don't see any issues with reproducing this for a path drs, although the approach will be slightly different (given that the implementation of the facet extraction is different). However, there is an additional complication with the input filename. It will not be possible to extract the institute and dataset from the filename alone, since neither the institute nor the dataset occur by themselves in the filename. So for cordex, this will require disabling extraction from filename, or some extra logic to skip extraction of institute and dataset. |
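The filename problem described above can be illustrated with a small sketch. It assumes that only the combined token (for example ICTP-RegCM4-6) is available and that nothing else says where the institute ends. The helper candidate_splits is hypothetical.

```python
def candidate_splits(token: str) -> list[tuple[str, str]]:
    """List every (institute, dataset) pair that '-'-joins back to token."""
    parts = token.split("-")
    return [
        ("-".join(parts[:i]), "-".join(parts[i:]))
        for i in range(1, len(parts))
    ]


# Both results are syntactically valid, so the token alone is ambiguous.
for institute, dataset in candidate_splits("ICTP-RegCM4-6"):
    print(f"institute={institute!r} dataset={dataset!r}")
```

With a path DRS the institute is available from its own directory level, which removes the ambiguity; a filename that contains only the combined token offers no such anchor.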
Works fine for what I have tried, the branch needs updating but other than that I am happy to approve this. |
Good to hear! For my part I think the I'm currently a bit short on time to work on this. I'm using esmvalcore for fixing cordex files, but since I don't need the recipe/diagnostics pipeline, I only use the cordex fixes from the esmvalcore package, i.e. I'm not running or testing the full pipeline at the moment. Apart from the facet extraction code this fix is mostly renamings, so if there are any pressing cordex fixes waiting to be merged, this MR should not be in the way. |
My apologies for not responding earlier, I have very limited time to work on ESMValtool at the moment. Please go ahead with this pull request and we'll see about getting it to work with #1943 later. |
Thanks @ljoakim! Before merging, could you please update the description of the pull request and add some instructions on how to upgrade recipes and custom config-developer.yml files for users who are affected by the backward-incompatible changes here? |
I've updated the description with instructions, I hope they are clear. |
Description
This PR fixes the inconsistency in the expected dataset tag for CORDEX datasets.
- dataset is expected to contain rcm_name only, instead of {institute}-{rcm_name} as was previously required for fixes to be loaded/applied.
- cmor/_fixes/cordex/ has been renamed to reflect the previous point.
- config-developer.yml has the following changes for CORDEX entries:
  - For spec, BADC, and input_file, {dataset} has been replaced by {institute}-{dataset}, and for output_file, {dataset} has been replaced by {institute}_{dataset}.
  - DKRZ and SYNDA have been added.
  - ESGF still uses {dataset} (instead of {institute}-{dataset}) in order for automatic download and file lookup to work together.

Note that this change will likely break recipes that have {institute}-{rcm_name} in the dataset tag of CORDEX datasets, and it is incompatible with previous custom config-developer.yml files users may have (see comment). See instructions below.

Closes #2032

Instructions to get recipes working after change:
- The dataset facet must be changed not to include the institute, and the institute facet must be given, e.g.:
  must be changed to:
- In a custom config-developer.yml, the drs templates in the CORDEX section need to be updated to reflect the change, e.g.:
  should be updated to
- The same goes for the input_file entry:
  should be changed to
- Note that this does not apply to the ESGF entry, which should be kept as-is.
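The recipe change described in the instructions can be sketched in Python as follows. This is an illustration only: the helper upgrade_cordex_entry is hypothetical, the entry is a plain dict standing in for a recipe dataset entry, and the dataset name is the one used in the conversation for this PR.

```python
def upgrade_cordex_entry(entry: dict, institute: str) -> dict:
    """Return a copy of a recipe dataset entry in the new convention."""
    upgraded = dict(entry)
    prefix = f"{institute}-"
    if upgraded["dataset"].startswith(prefix):
        upgraded["dataset"] = upgraded["dataset"][len(prefix):]
    upgraded["institute"] = institute
    return upgraded


old_entry = {
    "project": "CORDEX",
    "dataset": "ICTP-RegCM4-6",
    "domain": "EUR-11",
}
new_entry = upgrade_cordex_entry(old_entry, "ICTP")

# The old '{dataset}' fragment and the new '{institute}-{dataset}'
# fragment should render to the same directory name.
old_dir = "{dataset}".format(**old_entry)
new_dir = "{institute}-{dataset}".format(**new_entry)
print(new_entry)
print(old_dir == new_dir)
```

The final comparison reflects the intent of the config-developer.yml change: paths on disk stay the same, only the way they are assembled from facets differs.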
Checklist
It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.
- 🛠 Changes are backward compatible: No