Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

making PrePARE work with arbitrary CV #685

Open
larsbuntemeyer opened this issue Dec 8, 2022 · 22 comments
Open

making PrePARE work with arbitrary CV #685

larsbuntemeyer opened this issue Dec 8, 2022 · 22 comments
Milestone

Comments

@larsbuntemeyer
Copy link

larsbuntemeyer commented Dec 8, 2022

I would like to open this issue to collect some insight i drew from playing around with PrePARE using our preliminary CORDEX cmor tables. Although i can use cmor to rewrite data for projects without _cmip6_option="CMIP6", i can not use PrePARE to evaluate files due to the following:

  • checking filenames: PrePARE always uses CMOR_DEFAULT_FILE_TEMPLATE to evaluate the file name:
    cmip6_cv.set_cur_dataset_attribute(
    cmip6_cv.FILE_NAME_TEMPLATE,
    cmip6_cv.CMOR_DEFAULT_FILE_TEMPLATE)
    Digging through the C code, i guess there is no way yet to make cmor_CV_checkFilename work with a DRS from the CV? Originally the file_name_template was defined in the input dataset attributes table but can not be recovered from the files global attributes. So in general, i guess it would be nice to have the DRS available from the CV table instead of the input file (also for the cmorization itself probably?). I am not sure what the purpose of the DRS entry in the CV file is after all...
  • consistent checks: The checks that PrePARE does and the checks during cmorization are not totally consistent if files are cmorized without the _cmip6_option. For example, those checks in

    cmor/Src/cmor.c

    Lines 3043 to 3050 in 5f8759e

    if (cmor_has_cur_dataset_attribute(GLOBAL_IS_CMIP6) == 0) {
    ierr += cmor_CV_checkSourceID(cmor_tables[nVarRefTblID].CV);
    ierr += cmor_CV_checkExperiment(cmor_tables[nVarRefTblID].CV);
    ierr += cmor_CV_checkFurtherInfoURL(nVarRefTblID);
    ierr += cmor_CV_checkGrids(cmor_tables[nVarRefTblID].CV);
    ierr += cmor_CV_checkParentExpID(cmor_tables[nVarRefTblID].CV);
    ierr += cmor_CV_checkSubExpID(cmor_tables[nVarRefTblID].CV);
    }
    are deactivated without the the _cmip6_option which makes my CORDEX tables work with cmor. However, PrePARE does not know about this and especially those checks in
    if cmip6_cv.check_subExpID(table) != 0:
    self.errors += 1
    and
    if cmip6_cv.check_parentExpID(table) != 0:
    self.errors += 1
    make it crash with my files and tables.
  • hard coded table names: PrePARE assumes some hard coded rules for finding CV, coordinates, grid tables, etc.. e.g., in
    cmip6_cv.set_cur_dataset_attribute(
    cmip6_cv.GLOBAL_CV_FILENAME,
    cmip6_cv.TABLE_CONTROL_FILENAME)
    This could be easily solved probably using command line arguments.

It seems that PrePARE is tightly coupled to CMIP6 vocabulary with _cmip6_option="CMIP6" although i could make it run with CORDEX tables using a few hacks and i wanted to document some of the pitfalls here. This could probably be fixed quite easily using something like a cmip6_option as a command line argument for PrePARE. However, it still feels not right...

This is probably related to:

@mauzey1
Copy link
Collaborator

mauzey1 commented Dec 12, 2022

@larsbuntemeyer Could you provide some CORDEX dataset files that I could try to debug PrePARE with? Thank you.

@mauzey1
Copy link
Collaborator

mauzey1 commented Jan 23, 2023

CMOR doesn't seem to use the values in the DRS section in CMIP6_CV.json. I'm not sure if it should. @durack1 @taylor13 Any thoughts on this?

Regarding the parts where cmor_CV_checkParentExpID and cmor_CV_checkSubExpID are causing PrePARE to crash, I think this is due to the attributes parent_experiment_id and sub_experiment_id not being found for experiment entries in CORDEX_CV.json.

I think CMOR/PrePARE should be checking required_global_attributes of the CV to see if those attributes need to be present rather than assuming they are required by the MIP. However, I don't see the parent_experiment_id in CMIP6's required global attributes so maybe checking inside the experiment entry for the parent and sub would work better.

@durack1 obs4MIPS and input4MIPs don't have parent and sub experiments. How do they use CMOR/PrePARE?

I agree with using command line arguments with PrePARE to get the right files for the CV.

@larsbuntemeyer
Copy link
Author

Hi @mauzey1 ! Thanks for picking this up again, sorry, for being unresponsive! I'll provide some CORDEX dataset soon!

CMOR doesn't seem to use the values in the DRS section in CMIP6_CV.json

Yes, that's confusing. It uses file and path templates from the dataset input table, not from the CV. On the other hand, PrePARE has no option to override CMOR_DEFAULT_FILE_TEMPLATE and FILE_NAME_TEMPLATE...

I think CMOR/PrePARE should be checking required_global_attributes of the CV to see if those attributes need to be present rather than assuming they are required by the MIP

I think that is achieved by the _cmip6_option, see also my confusion in this comment: #679 (comment). Again, Prepare does not allow to mirror this behaviour during the checks...

@larsbuntemeyer
Copy link
Author

I am little bit tempted to implement some backend CV check module purely in python that could be utilized by PrePARE. Json handling and NetCDF attribute handling has really become easy and flexible within python.

@mauzey1
Copy link
Collaborator

mauzey1 commented Jan 25, 2023

I am little bit tempted to implement some backend CV check module purely in python that could be utilized by PrePARE. Json handling and NetCDF attribute handling has really become easy and flexible within python.

@larsbuntemeyer That is actually an idea I have been contemplating. I was thinking of making one based on the xarray. We could discuss that more on the discussion page.

As for the current issue with PrePARE, I wonder how we could tell if we are using a CV other than CMIP6 CV. I guess we could grab the prefix of the CV file assuming that the name of the CV file will always have the format <MIP name>_CF.json. Knowing if the CV is not for CMIP6, we could skip the parent and sub experiment checks. However, I wonder if other MIPs also use parent and sub experiments.

@durack1 @matthew-mizielinski Do you know of any MIPs that use parent and sub experiments other than CMIP6?

@durack1
Copy link
Contributor

durack1 commented Jan 25, 2023

You folks are starting to get aligned with what we've been discussing. We are thinking of following a structure similar to
220714_mip-cmor-tables - note this has evolved a little, but is fine for discussion.

If we follow this structure, then the <MIPName>_CV.json could be assumed, therefore allowing an update to PrePARE and CMOR to work across multiple projects without having to implement per-project hacks.

Note, we are trying to glean as much information about CMOR usage in the draft CMOR Users Survey, so if we have more specific info we want to collect from the community as part of planning, then feel free to suggest augmentations/edits to the questions we have drafted

@larsbuntemeyer
Copy link
Author

larsbuntemeyer commented Jan 30, 2023

@larsbuntemeyer Could you provide some CORDEX dataset files that I could try to debug PrePARE with? Thank you.

@mauzey1 You can now download current CORDEX example datasets from artifacts, e.g., here: https://github.com/WCRP-CORDEX/cordex-cmip6-cmor-tables/actions/runs/4043896243

@matthew-mizielinski
Copy link

@durack1 @matthew-mizielinski Do you know of any MIPs that use parent and sub experiments other than CMIP6?

Parent information: yes, particularly where they are extending CMIP6, e.g. run ssp245 with sulphate injections to keep global
average temperature below a target.

sub-experiment id: yes, where there are initialised experiments such as those for the decadal/seasonal forecasting area.

@taylor13
Copy link
Collaborator

@matthew-mizielinski : Aren't those considered CMIP6 runs, even if they are not the DECK/historical runs?

@matthew-mizielinski
Copy link

@matthew-mizielinski : Aren't those considered CMIP6 runs, even if they are not the DECK/historical runs?

Sorry, I should have put examples. Both of the following are outside of CMIP6;

  • for the parent information I have a fork of the CMIP6 MIP tables that I used for the Met Office data being provided to the ARISE SAI project (variables the same, but changed to allow a different MIP era, activity id and experiment id)
  • for the sub-experiment id then SNAPSI have a small set of experiments with sub-experiment ids used for the start/initialisation dates -- in this case I think there is a short period (a few months) of data being required at 6 hourly frequency.

@mauzey1
Copy link
Collaborator

mauzey1 commented Feb 2, 2023

I'm currently making changes to PrePARE to make its checks more like those done when cmorizing datasets. Below is the relevant section of cmor.c that does those checks.

cmor/Src/cmor.c

Lines 3038 to 3062 in 304db22

if (cmor_has_cur_dataset_attribute(GLOBAL_ATT_INSTITUTION_ID) == 0) {
ierr += cmor_CV_setInstitution(cmor_tables[nVarRefTblID].CV);
}
if (cmor_has_cur_dataset_attribute(GLOBAL_IS_CMIP6) == 0) {
ierr += cmor_CV_checkSourceID(cmor_tables[nVarRefTblID].CV);
ierr += cmor_CV_checkExperiment(cmor_tables[nVarRefTblID].CV);
ierr += cmor_CV_checkFurtherInfoURL(nVarRefTblID);
ierr += cmor_CV_checkGrids(cmor_tables[nVarRefTblID].CV);
ierr += cmor_CV_checkParentExpID(cmor_tables[nVarRefTblID].CV);
ierr += cmor_CV_checkSubExpID(cmor_tables[nVarRefTblID].CV);
}
//
// Set user defined attributes and explicit {} sets.
//
ierr += cmor_CV_checkGblAttributes(cmor_tables[nVarRefTblID].CV);
//
// Copy block to ensure all attributes are set for obs4MIPs
// especially (source_label)
//
if ( cmor_current_dataset.furtherinfourl[0] != '\0') {
ierr += cmor_CV_checkSourceID(cmor_tables[nVarRefTblID].CV);
ierr += cmor_CV_checkFurtherInfoURL(nVarRefTblID);
}

The check for source ID is done if the GLOBAL_IS_CMIP6 variable is true, or if current dataset's further info URL is not an empty string. Should the source ID check be done by default? Isn't that attribute essential for all MIPs?

@larsbuntemeyer
Copy link
Author

larsbuntemeyer commented Feb 3, 2023

The check for source ID is done if the GLOBAL_IS_CMIP6 variable is true, or if current dataset's further info URL is not an empty string. Should the source ID check be done by default? Isn't that attribute essential for all MIPs?

That would also impact the cmorization itself, not only PrePARE, right? That would work for CORDEX with CMIP6. But i know that people are also using the old CVs with cmor3, in those, there is no source_id but it uses the old model_id vocabulary, so that might be a problem @sol1105 ?

@mauzey1
Copy link
Collaborator

mauzey1 commented Feb 3, 2023

@durack1 Why is CMOR checking if the further_info_url is not an empty string before checking the source ID? The comment above this section of code suggests that it is related to obs4MIPs even though obs4MIPs_CV.json doesn't contain anything for further_info_url. Should checking further_info_url be a CMIP6-only check?

cmor/Src/cmor.c

Lines 3055 to 3062 in 304db22

//
// Copy block to ensure all attributes are set for obs4MIPs
// especially (source_label)
//
if ( cmor_current_dataset.furtherinfourl[0] != '\0') {
ierr += cmor_CV_checkSourceID(cmor_tables[nVarRefTblID].CV);
ierr += cmor_CV_checkFurtherInfoURL(nVarRefTblID);
}

@durack1
Copy link
Contributor

durack1 commented Feb 4, 2023

@mauzey1 good question, the further_info_url (supported by ES-Docs, but I am unable to find a working example at this moment) is indeed supported for obs4MIPs (see https://github.com/PCMDI/obs4MIPs-cmor-tables/blob/master/inputs/pcmdi/RSS/RSS_prw_v07r01.json#L6), but all this hard coding is not ideal. We don't have further_info_url support for input4MIPs, so what would be the case for that project?

@mauzey1
Copy link
Collaborator

mauzey1 commented Mar 30, 2023

Getting back into working on this issue, I've dug deeper into how the further_info_url attribute is used by PrePARE. When PrePARE calls the function check_furtherinfourl, it will retrieve the further info URL template from the variable cmor_current_dataset.furtherinfourl.

cmor/Src/cmor_CV.c

Lines 373 to 377 in 8d45339

/* -------------------------------------------------------------------- */
/* Retrieve default Further URL info */
/* -------------------------------------------------------------------- */
strncpy(szFurtherInfoURLTemplate, cmor_current_dataset.furtherinfourl,
CMOR_MAX_STRING);

cmor_current_dataset.furtherinfourl was initialized with the default value for the URL template.
#define CMOR_DEFAULT_FURTHERURL_TEMPLATE "https://furtherinfo.es-doc.org/<mip_era><institution_id><source_id><experiment_id><sub_experiment_id><variant_label>"

cmor_current_dataset.furtherinfourl would either use the value of further_info_url stored in the CV file or in the user input if CMOR was being used to write data. However, PrePARE doesn't do this so it will just use the default template.

Another issue that I have noticed is the difference between the default value of the further info URL's template, and the value stored in the further_info_url attribute of CV files. CMIP6_CV.json has the following for this attribute.

        "further_info_url":[
            "https://furtherinfo.es-doc.org/.*"
        ]

Although this appears to be a regex for matching any URL that begins with https://furtherinfo.es-doc.org/, the following lines from cmor_CV_checkFurtherInfoURL will make the function just ignore checking the URL due to the lack of "<>" tokens.

cmor/Src/cmor_CV.c

Lines 379 to 385 in 8d45339

/* -------------------------------------------------------------------- */
/* If this is a string with no token we have nothing to do. */
/* -------------------------------------------------------------------- */
szToken = strtok(szFurtherInfoURLTemplate, "<>");
if (strcmp(szToken, cmor_current_dataset.furtherinfourl) == 0) {
return (0);
}

So even if PrePARE got the value for further_info_url from the CV, it would just be ignoring the URL if the value was not a template that had "<>" tokens like the default in cmor.h.

The check for the further info URL is pretty much hardwired for CMIP6, or at least any project that used the same template for further info URLs.

Should we place the further info URL check within the group of CMIP6 checks, which can be disabled by the PrePARE user?

In CMOR, the check_sourceID fucntion would be called if cmor_current_dataset.furtherinfourl is not an empty string.

cmor/Src/cmor.c

Lines 3055 to 3062 in 8d45339

//
// Copy block to ensure all attributes are set for obs4MIPs
// especially (source_label)
//
if ( cmor_current_dataset.furtherinfourl[0] != '\0') {
ierr += cmor_CV_checkSourceID(cmor_tables[nVarRefTblID].CV);
ierr += cmor_CV_checkFurtherInfoURL(nVarRefTblID);
}

This would mean that the checks for further info URLs and source_ids would be performed unless the user set further_info_url to be an emtpy string. I can understand doing this for the URL but not the source_id.

Shouldn't check_sourceID be done regardless of the further info URL, or whether or not the project is CMIP6?

@durack1
Copy link
Contributor

durack1 commented Mar 30, 2023

@mauzey1 thanks for continuing to chip away at these issues. further_info_url, which uses the es-doc URL, is only applicable to the CMIP6 and obs4MIPs project, within input4MIPs we define this as a URL pointing elsewhere, and for some of the earlier generated source_id entries these don't exist. I also believe that for the latest datasets, even obs4MIPs has deprecated the ES-Doc usage (@gleckler1 can confirm)

@durack1
Copy link
Contributor

durack1 commented Mar 30, 2023

Shouldn't check_sourceID be done regardless of the further info URL, or whether or not the project is CMIP6?

And this is exactly correct. The way that we have things set up, if we don't have a source_id defined (in CVs) that matches the dataset being written, then the program should throw a warning at minimum.

@taylor13 @matthew-mizielinski can confirm their agreement

@taylor13
Copy link
Collaborator

Here are some offline comments, I'm including in this thread. Hope it's the right thread.

The global attributes specifications for CMIP6 included:
“ further_info_url has the form https://furtherinfo.es-doc.org/<mip_era>.<institution_id>.<source_id>.<experiment_id>.<sub_experiment_id>.<variant_label> (e.g., https://furtherinfo.es-doc.org/CMIP6.CAS_FGOALS-g3.historical.none.r3i1p1f1). The further_info_url page will be maintained by the es-docs project and will simply be a rendering by the Viewer tool of information provided by modeling groups and recorded in so-called CIM documents.”

The PrePARE default value is consistent with this BUT DOESN’T LOOK LIKE IT INCLUDES THE DOT (PERIOD) SEPARATORS:
#define CMOR_DEFAULT_FURTHERURL_TEMPLATE https://furtherinfo.es-doc.org/<mip_era><institution_id><source_id><experiment_id><sub_experiment_id><variant_label>

I agree with Paul that we want to check source_id independent of further_info_url.

Reviewing some of the coding, it seems to me it is going to difficult to generalize it cleanly to handle other projects (obs4MIPs, input4MIPs, etc.). I have spent some time trying to write down what we would like to check with PrePARE and how we might code this in a general way to ingest simply formatted guidance from an input “configuration” file which would specify which attributes to check and how to check them. My ideas are contained in this document: https://docs.google.com/document/d/1JZcVRo2GucGUoZPeUNrmHmtB8LAs0QBAWiDuf0n3lSM/edit. You might also be interested in an earlier document I put together when PrePARE was originally being developed: https://docs.google.com/document/d/1d4_wdaY52xhLZBTeiTu2ZlpmpTO2Luqe47m2IX1WkWg .

I think it is worth considering how difficult it would be to implement the above from "scratch" (as opposed to modifying the current PrePARE code). I think it might be easier to start over and might take less time than trying to clean up and generalize the existing coding.

Perhaps we can all get together and you can propose what you think would be the best way to proceed.

@durack1
Copy link
Contributor

durack1 commented Apr 7, 2024

@mauzey1 it would be useful to update PrePARE alongside the last planned release of CMOR3 to enable consistency checking as much as possible - let's revisit this issue as the final changes for 3.9.0 are identified to see what can be achieved within available time/resources

@taylor13
Copy link
Collaborator

taylor13 commented May 6, 2024

There is a lot of stuff considered above. Not sure we should spend resources generalizing PrePARE to handle the generality of the 3.9.0 release. Might want to wait on this for integration into the CMOR 4.0 development. @mauzey1, you might be in the best position to estimate how big the job. What's your advice on this?

@durack1
Copy link
Contributor

durack1 commented May 6, 2024

@taylor13 I agree with this, if we have a "final" CV template that we can build PrePARE4 (along CMOR4) to work with, then I suggest that's the best path forward, rather than attempting to wrangle two codebases (C - CMOR, and Python - PrePARE) together.

@mauzey1 if you agree with this, we can tag this against the 4.0 milestone

@mauzey1
Copy link
Collaborator

mauzey1 commented May 6, 2024

@durack1 Yes, I agree with your suggestion of making this a CMOR4 milestone.

@durack1 durack1 modified the milestones: 3.9.0, 4.0/Future May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants