enhancement for forecast data #623
Comments
Hello @taylor13: thanks for raising this and mapping the requirements accurately into the CMOR framework. This looks good. It may be worth adding that the "reftime" coordinate should have Can we make it a requirement that "reftime" has the same I think "reftime" also needs a The requirement for "leadtime" comes, I believe, from the need to work with tools which can aggregate data across multiple files based on coordinates that are explicitly represented, as "leadtime" will be, but which are not currently able to work with a "leadtime" which is defined implicitly. It occurs to me that this redundancy is avoided in vertical coordinates by using
It will be useful to engage @matthew-mizielinski, @mauzey1 and @piotrflorek on this issue.
Hi All, After a little thought I think I understand this as a step towards datasets containing a single variable for an ensemble of initialised simulations with different start dates. If we consider the case where we have a decadal forecasting ensemble with members started every year from 2000 to 2005, each running for 10 years, then we could imagine a single file with an ensemble dimension to cover a set period, say 2005-2010, at which point the resulting files would look something like (ignoring bounds variables and global attributes)
In this case the reftime would contain the equivalent of Adding It would also be possible to construct examples where there is a significant amount of missing data, e.g. if the file in question covered 2000-2005 with the same start dates for the realizations then around half of the data variable would be missing data and the corresponding If I've interpreted this correctly then the document that @taylor13 has linked to above is not recommending the above structure, but a step towards it, with only a single ensemble member per file and as a result the I think an argument could be made for retaining the realization as a dimension if only to allow CMOR to support the general case.
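The missing-data case described above can be illustrated with a small sketch. It assumes, hypothetically, that lead time becomes a two-dimensional (realization, time) quantity, that one realization starts in each year from 2000 to 2005, and that a realization only contributes data from its start year onward. The variable names are illustrative and are not part of CMOR.

```python
# Hypothetical layout: one realization per start year, file covers 2000-2005.
start_years = list(range(2000, 2006))
file_years = list(range(2000, 2006))

# leadtime[r][t] is the number of whole years since realization r started,
# or None where that realization had not yet started (missing data).
leadtime = [
    [year - start if year >= start else None for year in file_years]
    for start in start_years
]

total = len(start_years) * len(file_years)
missing = sum(value is None for row in leadtime for value in row)
missing_fraction = missing / total  # share of the (realization, time) grid with no data
```

Under these assumptions 15 of the 36 (realization, time) cells are empty, which is in line with the "around half" estimate in the comment.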
@matthew-mizielinski: regarding your last two paragraphs: yes, this is a recommendation for including additional CF metadata about
In that case, I think we shouldn't try to extend CMOR to handle multiple simulations (start times) in a single file; that would require major changes.
Thanks @martinjuckes, I've had a chat with Leon about this (this was the initial motivation for us getting more involved in CMOR). Given that there is a tool for ensemble combination and @taylor13's suggestion that major changes would be needed, we'll start with the single ensemble member case. @taylor13, @durack1, if you are happy with the following I'll leave @piotr-florek-mo and @mauzey1 to start discussing API details. We need to add the following functionality:
@matthew-mizielinski an additional time axis is a fairly major change, but it will certainly aid the use of CMOR for forecast production. It would be nice to build out the CI testing as this work begins; I have a feeling that unless we have pretty good code coverage, some quirks could be introduced. I am sure that @taylor13 will have some insights.
@martinjuckes (cc @matthew-mizielinski @piotr-florek-mohc)
I apologize that there may be some delay before I can check this out in detail. I also apologize that you have had to figure out how to do this without much background from the original code writers. Perhaps it is not too late to be helpful in that regard.
CMOR's purpose is to facilitate writing model output conforming to a project's data specifications. It is most useful for community modeling activities, similar to CMIP, where multiple groups wish to share model output that can be read in a common way. The idea is that many of the descriptive elements that comprise much of the CMIP metadata are the same for all models and can therefore be recorded in tables that CMOR can access. The user needs to tell CMOR which entry in the table is relevant, but otherwise should only need to provide CMOR with the actual model data being written and certain model-specific information (e.g., model name, model grid information, model calendar). By relying on tables provided by CMIP, data writers are prevented from introducing errors regarding, for example, the description of variables (standard_name, cell_methods, etc.) and certain global attributes.
CMOR is also meant to check that the information the user must supply follows certain conformance rules (e.g., is of the correct data type and structure, is ordered correctly, is self-consistent, etc.). Finally, CMOR can transform input data in various ways to make it conform to the data specs (e.g., scale data to make it consistent with the requested units, restructure data so that the domain spanned and the direction of the coordinates are consistent with requirements, modify (if necessary in the case of time-mean data) the time coordinate values to conform to the CMIP rule that they should be at the middle of the averaging periods, etc.).
In its original release, CMOR performed its checks only before writing the data and informed the user when it encountered any problems/errors. In later releases, a CMOR conformance checker was built that could perform some (but not all) of CMOR's checks on data that had been written without CMOR. I am less familiar with the "checker" than I am with CMOR (original).
As we extend CMOR to handle forecast data, we could simply enable users to add the additional "reftime" global attribute and the "leadtime" time coordinate (with appropriate variable attributes). As I understand it, this has now been accomplished (although perhaps not fully tested yet). If that is all that has been done, that may be sufficient, but I can envision CMOR being more helpful both to data writers and to the projects they contribute to if CMOR were more ambitious. [Again, I have not yet looked in detail at the changes already implemented, so some of the following may already be implemented.] Resources permitting, ideally:
To be more specific,
We could also consider explicitly including the calendar attribute for reftime, but I would recommend that we simply state that the calendar for reftime must be the same as the calendar for the time coordinate. CMOR should perform a number of checks, including alerting the user when:
Note, there is an inquiry at #623 (comment) asking for clarification of what the requirements are for forecast time. I realize that making CMOR easy for others to use and helping guarantee that forecast data requirements are met may not be possible given resource constraints, but I hope it is. Otherwise, what has already been implemented in this regard will have to do (and perhaps that is more than I am aware of).
Just realized, I left out the requirement that the file name should also include the reference date, so this too could be generated automatically by CMOR. Take the above as trying to cover the bases and make it easier for multiple groups to use CMOR for forecast data. It's possible that a quasi-independent call to a single "subroutine" that is pretty much independent of CMOR could accomplish this (without getting into the guts of CMOR), which would be easier to code and easy enough for users, if not fully "integrated" with the workings of CMOR. I'm sure you've already considered these things.
Also, it would be nice if PrePARE (the CMOR checker) were able to check that these new forecast requirements had been met.
Hi Karl,
I should say that I appreciate your strategy of implementing changes in the way that would likely minimize the chance of causing some subtle conflicts with the rest of CMOR. That was sensible and smart. So the idea of not trying to integrate them with the rest of the structure might still be the best way to go. But thanks for thinking about some other options.
Hi all,
?
and in
I hadn't thought about the extra variables needed. We could, of course, define two tables (e.g., "Amon" and "Amonforecast"), which would be identical except for the value of the "forecast" flag. In Amon it would be forecast=0 for all variables, whereas in Amonforecast it would invariably be forecast=1.
@piotr-florek-mohc and I have had a think about this and have come to the conclusion that the presence of @taylor13, are you happy with this or would you prefer that the The changes are still going to be a little rough around the edges within CMOR as we are trying to avoid being too disruptive to the existing code as this could easily take a significant amount of time for limited benefit. I am open to revising the approach once we understand the implications of the changes, i.e. feed our first approach into the next version of CMOR and then revise in the new year following experience of its use. When we start looking forward to CMIP7 I'm not 100% sure what the impacts are likely to be if we have some data with these forecast time coordinates (e.g. DCPP-like) and some without (e.g. the DECK & Scenarios), so we might need to consult about this functionality being used within CMIP in the future. I could see us effectively having one set of tables with the forecast coordinates and another without, at which point we'd need to think about how best to maintain this. We could use separate entries in the same table as @wachsylon has suggested, but I am not fond of the situation we have at the moment with multiple variables pointing at the same
As you say, a "forecast" flag isn't strictly needed in the CMOR table, because it can be inferred from the dimensions specification, and I too prefer a separate table for variables that are requested for forecasts. Eliminating the "forecast" flag from the table does, as you say, not change its structure, so that's a bit less disruptive. Two questions though:
Hi @taylor13. Regarding 1: @piotr-florek-mohc, could you comment here? Regarding 2: I can imagine needing something like this if we attempt to combine the data from multiple forecasts within an ensemble into a single object. I recall seeing something where an ensemble of datasets was concatenated along a dimension with an associated ensemble coordinate. In this case we'd be looking at a non-scalar This feels like quite a big step to take with CMOR, and my instinct is that this would be for a downstream tool to handle rather than attempting to push all data through CMOR at once. From a workflow perspective it would be far quicker to push 10 datasets through CMOR separately in parallel and then carefully combine the results rather than push all data through a serial process. This isn't to say it couldn't be done, but I think until we have a requirement or clear motivation for this I would be tempted to wait.
Hi Karl,
The latter (although I did it in a different way in the original implementation: I was checking for the presence of the reference time variable within ncdf). The main problem I have is that CMOR requires some preparatory work to set up variables (and it is during this step that MIP tables are interrogated and metadata pulled and ingested), and this preparatory work might or might not include the leadtime variable, depending on how much of this we would like to automate. (Also, if "leadtime" is mentioned in the "dimensions" string, CMOR would initially still expect it to be a real, non-auxiliary coordinate, increasing the number of temporal dimensions; so this needs to be tweaked to make both "reftime" and "leadtime" a special case for which this behaviour would be disabled.)
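The special-casing described above can be sketched as follows. This is only an illustration of the idea, not CMOR's implementation: the helper names `split_dimensions` and `is_forecast_variable`, and the example dimensions string, are hypothetical. It assumes the "dimensions" entry is a whitespace-separated string and that "reftime" and "leadtime" are reserved names that must not be counted as real axes.

```python
# Names reserved for the forecast special case (assumption based on the discussion).
SPECIAL_FORECAST_COORDS = ("reftime", "leadtime")


def split_dimensions(dimensions):
    """Split a whitespace-separated dimensions string into real axes and
    the special forecast coordinates that must not be treated as axes."""
    names = dimensions.split()
    axes = [name for name in names if name not in SPECIAL_FORECAST_COORDS]
    special = [name for name in names if name in SPECIAL_FORECAST_COORDS]
    return axes, special


def is_forecast_variable(dimensions):
    """Infer the forecast case from the dimensions string alone."""
    _, special = split_dimensions(dimensions)
    return len(special) > 0


# Hypothetical table entry for a forecast variable.
axes, special = split_dimensions("longitude latitude time reftime leadtime")
```

With this split, only `axes` would contribute to the count of dimensions, so the number of temporal dimensions stays at one even when "leadtime" appears in the string.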
Thanks to both of you. I'm comfortable with your approach (not that you should care too much about that). If I understand what you say, @piotr-florek-mohc, for CMOR not to become confused, we must reserve "reftime" and "leadtime" for use as coordinate variables for the special case we're considering. I think that constraint on their use is acceptable. Or did you have something else in mind that you wanted me to comment on?
Hi @taylor13 ,
after a short discussion with @matthew-mizielinski we came up with a more elegant solution, that is marking
Nice! Would this be backward compatible with old coordinate tables that would have no such flag entries? (i.e., in the coordinate tables, would the flag be optional or omitted for all other types of coordinates?)
Hi @taylor13 One of the complications is that
Looking over the new features added to CMOR in #634, I've noticed that there is a function in cmor_func_def.h that is not defined anywhere. Line 111 in 847179f
Is the function I've also noticed that there is a Python interface for the function but not for Fortran. Should we have one?
I'm guessing this issue can be closed since the lead time coordinate has already been added to CMOR via #634. Please reopen if further work is needed.
The decadal prediction community have proposed an extension of CMIP6 standards to accommodate additional metadata when forecasts are stored by CMOR. Their proposal can be found at https://docs.google.com/document/d/1T-Xlkc07kzDbtyFp-JJiRQYmHO-OzFOgXjCvFGW5M7Y/edit .
From a brief look at that document, it appears that it will be a non-trivial enhancement that might include:
(See https://goo.gl/neswPr for further description of the CMIP6 requirements.)
For forecast data the "coordinates" attribute should include "reftime" and "leadtime" in its list. The "reftime" is a scalar coordinate and can probably be treated by CMOR just like the other scalar coordinates except it probably should not have a default value specified in the CMOR table. Also, CMOR may be currently limited to handling a single scalar coordinate for a given variable (not sure about this), so this might be an extension.
Storing the "leadtime" will likely require more work. This auxiliary coordinate variable must be the same size (length) as "time" and contains the time elapsed since the start of the forecast, measured in days. It is equivalent to "time" except that the reference time for "time" is specified in the time coordinate's units attribute, whereas the reference time for "leadtime" is stored in "reftime". These two reference times will generally be different, so the "leadtime" will be offset from "time" by a constant amount. CMOR could generate the "leadtime" values, given the "time" coordinate values, by simply subtracting from "time" the difference between "reftime" and the reference time stored in the units attribute of "time". (Given this close and known relationship between leadtime and time, I wonder why "leadtime" is needed; wouldn't "reftime" be sufficient?)
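The subtraction described above can be written out as a minimal sketch. It assumes "time" is expressed in days since its own reference date and uses Python's standard `datetime`, which only handles the proleptic Gregorian calendar; model calendars such as 360_day or noleap would need a calendar-aware library instead. The function name `leadtime_from_time` and the example dates are hypothetical and not part of CMOR.

```python
from datetime import datetime


def leadtime_from_time(time_values, time_units_ref, reftime):
    """Return lead times in days for each entry of time_values.

    time_values    -- "time" coordinate values, in days since time_units_ref
    time_units_ref -- reference date from the units attribute of "time"
    reftime        -- forecast reference time (start of the forecast)
    """
    # Constant offset between the two reference times, in days.
    offset_days = (reftime - time_units_ref).total_seconds() / 86400.0
    return [t - offset_days for t in time_values]


# Hypothetical example: "time" is in days since 2000-01-01 and the
# forecast was initialised on 2000-11-01 (305 days later).
time_units_ref = datetime(2000, 1, 1)
reftime = datetime(2000, 11, 1)
time_values = [305.5, 306.5, 307.5]  # daily means, mid-points of each day
leadtime = leadtime_from_time(time_values, time_units_ref, reftime)
```

In this example every lead time is the matching "time" value minus 305 days, so the first daily mean has a lead time of half a day.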
The format for "sub_experiment" has been extended to include the "day" as well as year and month when the experiment was initiated. This seems like a minor change that should be easy to accommodate.
There are additional global attributes called for that include free text supplied by the user. I think CMOR can already handle these.
I'll try to get Paco to look this over and see if I've covered the requirements. Also, perhaps he can explain the rationale for including "leadtime".