Clean-up of confusing meta data values #368

Closed
22 tasks
huard opened this issue Feb 14, 2020 · 10 comments

Comments

@huard
Collaborator

huard commented Feb 14, 2020

See bird-house/finch#80

Description

Indicators carry over metadata from the original file, but some of them may not be valid anymore (e.g. frequency) or make the output confusing. We should decide what to do with them (keep, change, update, remove).

For example, should we store the original attributes in a string?

CMIP5 attributes

  • branch_time
  • contact
  • Conventions
  • creation_date
  • experiment
  • experiment_id
  • forcing
  • frequency
  • initialization_method
  • institute_id
  • institution
  • model_id
  • modeling_realm
  • parent_experiment_id
  • parent_experiment_rip
  • physics_version
  • product
  • project_id
  • realization
  • source
  • table_id
  • tracking_id
@Zeitsperre Zeitsperre self-assigned this Feb 14, 2020
@Zeitsperre Zeitsperre added enhancement New feature or request standards / conventions Suggestions on ways forward labels Feb 20, 2020
@Zeitsperre Zeitsperre modified the milestones: v0.15, v0.16 Feb 20, 2020
@Zeitsperre Zeitsperre modified the milestones: v0.16, v0.17 Apr 23, 2020
@huard huard modified the milestones: v0.17, v0.18, v0.19 May 12, 2020
@Zeitsperre Zeitsperre modified the milestones: v0.19, v0.20 Aug 5, 2020
@Zeitsperre
Collaborator

@Ouranosinc/xclim-core We need to have a meeting about this specifically.

@huard
Collaborator Author

huard commented Sep 1, 2020

@Zeitsperre Please propose a date, time and agenda.

@Zeitsperre
Collaborator

Zeitsperre commented Sep 1, 2020

@Ouranosinc/xclim-core
Proposed talking points (feel free to modify this message as needed):

  • What metadata do we presently affix into Indicators/indices?
  • How do we currently handle external metadata?
  • Which metadata standards and fields should we provide explicit support for?
    • How should we handle exceptions to these standards/accepted fields?
  • What fields should we be adding?
    • How can these fields support current projects?

@Zeitsperre
Collaborator

Some thoughts:

  1. CF-standard global attributes should not be found in variable attributes; when they are, it is an error, or at least misleading. I would remove them if/when they are found. This complicates how we write our history. From the summary write-up of CF-1.9 (proposed):

    2.6.1 Identification of Conventions
    Requirements:
    > The Conventions attribute must be a single text string containing a list of convention names separated by blank space or commas, one of which shall be the full CF string as described below.
    > Files that conform to the CF version 1.8 conventions must indicate this by setting the global Conventions attribute to contain the CF string value "CF-1.8".

    2.6.2 Description of File Contents
    Requirements:
    > The title, history, institution, source, references, and comment attributes are all type string.
    Recommendations:
    > The title and history attributes are only defined as global or groups attributes. If they are used as per variable attributes a CF compliant application should treat them exactly as it would treat any other unrecognized attribute.

  2. We should be following the most recent (stable) CF-Conventions guidelines. It's up to the users to ensure they adjust their practices to adhere to more modern standards.

  3. Terms that fall outside the standard should be carried over as much as possible, excepting fields that would provide confusing information (creation_date is not a useful attribute when the indicator is newly created; we should be updating this).
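
To make the quoted 2.6.1 requirement concrete, here is a minimal sketch of a check on a global Conventions string. The function name `declares_cf` and the `CF-<major>.<minor>` pattern are assumptions made for illustration; nothing like it is implied to exist in xclim.

```python
import re
from typing import Optional


def declares_cf(conventions: str, version: Optional[str] = None) -> bool:
    """Return True if a Conventions string names a CF version.

    Hypothetical helper: splits on blank space or commas, as the quoted
    requirement describes, then looks for a "CF-x.y" entry.
    """
    names = [n for n in re.split(r"[,\s]+", conventions.strip()) if n]
    if version is not None:
        # Look for one specific version, e.g. "1.8" -> "CF-1.8".
        return f"CF-{version}" in names
    return any(re.fullmatch(r"CF-\d+\.\d+", n) for n in names)
```

For instance, `declares_cf("CF-1.8, ACDD-1.3")` is true, while `declares_cf("CF-1.6", version="1.8")` is false.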

Attributes handled within variables:

  • branch_time --> Carried
  • contact --> Carried
  • Conventions --> Removed
  • creation_date --> Updated
  • experiment --> Carried
  • experiment_id --> Carried
  • forcing --> Carried
  • frequency --> Removed
  • history --> Becomes notes
  • initialization_method --> Carried
  • institute_id --> Carried
  • institution --> Carried
  • model_id --> Carried
  • modeling_realm --> Carried
  • parent_experiment_id --> Carried
  • parent_experiment_rip --> Carried
  • physics_version --> Carried
  • product --> Carried
  • project_id --> Carried
  • realization --> Carried
  • source --> Carried
  • table_id --> Carried
  • title --> Removed
  • tracking_id --> Carried
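
The handling proposed in this list could be sketched as a small policy applied to a plain attribute dictionary. This is only an illustration of the proposal as written here, not what xclim does: `clean_attrs`, the three policy tables, and the timestamp format are all hypothetical.

```python
from datetime import datetime, timezone

# Policy taken from the list above; every attribute not named here is carried.
REMOVED = {"Conventions", "frequency", "title"}
UPDATED = {"creation_date"}
RENAMED = {"history": "notes"}


def clean_attrs(attrs, now=None):
    """Return a copy of `attrs` with the proposed policy applied (hypothetical)."""
    now = now or datetime.now(timezone.utc)
    out = {}
    for key, value in attrs.items():
        if key in REMOVED:
            continue  # dropped from the output variable
        if key in UPDATED:
            # Assumed timestamp format, for illustration only.
            out[key] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
        elif key in RENAMED:
            out[RENAMED[key]] = value
        else:
            out[key] = value  # carried over unchanged
    return out
```

Keeping the policy in three small tables means a change of decision (for example, updating frequency instead of removing it) only touches one line.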

Given that Indicators don't touch global attributes (AFAIK), should the onus be on users to ensure that they write out the proper scaffolding (history and Conventions) when processing their files? The one case I can see where this might present problems is when it comes to using the CLI (global attributes are not carried over to the output file). Is addressing this opening a can of worms?

Feel free to chime in with ideas, opinions or potential "gotchas". I'm starting a PR to address some of these problems next week.

@huard
Collaborator Author

huard commented Sep 11, 2020

Shouldn't frequency reflect the output's frequency?
Not a fan of "notes" for history, but I can live with it if there is a rationale for it. We already use notes for the math formula in the global attributes.

I note that history adds a <No available history>\n if the input file has no history. I don't think this is useful.

We only get the global attributes if we call the json method, but there is stuff in there that would not go in global attributes. I think we should come up with a clean way to convert a computation into a dataset that includes global attributes, as it would clarify some of the issues raised here.

@aulemahal
Collaborator

aulemahal commented Sep 11, 2020

To help the discussion, here are all indicator attributes and their translation.

Carried to variables (unique to each output):

  • standard_name
  • long_name (formatted)
  • units
  • cell_methods (merged with those from the inputs)
  • description (formatted)
  • comment
  • history (generated and merged with those from the inputs)

Not carried to variables (available through Ind.json, unique to the indicator):

  • title
  • abstract (no corresponding CF attr)
  • keywords (no corresponding CF attr)
  • references
  • notes (could be used as the global comment CF attr)

I would prefer keeping history in the variable attributes, unless we make the Indicator produce datasets instead of DataArrays. It could be useful to have a method that transforms an Indicator output to a dataset, that was the idea behind the dropped "dataset_output" option we tried when implementing the multi-output Indicators. Without this option, the process has to be in two parts with the user referencing the indicator twice, ex:

out = xc.atmos.Indicator(in1, in2, *params)  # Computes data and adds variable attributes
ds = xc.out_to_ds(out, xc.atmos.Indicator)  # Moves "history" and adds global attributes

About global attributes: I'm not sure the generic Indicator "title" really is fit as a global attribute of a computed dataset. Isn't it generally too general about the used parameters while being too specific about the indicator? Also, notes (-> comment) is usually long and has a docstring layout that might not be fit for a netCDF attribute? (rst and tex markup, multiple lines) That leaves, "references" (often absent) + moving history, does that merit a new function?

@Zeitsperre
Collaborator

Zeitsperre commented Sep 11, 2020

> I would prefer keeping history in the variable attributes, unless we make the Indicator produce datasets instead of DataArrays.

As it stands, CF dictates that having history in a variable's attributes is non-standard. This is true even for reanalysis datasets that don't typically follow CF. If we had the option to produce Datasets as well as DataArrays, we

> It could be useful to have a method that transforms an Indicator output to a dataset, that was the idea behind the dropped "dataset_output" option we tried when implementing the multi-output Indicators. Without this option, the process has to be in two parts with the user referencing the indicator twice, ex:
>
> out = xc.atmos.Indicator(in1, in2, *params)  # Computes data and adds variable attributes
> ds = xc.out_to_ds(out, xc.atmos.Indicator)  # Moves "history" and adds global attributes

I like where this proposal is going, but I can see the problem. It would be interesting to be able to send a Dataset to Indicators (e.g. xc.atmos.corn_heat_units(*, ds: Optional[xr.Dataset], da: Optional[xr.DataArray], tas="tas", ...), etc.). It breaks all the function conventions we currently use, but we are currently breaking some CF conventions, so... worth considering for a serious breaking version (v1.0?).

Maybe we can extend xarray's to_netcdf methods to look for the presence of an xclim Indicator?

> About global attributes: I'm not sure the generic Indicator "title" really is fit as a global attribute of a computed dataset. Isn't it generally too general about the used parameters while being too specific about the indicator? Also, notes (-> comment) is usually long and has a docstring layout that might not be fit for a netCDF attribute? (rst and tex markup, multiple lines) That leaves, "references" (often absent) + moving history, does that merit a new function?

Titles are usually descriptive of the source data, e.g.

// global attributes:
		:title = "IPSL  model output prepared for IPCC Fourth Assessment SRES A2 experiment" ;

I think the history information is perfectly fine, but we need to prefix it to the global attributes' history, which can be quite long (at least for some heavily corrected data). I do think we need to be able to touch/modify global attributes. If we have that capability, everything becomes a lot easier to standardize.
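
As a rough sketch of what prefixing to the global history could look like, the helper below puts a new timestamped entry on top of an existing history string. `prepend_history` and the bracketed timestamp format are assumptions for illustration, not xclim's actual behaviour. When there is no prior history it returns only the new entry, with no placeholder line.

```python
from datetime import datetime, timezone


def prepend_history(global_history, entry, now=None):
    """Return a history string with `entry` placed first (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    # Assumed entry layout: "[YYYY-MM-DD HH:MM:SS] <what was done>".
    line = f"[{now:%Y-%m-%d %H:%M:%S}] {entry}"
    if not global_history:
        # No existing history: keep only the new entry, no filler text.
        return line
    return f"{line}\n{global_history}"
```

Used on a dataset, the result would be written back to the global history attribute, so that the most recent operation reads first.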

@huard
Collaborator Author

huard commented Sep 11, 2020

Note that some of the fields are intended to feed into WPS process descriptions:

  • Title | Title of the process, input, and output. Normally available for display to a human. | ows:Title | One (mandatory)
  • Abstract | Brief narrative description of a process, input, and output. Normally available for display to a human. | ows:Abstract | Zero or one (optional) Include when available and useful.
  • Keywords | Keywords that characterize a process, its inputs, and outputs. | ows:Keywords | Zero or more (optional) Include when available and useful.
  • Identifier | Unambiguous identifier of a process, input, and output. | ows:Identifier Value is a URI or HTTP-URI a | One (mandatory)
  • Metadata | Reference to additional metadata about this item. | ows:Metadata Allowed values are specified in Table 5. | Zero or more (optional)

@huard
Collaborator Author

huard commented Sep 11, 2020

We could simply have Indicator.to_dataset(da, thresh=...), which could run __call__ internally and then create the dataset.

@aulemahal
Collaborator

As #559 is merged, moving the rest of this to a later milestone.

@aulemahal aulemahal modified the milestones: v0.20, v0.22 Sep 18, 2020
@huard huard removed this from the v0.22 milestone Oct 5, 2020
@Zeitsperre Zeitsperre added this to the v1.0 milestone Jan 4, 2021
@huard huard closed this as completed Jan 10, 2022