Challenge 31 - Advance user capabilities to handle data constraints when using CDSAPI #9

Open
RubenRT7 opened this issue Feb 20, 2024 · 8 comments
Assignees
Labels
ECMWF New feature or request Software Development Software development for Earth Sciences applications

Comments

@RubenRT7
Contributor

RubenRT7 commented Feb 20, 2024

Challenge 31 - Advance user capabilities to handle data constraints when using CDSAPI

Stream 3 - Software Development for Earth Sciences applications

Goal

Create a Python library that lets users embed additional intelligence in their scripts to handle CDS dataset constraints, improving the accuracy of requests submitted via cdsapi.

Mentors and skills


Challenge description

Problem: Currently, constraints are only useful to users of the interactive web download form. Constraints manage which combinations are available while the user fills in the form, guiding users towards valid requests by activating or deactivating options in the widgets. These constraints are exposed via cdsapi but are hidden from users and not documented. As a result, the CDS processes many user requests that are wrong in scope and ultimately fail. This is good neither for the users nor for the system.

Data/System to be used: This challenge requires only a Python development environment, an account on the CDS (https://cds.climate.copernicus.eu/) and the cdsapi (https://cds.climate.copernicus.eu/api-how-to).

Solution: A Python library that accesses the constraints definition for a given dataset via CDSAPI and decodes it on the client side, allowing users to perform different actions:
- Get information about the scope and definition of a dataset via api (variables, time ranges, ...)
- Automatise the definition of a valid set of requests before submission via api.
- Implement automatic checks of data availability to trigger submission of requests (e.g. for data updated periodically)
- Check the validity of a request before submission via api.

Ideas for implementation: these have been introduced in the previous paragraphs. Mentors will help participants configure their accounts and cdsapi, understand the constraints definition file (JSON) and the system, provide guidance on datasets, and refine the functional scope of the requirements.

The resulting library will be put in the hands of cdsapi users to give them broader visibility of the real availability of data, allowing more accurate requests. On the one hand, this will make users more efficient when accessing the system; on the other, it will reduce unnecessary request traffic to the system. This feature will extend the capabilities of the new CDS Engine and API.

@EsperanzaCuartero EsperanzaCuartero changed the title Challenge 09 - Advance user capabilities to handle data constraints when using CDSAPI Challenge 14 - Advance user capabilities to handle data constraints when using CDSAPI Feb 22, 2024
@EsperanzaCuartero EsperanzaCuartero added the Software Development Software development for Earth Sciences applications label Feb 22, 2024
@EsperanzaCuartero EsperanzaCuartero changed the title Challenge 14 - Advance user capabilities to handle data constraints when using CDSAPI Challenge 31 - Advance user capabilities to handle data constraints when using CDSAPI Feb 23, 2024
@RubenRT7 RubenRT7 added the ECMWF New feature or request label Mar 7, 2024
@cataalbu

cataalbu commented Apr 7, 2024

Hi! What do the constraints look like? Also, can two datasets from the same family (like ERA5, for example) have different constraints?

@ecmwf-cobarzan

ecmwf-cobarzan commented Apr 7, 2024

Hi Catalin,

What do the constraints look like?

Constraints are represented as JSON files. In principle, a constraints file (one per dataset) is a list of dictionaries having:

  • as keys: widget names (aka dimensions in the data cubes of available data; more details below)
  • as values: the list of covered values for a widget.

Here is an example (for use in the context of this challenge only):

[
{"source": ["anthropogenic"], "version": ["latest", "v4.2"], "variable": ["acetylene", "acids", "alcohols", "ammonia", "benzene", "black_carbon", "butanes", "carbon_dioxide", "carbon_dioxide_excl_short_cycle", "carbon_monoxide", "chlorinated_hydrocarbons", "esters", "ethane", "ethene", "ethers", "formaldehyde", "hexanes", "isoprene", "ketones", "methane", "monoterpenes", "nitrogen_oxides", "non_methane_vocs", "organic_carbon", "other_aldehydes", "other_alkenes_alkynes", "other_aromatics", "other_vocs", "pentanes", "propane", "propene", "sulphur_dioxide", "toluene", "trimethylbenzenes", "xylenes"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020"]},
{"source": ["anthropogenic"], "version": ["v2.1"], "variable": ["acetylene", "acids", "alcohols", "ammonia", "benzene", "black_carbon", "butanes", "carbon_dioxide", "carbon_monoxide", "chlorinated_hydrocarbons", "esters", "ethane", "ethene", "ethers", "formaldehyde", "hexanes", "isoprene", "ketones", "methane", "monoterpenes", "nitrogen_oxides", "non_methane_vocs", "organic_carbon", "other_aldehydes", "other_alkenes_alkynes", "other_aromatics", "other_vocs", "pentanes", "propane", "propene", "sulphur_dioxide", "toluene", "trimethylbenzenes", "xylenes"], "year": ["2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019"]},
{"source": ["anthropogenic"], "version": ["v2.1"], "variable": ["acetylene", "acids", "ammonia", "benzene", "black_carbon", "butanes", "carbon_dioxide", "carbon_monoxide", "chlorinated_hydrocarbons", "esters", "ethane", "ethene", "ethers", "formaldehyde", "hexanes", "isoprene", "ketones", "methane", "monoterpenes", "nitrogen_oxides", "non_methane_vocs", "organic_carbon", "other_aldehydes", "other_alkenes_alkynes", "other_aromatics", "other_vocs", "pentanes", "propane", "propene", "sulphur_dioxide", "toluene", "trimethylbenzenes", "xylenes"], "year": ["2003", "2004", "2005", "2006", "2007", "2008", "2009"]},
{"source": ["aviation"], "version": ["latest", "v1.1"], "variable": ["acetylene", "alcohols", "ammonia", "benzene", "black_carbon", "carbon_dioxide", "carbon_monoxide", "ethane", "ethene", "formaldehyde", "hexanes", "ketones", "nitrogen_oxides", "non_methane_vocs", "organic_carbon", "other_aldehydes", "other_alkenes_alkynes", "other_aromatics", "other_vocs", "pentanes", "propane", "propene", "sulphur_dioxide", "toluene", "trimethylbenzenes", "xylenes"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020"]},
{"source": ["biogenic"], "version": ["latest", "v3.0", "v3.1"], "variable": ["acetaldehyde", "acetic_acid", "acetone", "alpha_pinene", "beta_pinene", "butanes_and_higher_alkanes", "butenes_and_higher_alkenes", "carbon_monoxide", "ethane", "ethanol", "ethene", "formaldehyde", "formic_acid", "hydrogen_cyanide", "isoprene", "methane", "methanol", "methyl_bromide", "methyl_chloride", "methyl_iodide", "other_aldehydes", "other_ketones", "other_monoterpenes", "propane", "propene", "sesquiterpenes", "toluene"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019"]},
{"source": ["biogenic"], "version": ["v1.1"], "variable": ["acetaldehyde", "acetic_acid", "acetone", "butanes_and_higher_alkanes", "butenes_and_higher_alkenes", "carbon_monoxide", "ethane", "ethanol", "ethene", "formaldehyde", "formic_acid", "hydrogen_cyanide", "isoprene", "methane", "methanol", "other_aldehydes", "other_ketones", "other_monoterpenes", "pinene", "propane", "propene", "sesquiterpenes", "toluene"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015"]},
{"source": ["biogenic"], "version": ["v1.2"], "variable": ["acetaldehyde", "acetic_acid", "acetone", "butanes_and_higher_alkanes", "butenes_and_higher_alkenes", "carbon_monoxide", "ethane", "ethanol", "ethene", "formaldehyde", "formic_acid", "hydrogen_cyanide", "isoprene", "methane", "methanol", "other_aldehydes", "other_ketones", "other_monoterpenes", "pinene", "propane", "propene", "sesquiterpenes", "toluene"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019"]},
{"source": ["oceanic"], "version": ["latest", "v3.1"], "variable": ["bromoform", "dibromomethane", "dimethyl_sulphide", "iodomethane"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019"]},
{"source": ["oceanic"], "version": ["v2.1"], "variable": ["bromoform", "dibromomethane", "dimethyl_sulphide", "iodomethane"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018"]},
{"source": ["shipping"], "version": ["latest", "v2.1"], "variable": ["ash", "carbon_dioxide", "carbon_monoxide", "elemental_carbon", "nitrogen_oxides", "organic_carbon", "sulphate", "sulphur_oxides", "vocs_all"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018"]},
{"source": ["soil"], "version": ["latest", "v2.2"], "variable": ["nitrogen_oxides"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018"]},
{"source": ["soil"], "version": ["v1.1"], "variable": ["nitrogen_oxides"], "year": ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015"]},
{"source": ["termites"], "version": ["latest", "v1.1"], "variable": ["methane"], "year": ["2000"]}
]

corresponding to this dataset. Beware, though, that constraints can evolve over time.

Each such dictionary (i.e. a constraint) represents a complete data cube, i.e. all possible combinations of widget values in it correspond to existing data granules. The list of all constraints covers the universe of available data for a dataset.

Take for example the last constraint in the example above (i.e. last dictionary). Source, version, variable and year are the widget names/dimensions. The available data granules are (termites, latest, methane, 2000) and (termites, v1.1, methane, 2000).
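To make the data-cube reading concrete, here is a minimal sketch that expands one constraint into its granules and checks whether a single-valued selection is covered. It assumes the constraints have already been loaded into a Python list of dictionaries; the helper names `granules` and `is_available` are hypothetical and not part of cdsapi.

```python
from itertools import product


def granules(constraint):
    """Yield every combination of widget values in one constraint (data cube)."""
    names = list(constraint)
    for combo in product(*(constraint[name] for name in names)):
        yield dict(zip(names, combo))


def is_available(selection, constraints):
    """Return True if a single-valued selection lies inside some constraint.

    Assumption: the selection must use exactly the constraint's dimensions.
    """
    return any(
        set(selection) == set(c) and all(selection[k] in c[k] for k in c)
        for c in constraints
    )


# The last constraint from the example above.
constraints = [
    {"source": ["termites"], "version": ["latest", "v1.1"],
     "variable": ["methane"], "year": ["2000"]},
]

for granule in granules(constraints[-1]):
    print(granule)
```

This prints the two granules named above, one for `latest` and one for `v1.1`. For large constraint files, enumerating granules is likely too expensive; the membership test in `is_available` avoids that by checking each dimension directly.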

Can two datasets from the same family (like ERA5, for example) have different constraints?

Yes. And that is typically the case, i.e. one constraint file per dataset. They can vary a lot:

  • from very brief constraint files (a few constraints; a few KB) to very large ones (thousands of constraints; a few MB)
  • from uniform looking constraints (all constraints looking the same except one dimension changing value, e.g. time), to very complex ones (where the constraints/data cubes vary strongly in terms of dimensionality).

If you have any other questions, please let us know.
Thank you for your interest!

Have a nice day!

Petrut COBARZAN & the team

@cataalbu

cataalbu commented Apr 7, 2024

For example, for this dataset: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels?tab=form
the API request generated from the interface is this one:

import cdsapi

c = cdsapi.Client()

c.retrieve(
    'reanalysis-era5-pressure-levels',
    {
        'product_type': 'reanalysis',
        'format': 'grib',
        'time': '00:00',
        'day': [
            '29', '30',
        ],
        'month': [
            '01', '02',
        ],
        'year': '2024',
        'pressure_level': '50',
        'variable': [
            'divergence', 'geopotential',
        ],
    },
    'download.grib')

By "Automatise the definition of a valid set of requests before submission via api", do you mean that we should split this request in two on the client side: one for {..., 'day': ['29'], 'month': ['01', '02'], ...} and one for {..., 'day': ['30'], 'month': ['01'], ...}, as the 30th of February does not exist?

@ecmwf-cobarzan

Yes. That is a cleverly constructed example (that can be inferred without knowing the specific constraints for this dataset). Very good!

The initial request would be broken into (at least) these two sub-requests, which might themselves be broken into more fine-grained sub-requests (if necessary, and not necessarily in this order). The ultimate objective is to determine (and then submit for execution) a set of sub-requests for which the entire corresponding data cube is available.

Ideally, the union of this set of sub-requests would be equal to the intersection between the client's initial request/selection and the set of constraints/available data cubes. Also, the set would be pairwise disjoint (so that no data granule is covered more than once). Ultimately, the set would be as small as possible (so that we perform the minimum possible number of requests). However, the size of each individual request should be (generally) small enough so that the CDS engine does not get clogged with large requests.
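One simple way to approximate this is to intersect the request with each constraint and keep the non-empty results. The sketch below assumes the request and every constraint share the same dimensions, and the day/month constraints are hypothetical values invented for illustration (they are not the real ERA5 constraints). `split_request` is a hypothetical helper, not a cdsapi function.

```python
def as_list(value):
    """Normalise a scalar or list request value to a list."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def split_request(request, constraints):
    """Intersect a request with each constraint; keep non-empty intersections."""
    selection = {key: as_list(value) for key, value in request.items()}
    sub_requests = []
    for constraint in constraints:
        if set(constraint) != set(selection):
            continue  # differing dimensionality is not handled in this sketch
        sub = {
            key: [v for v in selection[key] if v in constraint[key]]
            for key in constraint
        }
        if all(sub.values()):  # every dimension kept at least one value
            sub_requests.append(sub)
    return sub_requests


# Hypothetical constraints: January has days 29 and 30, February only day 29.
hypothetical_constraints = [
    {"month": ["01"], "day": ["29", "30"]},
    {"month": ["02"], "day": ["29"]},
]

request = {"day": ["29", "30"], "month": ["01", "02"]}
for sub_request in split_request(request, hypothetical_constraints):
    print(sub_request)
```

Each sub-request is a complete data cube because it lies inside a single constraint. The split follows how the constraints happen to be cut, so it can differ from the day-based split proposed in the question while covering the same granules. The sub-requests are pairwise disjoint only if the constraints themselves do not overlap, and the sketch makes no attempt to merge sub-requests to minimise their number or to cap their size.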

@cataalbu

cataalbu commented Apr 8, 2024

Will the implementation of the solution be integrated into the cdsapi?

@ecmwf-cobarzan

Yes (subject to the quality of the resulting solution, of course). Development could be carried out in a fork of the repository or as a totally independent solution.

@cataalbu

cataalbu commented Apr 8, 2024

In this dataset, I think the "Pressure level" section is optional. How are these optional keys defined in a constraint?

@ecmwf-cobarzan

In situations where the constraints/data cubes vary in terms of dimensionality, some widgets/dimensions are not required in certain selection combinations. In the example above, the pressure level is only relevant for multi-level variables. In such cases, the constraints concerning single-level variables would not contain the pressure level dimension, while the ones concerning the multi-level variables might.
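A client-side check could handle this by comparing a selection only on the dimensions that a given constraint defines. The sketch below is one possible reading of the behaviour described above, not a confirmed rule of the CDS: it assumes that keys absent from a constraint are simply not required and can be dropped. The function name, variable names and constraint values are hypothetical placeholders.

```python
def project_onto_constraint(selection, constraint):
    """Return the part of a single-valued selection covered by the constraint.

    Keys the constraint does not define are treated as not required and
    dropped. Returns None if the selection lacks a dimension the constraint
    defines, or if its value for that dimension is not allowed.
    """
    projected = {}
    for key, allowed in constraint.items():
        if selection.get(key) not in allowed:
            return None
        projected[key] = selection[key]
    return projected


# Hypothetical constraints: one single-level cube and one multi-level cube.
hypothetical_constraints = [
    {"variable": ["single_level_var"]},
    {"variable": ["multi_level_var"], "pressure_level": ["50", "500"]},
]

selection = {"variable": "single_level_var", "pressure_level": "50"}
for constraint in hypothetical_constraints:
    print(project_onto_constraint(selection, constraint))
```

Here the pressure level is dropped for the single-level cube, while the multi-level cube rejects the selection because the variable does not match.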

7 participants