Custom aggregation #147

Merged
merged 11 commits into from Feb 8, 2023

@aulemahal aulemahal commented Jan 31, 2023

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
    • This PR fixes #xyz
  • (If applicable) Documentation has been added / updated (for bug fixes / features)
  • HISTORY.rst has been updated (with summary of main changes)
    • Link to issue (:issue:number) and pull request (:pull:number) has been added

What kind of change does this PR introduce?

New to_dataset method on DataCatalog.

It works like to_dask, but exposes options to control the aggregation:

  • concat_on to list columns over which the datasets are concatenated.
  • ensemble_on to list columns over which a realization dimension is created.
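
To make the two options concrete, here is a minimal pure-Python sketch of the grouping they describe. It is not the code of this PR: the function plan_aggregation, the rows, the paths and the "InstB"/"ModelB" member are all hypothetical. It also assumes that realization labels are the ensemble_on values joined with an underscore, as the 'CCCma_CanESM2' label in the example output suggests.

```python
from collections import defaultdict

# Hypothetical catalog rows. Column names follow the example in this PR;
# the paths and the "InstB"/"ModelB" member are invented for illustration.
rows = [
    {"institution": "CCCma", "source": "CanESM2", "experiment": "rcp45", "path": "a.zarr"},
    {"institution": "CCCma", "source": "CanESM2", "experiment": "rcp85", "path": "b.zarr"},
    {"institution": "InstB", "source": "ModelB", "experiment": "rcp45", "path": "c.zarr"},
    {"institution": "InstB", "source": "ModelB", "experiment": "rcp85", "path": "d.zarr"},
]


def plan_aggregation(rows, concat_on, ensemble_on):
    """Map each tuple of concat_on values to {realization label: path}."""
    plan = defaultdict(dict)
    for row in rows:
        # Columns in concat_on become coordinates of new concatenation dimensions.
        concat_key = tuple(row[col] for col in concat_on)
        # Columns in ensemble_on are merged into one label on the realization dimension.
        realization = "_".join(row[col] for col in ensemble_on)
        plan[concat_key][realization] = row["path"]
    return dict(plan)


plan = plan_aggregation(rows, concat_on=["experiment"], ensemble_on=["institution", "source"])
for key, members in plan.items():
    print(key, sorted(members))
```

Each key of plan corresponds to one position along the experiment dimension, and each inner dictionary lists the members that would be stacked along realization at that position.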

The goal of this function is to reduce code complexity in some common cases, for example when one wants a single dataset with all members and experiments.

I also added a "good to know" page to the docs, a place to list all sorts of miscellaneous information that xscen users should be aware of. The first section is about how to open data.

Example:

cat = xs.DataCatalog("/you/know/where/ESPO-extra.json")

ds = cat.search(
    bias_adjust_project='ScenGen', xrfreq='QS-DEC'
).to_dataset(concat_on=['experiment'], ensemble_on=['institution', 'source'])

The output:

<xarray.Dataset>
Dimensions:                   (experiment: 2, realization: 11, time: 605, lat: 320, lon: 416)
Coordinates:
  * lat                       (lat) float32 66.62 66.54 66.46 ... 40.12 40.04
  * lon                       (lon) float32 -89.05 -88.96 ... -54.55 -54.46
  * time                      (time) datetime64[ns] 1949-12-01 ... 2100-12-01
  * experiment                (experiment) object 'rcp45' 'rcp85'
  * realization               (realization) object 'CCCma_CanESM2' ... 'NOAA-...
...

The same code with pure xscen:

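import xarray as xr
import xclim.ensembles
import xscen as xs

# `indicators` is assumed to be an existing list of indicator names.
dss = []
for exp in ['rcp45', 'rcp85']:
    cats = xs.search_data_catalogs(
        "/you/know/where/ESPO-extra.json",
        variables_and_freqs={ind: 'QS-DEC' for ind in indicators},
        # Restrict each pass to the current experiment.
        other_search_criteria={'domain': 'QC', 'experiment': exp},
    )
    # One dataset per catalog entry, keyed by dataset id.
    dsd = {}
    for dsid, cat in cats.items():
        dsd[dsid] = xs.extract_dataset(cat)['QS-DEC']
    # Stack the members along a new realization dimension.
    dss.append(xclim.ensembles.create_ensemble(dsd, calendar='standard', resample_freq='QS-DEC'))
# Concatenate the two ensembles along a new experiment dimension.
ds = xr.concat(dss, xr.DataArray(['rcp45', 'rcp85'], dims=('experiment',), name='experiment'))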

The added value of this PR seems evident to me here.

Where?

As it was developed for an ensemble case, I implemented it as a single-dataset output. But that seems a bit limited. In an xscen world, this could be implemented at the search_data_catalogs level.

However, it felt to me that the search_data_catalogs/extract_dataset combo is best used for raw data. Once at the ensemble step, a simple DataCatalog.search is often enough. Thus the need to have this on DataCatalog.

Also, this could be moved upstream to intake-esm, but the ensemble_on and calendar args seem a bit too xclim-specific...

@aulemahal aulemahal added documentation Improvements or additions to documentation enhancement New feature or request labels Feb 3, 2023
@aulemahal aulemahal marked this pull request as ready for review February 3, 2023 23:09

@juliettelavoie juliettelavoie left a comment


I think the good to know page is really neat!
I am having issues using to_dataset. Unclear to me if I don't understand what it is supposed to do or if it is not working...

@aulemahal

@RondeauG I re-edited the docstring because it felt a bit redundant.

Co-authored-by: RondeauG <38501935+RondeauG@users.noreply.github.com>

@RondeauG RondeauG left a comment


I think it's good to go!

@aulemahal aulemahal merged commit 8e44005 into main Feb 8, 2023
@aulemahal aulemahal deleted the custom-agg branch February 8, 2023 17:16