This issue was moved to a discussion.

How can we make intake-esm more transparent? #163

Closed
rabernat opened this issue Oct 18, 2019 · 10 comments
Labels
discuss Topics for discussion. Might end in an enhancement or question label.

Comments

@rabernat

I'm sitting with @naomi-henderson, and we are discussing how we might make intake-esm more transparent about what it's doing under the hood.

It would be nice if there were a mode where, rather than running all the merge operations, intake-esm returned a nested dictionary similar to the one I showed in my recursive merge demo:

{'X': {'A': {'1': ds1, '2': ds2}, 'B': {'1': ds3, '2': ds4}},
 'Y': {'A': {'1': ds5, '2': ds6}, 'B': {'1': ds7, '2': ds8}}}

This would allow users to manually descend into the individual datasets and examine them one at a time, optionally applying their own merge logic.

This should be relatively easy, since intake-esm probably has an internal data structure like this already.
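As a rough illustration of what "descending" into such a structure could look like, here is a minimal sketch in plain Python. The helper name iter_leaves is hypothetical and not part of intake-esm, and strings stand in for the xarray datasets.

```python
from typing import Any, Iterator, Tuple


def iter_leaves(tree: dict, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Yield (path, leaf) pairs from a nested dictionary, depth first."""
    for key, value in tree.items():
        if isinstance(value, dict):
            # Descend one level, remembering the key we came through.
            yield from iter_leaves(value, path + (key,))
        else:
            yield path + (key,), value


# Strings stand in for xarray datasets (assumption for illustration only).
nested = {
    "X": {"A": {"1": "ds1", "2": "ds2"}, "B": {"1": "ds3", "2": "ds4"}},
    "Y": {"A": {"1": "ds5", "2": "ds6"}, "B": {"1": "ds7", "2": "ds8"}},
}

# Flatten to {path tuple: dataset} so each dataset can be inspected on its own.
leaves = dict(iter_leaves(nested))
print(leaves[("X", "B", "1")])
```

With the flattened view, a user could inspect one dataset at a time and only then decide which ones to merge.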

@matt-long
Contributor

It should be relatively easy to return the nested dictionary.

A couple of other ideas: an aggregate=False option, which would return each of the individual datasets, and a get_keys() method, which would just return the keys that are built by the to_dataset_dict method.

@rabernat
Author

enabling an aggregate=False option

👍

@rabernat
Author

an aggregate=False option

More thoughts: how would this work? What would the keys be? Would it just group by all columns?

@matt-long
Contributor

It would return a dataset for each row in the database. We could form keys from the groupby applied to all columns, but maybe it would be more accessible if the key was just the index. What do you think?
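To make the two options concrete, here is a small sketch in plain Python. The rows, the reduced column set, and the helper group_key are assumptions for illustration; they are not intake-esm internals.

```python
# Two illustrative catalog rows with a reduced, hypothetical set of columns.
rows = [
    {"source_id": "CanESM5", "experiment_id": "historical", "table_id": "Oyr", "variable_id": "o2"},
    {"source_id": "IPSL-CM6A-LR", "experiment_id": "historical", "table_id": "Oyr", "variable_id": "o2"},
]


def group_key(row: dict, columns: list, sep: str = ".") -> str:
    """Build a key by joining the row's values for every column."""
    return sep.join(str(row[c]) for c in columns)


columns = list(rows[0])

# Option 1: keys formed from all columns (descriptive, but long).
by_group = {group_key(row, columns): row for row in rows}

# Option 2: keys are just the row index (short, but opaque).
by_index = dict(enumerate(rows))

print(list(by_group))
print(list(by_index))
```

The trade-off is readability versus brevity: the joined key says what the dataset is, while the index requires a lookup back into the catalog.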

@rabernat
Author

What would intake-esm currently do if there were no aggregation_control entry in the collection description?

@rabernat
Author

Answer:

Raise KeyError: 'aggregation_control'

That is NOT the right behavior. Aggregation should be totally 100% optional in these catalogs.

@matt-long
Contributor

matt-long commented Oct 18, 2019

Agreed, that's a bug, but easy to fix. Without aggregation_control the code forms groups over all columns:

groups = self.df.groupby(self.df.columns.tolist())

and the returned keys will be of the same format. We can trigger the same behavior if aggregate=False.
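A minimal sketch of that fallback, written in plain Python rather than pandas so it stays self-contained. The function form_groups, the spec layout, and the groupby_attrs field name are assumptions for illustration and are not taken from the intake-esm source.

```python
from collections import defaultdict


def form_groups(spec: dict, rows: list, aggregate: bool = True) -> dict:
    """Group rows by the configured columns, or by all columns as a fallback."""
    # .get() avoids the KeyError when 'aggregation_control' is absent.
    control = spec.get("aggregation_control")
    if aggregate and control is not None:
        columns = control["groupby_attrs"]  # assumed field name
    else:
        # No aggregation requested or configured: group over every column,
        # which yields one group per distinct row.
        columns = list(rows[0]) if rows else []

    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in columns)].append(row)
    return dict(groups)


rows = [
    {"source_id": "CanESM5", "member_id": "r1i1p1f1"},
    {"source_id": "CanESM5", "member_id": "r2i1p1f1"},
]
spec = {"aggregation_control": {"groupby_attrs": ["source_id"]}}

aggregated = form_groups(spec, rows)
no_control = form_groups({}, rows)
disabled = form_groups(spec, rows, aggregate=False)
print(len(aggregated), len(no_control), len(disabled))
```

Here a missing aggregation_control entry and aggregate=False take the same code path, which matches the behavior described above.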

@andersy005
Member

@naomi-henderson, @rabernat,

With #164 the following works:

import intake
col_file = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"

col = intake.open_esm_datastore(col_file)
query = dict(experiment_id='historical', table_id='Oyr',
             variable_id='o2', grid_label='gn', member_id='r1i1p1f1')
cat = col.search(**query)



# Disable aggregations
dsets_pp = cat.to_dataset_dict(aggregate=False)
print(dsets_pp.keys())
--> The keys in the returned dictionary of datasets are constructed as follows:
	'zstore'

--> There will be 2 group(s)

dict_keys(['gs://cmip6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Oyr/o2/gn/', 'gs://cmip6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Oyr/o2/gn/'])

@rabernat
Author

@andersy005 - nice! However, I would prefer for the keys to be the groups, not the paths, as @matt-long suggested.

Are the keys the datasets themselves?

@andersy005
Member

andersy005 commented Oct 19, 2019

Assuming that we have a row with the following attributes:

activity_id                                              AerChemMIP
institution_id                                                  BCC
source_id                                                  BCC-ESM1
experiment_id                                                ssp370
member_id                                                  r1i1p1f1
table_id                                                       Amon
variable_id                                                      pr
grid_label                                                       gn
zstore            gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...
dcpp_init_year                                                  NaN
Name: 0, dtype: object

I would prefer for the keys to be the groups, not the paths, as @matt-long suggested.

Should we have something along these lines?

{ 'AerChemMIP.BCC.BCC-ESM1.ssp370.r1i1p1f1.Amon.pr.gn.NaN' : 
  <xarray.Dataset>
Dimensions:    (bnds: 2, lat: 64, lon: 128, time: 492)
Coordinates:
  * lat        (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(64, 2), meta=np.ndarray>
  * lon        (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray>
  * time       (time) object 2015-01-16 12:00:00 ... 2055-12-16 12:00:00
    time_bnds  (time, bnds) object dask.array<chunksize=(492, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
    pr         (time, lat, lon) float32 dask.array<chunksize=(492, 64, 128), meta=np.ndarray>
Attributes:
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            AerChemMIP
    further_info_url:       https://furtherinfo.es-doc.org/CMIP6.BCC.BCC-ESM1...
    grid:                   T42

@andersy005 pinned this issue Nov 1, 2019
@andersy005 added the discuss label and removed the cmip6 label May 7, 2020
@intake locked and limited conversation to collaborators Sep 18, 2022
@andersy005 converted this issue into discussion #531 Sep 18, 2022
