
Function to combine/drop matching datasets with different grid labels #8

Closed
Timh37 opened this issue Feb 2, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@Timh37
Owner

Timh37 commented Feb 2, 2023

Related to #3, it would be good to have a custom function with the appropriate logic to choose which grid_label to use for all simulations of a model if there are multiple grid labels to choose from.

For example, we may have 'GFDL-CM4.gr1.ssp585.day.r1i1p1f1' and 'GFDL-CM4.gr2.ssp585.day.r1i1p1f1'. We only want to keep one of these. A strategy may be to keep the grid label with the most available members. Ideally, we would also ensure that the same grid label is used for different SSPs.

@Timh37 Timh37 added the "enhancement: New feature or request" label Feb 2, 2023
@Timh37
Owner Author

Timh37 commented Feb 10, 2023

Something like:

import numpy as np

def select_grid_with_most_members(ds_list):
    '''Of a list of matching datasets, return the one with the most member_ids.'''
    num_members = np.array([len(ds.member_id) for ds in ds_list])
    return ds_list[np.argmax(num_members)]

with

combine_datasets(ddict,
                 select_grid_with_most_members,
                 match_attrs=['source_id', 'table_id'])

could work, assuming that ddict contains datasets already merged over member_ids and experiments?

@jbusecke
Collaborator

Interesting question. I think you might save yourself a lot of work by editing the dataframe with pandas before the loading step, though. I am personally not a whiz with pandas, but I think this would be the right way to go.
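
A minimal sketch of that idea could look like the following (assuming cmip6_cat is an intake-esm catalogue with pangeo-cmip6 column names, and simply dropping one of the two competing grid labels by hand):

# drop the unwanted grid label for one model by boolean indexing on the catalogue dataframe
# ('GFDL-CM4' and 'gr2' are just the example from above, not a general rule)
df = cmip6_cat.df
keep = ~((df.source_id == 'GFDL-CM4') & (df.grid_label == 'gr2'))
cmip6_cat.esmcat._df = df[keep].reset_index(drop=True)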

@Timh37
Owner Author

Timh37 commented Feb 13, 2023

Thanks, I think this works nicely:

import pandas as pd

def reduce_to_max_num_realizations(cmip6_cat):
    '''Reduce grid labels in the pangeo cmip6 catalogue by keeping the grid_label and 'ipf'
    identifier combination with the most datasets (= most realizations if using require_all_on).'''
    df = cmip6_cat.df
    df['ipf'] = [s[s.find('i'):] for s in cmip6_cat.df.member_id]  # extract 'ipf' from 'ripf'

    # generate list of (source_id, ipf, grid_label) tuples that provide the most realizations
    # (note this will omit realizations not available at this grid but possibly at others);
    # head(1) gives the (first) max. value since value_counts sorts max to min
    max_num_ds_tuples = (df.groupby(['source_id', 'ipf'])['grid_label']
                           .value_counts()
                           .groupby(level=0)
                           .head(1)
                           .index.to_list())
    df_filter = pd.DataFrame(max_num_ds_tuples, columns=['source_id', 'ipf', 'grid_label'])  # df to merge the catalogue df on

    df = df_filter.merge(right=df, on=['source_id', 'ipf', 'grid_label'], how='left')  # do the subsetting
    df = df.drop(columns=['ipf'])  # clean up
    cmip6_cat.esmcat._df = df  # (columns now ordered differently, probably not an issue?)

    return cmip6_cat
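
For context, applying it could look roughly like this (the catalogue URL and search terms are illustrative assumptions, not taken from this issue):

import intake

# open the pangeo cmip6 catalogue and subset it before reducing the grid labels
cat_data = intake.open_esm_datastore(
    'https://storage.googleapis.com/cmip6/pangeo-cmip6.json')
cat_data = cat_data.search(experiment_id=['ssp585'], table_id='day',
                           variable_id='tas',
                           require_all_on=['source_id'])
cat_data = reduce_to_max_num_realizations(cat_data)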

@Timh37 Timh37 closed this as completed Feb 13, 2023
@Timh37
Owner Author

Timh37 commented Feb 13, 2023

While cat_data = reduce_to_max_num_realizations(cat_data) above reduces the catalogue nicely, e.g., compare

[screenshot: catalogue summary before the reduction]

with

[screenshot: catalogue summary after the reduction]

, when I call

kwargs = {
    'zarr_kwargs':{
        'consolidated':True,
        'use_cftime':True
    },
    'aggregate':False
}

ddict = cat_data.to_dataset_dict(**kwargs)

on the modified catalogue cat_data, it still loads the files of the unmodified catalogue, as opposed to this example. Apparently cmip6_cat.esmcat._df = df is not sufficient? Any idea what's going on? @jbusecke

@Timh37 Timh37 reopened this Feb 13, 2023
@Timh37
Owner Author

Timh37 commented Feb 13, 2023

This seems to be related to the 'aggregate': False argument. If I remove that, the reduced set of datasets is correctly opened, although also aggregated, which we don't want (because of potential runs with missing timesteps/unequal lengths). Apparently 'aggregate': False applies a deepcopy that forgets the changes made to the catalogue, see this issue.

@Timh37
Owner Author

Timh37 commented Feb 13, 2023

OK, for now I can work around this by setting 'aggregate': True and setting cat_data.esmcat.aggregation_control.groupby_attrs = [] beforehand. That avoids the troublesome deepcopy of 'aggregate': False without actually aggregating, as sketched below.
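
In code, that workaround looks roughly like this (a sketch based on the kwargs from above):

# empty the groupby attributes so nothing is actually aggregated,
# instead of passing 'aggregate': False (which would deepcopy the catalogue
# and discard the dataframe we swapped in via esmcat._df)
cat_data.esmcat.aggregation_control.groupby_attrs = []

ddict = cat_data.to_dataset_dict(zarr_kwargs={'consolidated': True,
                                              'use_cftime': True},
                                 aggregate=True)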

@Timh37 Timh37 closed this as completed Aug 1, 2023