
Function to combine/drop matching datasets with different grid labels #8

Closed
Timh37 opened this issue Feb 2, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@Timh37
Owner

Timh37 commented Feb 2, 2023

Related to #3, it would be good to have a custom function with the appropriate logic to choose which grid_label to use for all simulations of a model if there are multiple grid labels to choose from.

For example, we may have 'GFDL-CM4.gr1.ssp585.day.r1i1p1f1' and 'GFDL-CM4.gr2.ssp585.day.r1i1p1f1'. We only want to keep one of these. A strategy may be to keep the grid label with the most available members. Ideally, we would also ensure that the same grid label is used for different SSPs.

@Timh37 Timh37 added the "enhancement: New feature or request" label Feb 2, 2023
@Timh37
Owner Author

Timh37 commented Feb 10, 2023

Something like:

import numpy as np

def select_grid_with_most_members(ds_list):
    '''Of a list of matching datasets, return the one with the most member_ids.'''
    num_members = np.array([len(ds.member_id) for ds in ds_list])
    return ds_list[np.argmax(num_members)]

with

combine_datasets(ddict,
                 select_grid_with_most_members,
                 match_attrs=['source_id', 'table_id'])

could work, assuming that ddict contains datasets already merged over member_ids and experiments?

@jbusecke
Collaborator

Interesting question. I think you might save yourself a lot of work by editing the dataframe with pandas before the loading step, though. I am personally not a whiz with pandas, but I think this would be the right way to go.
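
A minimal sketch of that idea could look like the following (assuming cmip6_cat is an intake-esm catalogue with pangeo-cmip6 column names, and simply dropping one of the two competing grid labels by hand):

# drop the unwanted grid label for one model by boolean indexing on the catalogue dataframe
# ('GFDL-CM4' and 'gr2' are just the example from above, not a general rule)
df = cmip6_cat.df
keep = ~((df.source_id == 'GFDL-CM4') & (df.grid_label == 'gr2'))
cmip6_cat.esmcat._df = df[keep].reset_index(drop=True)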

@Timh37
Owner Author

Timh37 commented Feb 13, 2023

Thanks, I think this works nicely:

import pandas as pd

def reduce_to_max_num_realizations(cmip6_cat):
    '''Reduce grid labels in the pangeo cmip6 catalogue by keeping the grid_label and 'ipf'
    identifier combination with the most datasets (= most realizations if using require_all_on).'''
    df = cmip6_cat.df
    df['ipf'] = [s[s.find('i'):] for s in cmip6_cat.df.member_id]  # extract 'ipf' from 'ripf'

    # generate list of (source_id, ipf, grid_label) tuples that provide the most realizations
    # (note this will omit realizations not available at this grid but possibly at others);
    # head(1) gives the (first) max. value since value_counts sorts max to min
    max_num_ds_tuples = (df.groupby(['source_id', 'ipf'])['grid_label']
                           .value_counts()
                           .groupby(level=0)
                           .head(1)
                           .index.to_list())
    df_filter = pd.DataFrame(max_num_ds_tuples, columns=['source_id', 'ipf', 'grid_label'])  # df to merge the catalogue df on

    df = df_filter.merge(right=df, on=['source_id', 'ipf', 'grid_label'], how='left')  # do the subsetting
    df = df.drop(columns=['ipf'])  # clean up
    cmip6_cat.esmcat._df = df  # (columns now ordered differently, probably not an issue?)

    return cmip6_cat
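
For context, applying it could look roughly like this (the catalogue URL and search terms are illustrative assumptions, not taken from this issue):

import intake

# open the pangeo cmip6 catalogue and subset it before reducing the grid labels
cat_data = intake.open_esm_datastore(
    'https://storage.googleapis.com/cmip6/pangeo-cmip6.json')
cat_data = cat_data.search(experiment_id=['ssp585'], table_id='day',
                           variable_id='tas',
                           require_all_on=['source_id'])
cat_data = reduce_to_max_num_realizations(cat_data)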

@Timh37 Timh37 closed this as completed Feb 13, 2023
@Timh37
Owner Author

Timh37 commented Feb 13, 2023

While cat_data = reduce_to_max_num_realizations(cat_data) above reduces the catalogue nicely, e.g., compare

[screenshot: catalogue summary before the reduction]

with

[screenshot: catalogue summary after the reduction]

, when I call

kwargs = {
    'zarr_kwargs':{
        'consolidated':True,
        'use_cftime':True
    },
    'aggregate':False
}

ddict = cat_data.to_dataset_dict(**kwargs)

on the modified catalogue cat_data, it still loads the files of the unmodified catalogue, as opposed to this example. Apparently cmip6_cat.esmcat._df = df is not sufficient? Any idea what's going on? @jbusecke

@Timh37 Timh37 reopened this Feb 13, 2023
@Timh37
Owner Author

Timh37 commented Feb 13, 2023

This seems to be related to the 'aggregate': False argument. If I remove that, the reduced set of datasets is correctly opened, although also aggregated, which we don't want (because of potential runs with missing timesteps/unequal lengths). Apparently 'aggregate': False applies a deepcopy that forgets the changes made to the catalogue, see this issue.

@Timh37
Owner Author

Timh37 commented Feb 13, 2023

OK, for now I can work around this by setting 'aggregate': True and setting cat_data.esmcat.aggregation_control.groupby_attrs = [] beforehand. That avoids the troublesome deepcopy of 'aggregate': False without actually aggregating, as sketched below.
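
In code, that workaround looks roughly like this (a sketch based on the kwargs from above):

# empty the groupby attributes so nothing is actually aggregated,
# instead of passing 'aggregate': False (which would deepcopy the catalogue
# and discard the dataframe we swapped in via esmcat._df)
cat_data.esmcat.aggregation_control.groupby_attrs = []

ddict = cat_data.to_dataset_dict(zarr_kwargs={'consolidated': True,
                                              'use_cftime': True},
                                 aggregate=True)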

@Timh37 Timh37 closed this as completed Aug 1, 2023