Function to combine/drop matching datasets with different grid labels #8
Comments
Something like:
def select_grid_with_most_members(ds_list):
    num_members = np.array([len(ds.member_id) for ds in ds_list])
    return ds_list[np.argmax(num_members)]
with
combine_datasets(ddict,
                 select_grid_with_most_members,
                 match_attrs=['source_id', 'table_id'])
could work, assuming that ddict contains datasets already merged on member_id's and experiments? |
Interesting question. I think you might save yourself a lot of work by editing the dataframe with pandas before the loading step though. I am personally not a wiz with pandas but think this would be the right way to go |
Thanks, I think this works nicely:
import pandas as pd

def reduce_to_max_num_realizations(cmip6_cat):
    '''Reduce grid labels in pangeo cmip6 catalogue by
    keeping grid_label and 'ipf' identifier combination with most datasets
    (=most realizations if using require_all_on)'''
    df = cmip6_cat.df
    # extract 'ipf' from 'ripf'
    df['ipf'] = [s[s.find('i'):] for s in cmip6_cat.df.member_id]
    # generate list of tuples of (source_id, ipf, grid_label) that provide most realizations
    # (note this will omit realizations not available at this grid but possibly at others);
    # head(1) gives (first) max. value since value_counts sorts max to min
    max_num_ds_tuples = (
        df.groupby(['source_id', 'ipf'])['grid_label']
        .value_counts()
        .groupby(level=0)
        .head(1)
        .index.to_list()
    )
    # generate df to merge catalogue df on
    df_filter = pd.DataFrame(max_num_ds_tuples, columns=['source_id', 'ipf', 'grid_label'])
    # do the subsetting
    df = df_filter.merge(right=df, on=['source_id', 'ipf', 'grid_label'], how='left')
    # clean up
    df = df.drop(columns=['ipf'])
    # (columns now ordered differently, probably not an issue?)
    cmip6_cat.esmcat._df = df
    return cmip6_cat |
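For anyone who wants to try the counting step without loading the real catalogue, here is a minimal sketch on a toy dataframe. It is a variant, not the function above: it counts rows per (source_id, ipf, grid_label), sorts those counts explicitly, and keeps the top combination per source_id. The helper name keep_most_common_grid and the toy rows (including the model name MODEL-B) are made up for illustration, and it returns a filtered copy rather than writing back to the catalogue.

```python
import pandas as pd

# Toy stand-in for the catalogue dataframe; all rows are invented for illustration.
df = pd.DataFrame(
    {
        "source_id": ["GFDL-CM4"] * 5 + ["MODEL-B"] * 3,
        "member_id": [
            "r1i1p1f1", "r1i1p1f1", "r2i1p1f1", "r3i1p1f1", "r1i1p1f2",
            "r1i1p1f1", "r2i1p1f1", "r1i1p1f1",
        ],
        "grid_label": ["gr1", "gr2", "gr2", "gr2", "gr1", "gn", "gn", "gr"],
    }
)


def keep_most_common_grid(df):
    """Keep, per source_id, the (ipf, grid_label) combination with the most rows."""
    df = df.copy()
    # Pull the trailing 'i..p..f..' part out of the 'r..i..p..f..' member_id.
    df["ipf"] = df["member_id"].str.extract(r"(i\d+p\d+f\d+)$", expand=False)
    # Count rows per combination and sort so the largest count comes first.
    counts = (
        df.groupby(["source_id", "ipf", "grid_label"])
        .size()
        .reset_index(name="n")
        .sort_values("n", ascending=False, kind="stable")
    )
    # The first row per source_id is now the combination with the most rows.
    best = counts.drop_duplicates("source_id")[["source_id", "ipf", "grid_label"]]
    # An inner merge drops every row that is not part of a winning combination.
    filtered = df.merge(best, on=["source_id", "ipf", "grid_label"], how="inner")
    return filtered.drop(columns="ipf")


result = keep_most_common_grid(df)
print(result)
```

As in the function above, realizations that exist only on a grid that was not selected are dropped. Ties are broken by the sorted group order, which is an arbitrary choice.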
While with , when I call
kwargs = {
    'zarr_kwargs': {
        'consolidated': True,
        'use_cftime': True
    },
    'aggregate': False
}
ddict = cat_data.to_dataset_dict(**kwargs)
on the modified catalogue |
This seems to be related to the |
OK, for now I can work around this by setting |
Related to #3, it would be good to have a custom function with the appropriate logic to choose which grid_label to use for all simulations of a model if there are multiple grid labels to choose from. For example, we may have 'GFDL-CM4.gr1.ssp585.day.r1i1p1f1' and 'GFDL-CM4.gr2.ssp585.day.r1i1p1f1'. We only want to keep one of these. A strategy may be to keep the grid label with the most available members. Ideally, we would also ensure that the same grid label is used for different SSPs.
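The strategy described here could be sketched on the dataset keys alone, before touching any data. This is only an illustration under stated assumptions: keys are assumed to follow the source_id.grid_label.experiment_id.table_id.member_id order seen in the example, with no dots inside any field; the function names choose_grid_labels and filter_keys are hypothetical; and the two ssp245 keys are made up. Counting across all experiments at once means one grid label is chosen per model, so the same label is used for every SSP.

```python
from collections import Counter


def choose_grid_labels(keys):
    """Return {source_id: grid_label}, picking the label with the most datasets.

    Datasets are counted across all experiments, so a single grid label is
    chosen per model and reused for every SSP.
    """
    counts = Counter()
    for key in keys:
        # Assumed key layout: source_id.grid_label.experiment_id.table_id.member_id
        source_id, grid_label, _experiment, _table, _member = key.split(".")
        counts[(source_id, grid_label)] += 1

    best = {}
    # Iterating in sorted order makes ties resolve to the alphabetically first label.
    for (source_id, grid_label), n in sorted(counts.items()):
        if source_id not in best or n > counts[(source_id, best[source_id])]:
            best[source_id] = grid_label
    return best


def filter_keys(keys):
    """Keep only the keys whose grid label was chosen for their model."""
    best = choose_grid_labels(keys)
    kept = []
    for key in keys:
        source_id, grid_label = key.split(".")[:2]
        if best[source_id] == grid_label:
            kept.append(key)
    return kept


keys = [
    "GFDL-CM4.gr1.ssp585.day.r1i1p1f1",
    "GFDL-CM4.gr2.ssp585.day.r1i1p1f1",
    "GFDL-CM4.gr1.ssp245.day.r1i1p1f1",  # made-up key for illustration
    "GFDL-CM4.gr1.ssp245.day.r2i1p1f1",  # made-up key for illustration
]

print(choose_grid_labels(keys))
print(filter_keys(keys))
```

With these toy keys, gr1 has three datasets and gr2 has one, so the gr2 key is dropped. A stricter version might instead require that the chosen grid label is available for every SSP of interest.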