
Improve performance of dfmt.preprocess_merge_meteofiles_era5() #839

Closed
veenstrajelmer opened this issue May 13, 2024 · 1 comment · Fixed by #840
@veenstrajelmer (Collaborator)

Merging meteofiles in the modelbuilder with dfmt.preprocess_merge_meteofiles_era5() is quite slow. The example code below isolates the bottleneck: reading the data is fast, but writing the merged dataset to netCDF is slow.

import os
import dfm_tools as dfmt
import datetime as dt

model_name = 'sss'
dir_base = r'p:\11210331-004-bes-modellering-2024\4_simulations\hydrodynamica\preprocessing\modelbuilder'
dir_output = os.path.join(dir_base, f'modelbuilder_output_{model_name}') 

date_min = '2020-12-01'
date_max = '2023-01-01'
time_slice = slice(date_min, date_max)

# define paths and pattern of source data
dir_data_era5 = os.path.join(dir_output, 'data', 'ERA5')
varlist_lists = [['msl','u10n','v10n','chnk'],['d2m','t2m','tcc'],['ssr','strd'],['mer','mtpr']]
varkey_list = varlist_lists[0]
fn_match_pattern = f'era5_.*({"|".join(varkey_list)})_.*.nc'  # simpler but selects more files: 'era5_*.nc'
file_out_prefix = f'era5_{"_".join(varkey_list)}_'
preprocess = dfmt.preprocess_ERA5  # reduces expver dimension if present
file_nc = os.path.join(dir_data_era5, fn_match_pattern)

# read multifile dataset
data_xr_tsel = dfmt.merge_meteofiles(file_nc=file_nc, time_slice=time_slice, preprocess=preprocess)

# write to netcdf file (slow)
print('>> writing file (can take a while): ',end='')
dtstart = dt.datetime.now()
times_pd = data_xr_tsel['time'].to_series()
time_start_str = times_pd.iloc[0].strftime("%Y%m%d")
time_stop_str = times_pd.iloc[-1].strftime("%Y%m%d")
file_out = os.path.join(dir_output, f'{file_out_prefix}{time_start_str}to{time_stop_str}_ERA5.nc')
data_xr_tsel.to_netcdf(file_out)
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')

This could be made faster by passing explicit chunks when reading the files, or by using a different merging method. Some investigation is still needed.
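For reference, a minimal sketch of the chunking idea using xarray directly, rather than dfmt.merge_meteofiles (dfm_tools matches files with a regex internally, while plain xarray expects a glob, so the 'era5_*.nc' glob below is a stand-in; the chunk size of 744 is an assumption matching one month of hourly data):

import os
import xarray as xr

# glob pattern as a stand-in for the regex pattern used above
file_glob = os.path.join(dir_data_era5, 'era5_*.nc')

# larger time chunks let dask write bigger contiguous blocks to netCDF,
# instead of one slab per timestep as with chunks={'time': 1}
ds = xr.open_mfdataset(file_glob, chunks={'time': 744}, preprocess=preprocess)
ds_tsel = ds.sel(time=time_slice)
ds_tsel.to_netcdf(file_out)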

@veenstrajelmer (Collaborator, Author)

The issue is with this part of the code:

if "chunks" not in kwargs.keys():
kwargs["chunks"] = {'time':1}

This makes xarray write all files per timestep instead of the default chunks (time=744 in case of the files above).
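A minimal sketch of a possible fix on the user side, assuming the merged dataset is dask-backed and that time=744 matches the native chunking of the source files:

# rechunk before writing so to_netcdf() emits large contiguous blocks
# instead of one 1-timestep slab per write
data_xr_tsel = data_xr_tsel.chunk({'time': 744})
data_xr_tsel.to_netcdf(file_out)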
