
Improve performance of dfmt.preprocess_merge_meteofiles_era5() #839

Closed
veenstrajelmer opened this issue May 13, 2024 · 1 comment · Fixed by #840
@veenstrajelmer (Collaborator)

Merging meteofiles in the modelbuilder with dfmt.preprocess_merge_meteofiles_era5() is quite slow. The example code below isolates the bottleneck: reading the data is fast, but writing the merged dataset to netCDF is slow.

import os
import dfm_tools as dfmt
import datetime as dt

model_name = 'sss'
dir_base = r'p:\11210331-004-bes-modellering-2024\4_simulations\hydrodynamica\preprocessing\modelbuilder'
dir_output = os.path.join(dir_base, f'modelbuilder_output_{model_name}') 

date_min = '2020-12-01'
date_max = '2023-01-01'
time_slice = slice(date_min, date_max)

# define paths and pattern of source data
dir_data_era5 = os.path.join(dir_output, 'data', 'ERA5')
varlist_lists = [['msl','u10n','v10n','chnk'],['d2m','t2m','tcc'],['ssr','strd'],['mer','mtpr']]
varkey_list = varlist_lists[0]
fn_match_pattern = f'era5_.*({"|".join(varkey_list)})_.*.nc'  # simpler but selects more files: 'era5_*.nc'
file_out_prefix = f'era5_{"_".join(varkey_list)}_'
preprocess = dfmt.preprocess_ERA5  # reduces expver dimension if present
file_nc = os.path.join(dir_data_era5, fn_match_pattern)

# read multifile dataset
data_xr_tsel = dfmt.merge_meteofiles(file_nc=file_nc, time_slice=time_slice, preprocess=preprocess)

# write to netcdf file (slow)
print('>> writing file (can take a while): ',end='')
dtstart = dt.datetime.now()
times_pd = data_xr_tsel['time'].to_series()
time_start_str = times_pd.iloc[0].strftime("%Y%m%d")
time_stop_str = times_pd.iloc[-1].strftime("%Y%m%d")
file_out = os.path.join(dir_output, f'{file_out_prefix}{time_start_str}to{time_stop_str}_ERA5.nc')
data_xr_tsel.to_netcdf(file_out)
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')

This could be made faster by passing explicit chunks when reading the files, or by using a different merging method. Some investigation is still needed.
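For reference, a minimal sketch of the chunking idea using xarray directly, rather than dfmt.merge_meteofiles (dfm_tools matches files with a regex internally, while plain xarray expects a glob, so the 'era5_*.nc' glob below is a stand-in; the chunk size of 744 is an assumption matching one month of hourly data):

import os
import xarray as xr

# glob pattern as a stand-in for the regex pattern used above
file_glob = os.path.join(dir_data_era5, 'era5_*.nc')

# larger time chunks let dask write bigger contiguous blocks to netCDF,
# instead of one slab per timestep as with chunks={'time': 1}
ds = xr.open_mfdataset(file_glob, chunks={'time': 744}, preprocess=preprocess)
ds_tsel = ds.sel(time=time_slice)
ds_tsel.to_netcdf(file_out)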

@veenstrajelmer (Collaborator, Author)

The issue is with this part of the code:

if "chunks" not in kwargs.keys():
kwargs["chunks"] = {'time':1}

This makes xarray write all files per timestep instead of the default chunks (time=744 in case of the files above).
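A minimal sketch of a possible fix on the user side, assuming the merged dataset is dask-backed and that time=744 matches the native chunking of the source files:

# rechunk before writing so to_netcdf() emits large contiguous blocks
# instead of one 1-timestep slab per write
data_xr_tsel = data_xr_tsel.chunk({'time': 744})
data_xr_tsel.to_netcdf(file_out)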
