Grib filtering really slow #640

Closed
jpkeskinen opened this issue Jul 1, 2020 · 43 comments

@jpkeskinen

Hi,

I'm trying to cmorise the AerChemMIP historical run, but the processing of the IFS output takes a lot of time. The grib filtering part seems to be the main problem, with the last task taking about 41 hours. The whole ece2cmor run for IFS takes about 13 days. Also, the time required by the grib filtering appears to increase during the process.

Our computing centre is guessing that the problem could be due to the parallelisation, as ece2cmor currently runs on only one core on our machine. Which parts of ece2cmor are supposed to work in parallel?

I am currently using version 1.5. I used version 1.4 earlier and it worked much faster with (almost) the same data.

Any ideas on how to get ece2cmor to work a bit faster?

@jpkeskinen
Author

Here's a log of an ece2cmor run.
HIS2-ifs-001-20200612023731.log

@treerink
Collaborator

treerink commented Jul 1, 2020

Hi Jukka-Pekka,

First some questions:

So how long does the cmorisation of 1 year of IFS data take?

Are you cmorising model level data? I guess not? If not, did you set --skip_alevel_vars?

The most recent release of ece2cmor3 is v1.4; only the latest master is further ahead. The next version, v1.5, has not been released yet.

@jpkeskinen
Author

  • One year of IFS data took me 13 days, 5 hours, 47 minutes and 1 second to process. I have been able to cmorise only one year so far, but there are currently four more jobs running.
  • I don't know if I am cmorising model level data. I've just tried to follow whatever instructions I've found. In any case, it's the AerChemMIP historical experiment and I'm not attempting anything extra. I did not set --skip_alevel_vars, but this was because Philippe writes in the EC-Earth portal that "when running ece2cmor, do NOT use the --skip_alevel_vars option."
  • Sorry, I mixed up the version numbers. I'm now using 1.4 and used 1.3 earlier.

@goord
Collaborator

goord commented Jul 1, 2020

Hi @jpkeskinen, what kind of storage is your raw output on? It's not some network-mounted disk, right?

@jpkeskinen
Author

Hi Gijs,

The raw EC-Earth output is on our scratch. I couldn't find any further specifics other than that it is a "Lustre parallel storage system".

@goord
Collaborator

goord commented Jul 1, 2020

Can you do a speed test and copy a year of data from your model output location to your temporary data folder (the one you specify in the ece2cmor3 invocation)?
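For instance, something along these lines (the two variables are just placeholders for your raw output and temporary directories):

  # time copying one year of raw IFS grib files to the ece2cmor3 temporary directory
  time cp "$RAW_IFS_DIR"/ICMGG* "$RAW_IFS_DIR"/ICMSH* "$ECE2CMOR_TMPDIR"/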

@goord
Collaborator

goord commented Jul 1, 2020

Hmm, strange; Lustre should work fine. The filtering is the slow part of the processing, but it shouldn't take more than 15 min for standard-resolution IFS on Lustre... but please do copy the grib files just to be sure it's not an issue with your storage.

@jpkeskinen
Author

I've tried two different locations for the temporary data: the scratch and the local scratch. The latter is supposed to be faster, and I currently have ece2cmor running there (almost 4 days gone and still filtering the grib files).

Copying a year of raw IFS data takes 17 min 43.200 s to the regular scratch and 8 min 28.244 s to the local scratch.

@goord
Collaborator

goord commented Jul 1, 2020

From your log I see the post-processing and cmorization also take an abnormally long time. Which machine are you running on?

@jpkeskinen
Author

The computer is CSC's Puhti. It's an Atos cluster with Intel Xeon processors (Cascade Lake) at 2.1 GHz.

@goord
Collaborator

goord commented Jul 1, 2020

Hi @jpkeskinen, maybe it is something with the conda installation? Can you use the conda-installed cdo to do a simple task with this (select some grib code and do e.g. a spectral transform)? So in your submit script, first activate the ece2cmor conda environment and then run cdo -f nc -monmean -sp2gpl -selcode,130 <output-file> ta.nc.

It may also be that you are using too many threads, but that shouldn't affect the filtering too much, as it uses only 2 threads.
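In the submit script that would look roughly like this (the environment name and the ICMSH file name are placeholders for your own):

  # activate the conda environment into which ece2cmor3/cdo were installed, then time the test
  conda activate ece2cmor3
  time cdo -f nc -monmean -sp2gpl -selcode,130 ICMSHHIS2+185001 ta.nc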

@jpkeskinen
Author

jpkeskinen commented Jul 2, 2020

The cdo command gives me an error:

Warning (cgribexScanTimestep2): Record 104 (id=133.128 lev1=100000 lev2=0) timestep 2: Parameter not defined at timestep 1!
cdo(3) selcode: Open failed on >ICMGGHIS2+185007<
Unsupported file structure

@jpkeskinen
Author

I managed to work around the problem with grib_filter (thanks to Tommi). I now have two grib files, one presumably for model levels and the other for pressure levels. The cdo command takes about 1 second for the pressure level file and about 2 minutes for the model level file.
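For reference, a split like that can be expressed as a small grib_filter rules file; this is only a generic sketch (file names are placeholders), not necessarily the exact rules used here:

  # write every GRIB record to a file named after its level type (e.g. hybrid vs. isobaricInhPa)
  printf 'write "ICMGG_split_[typeOfLevel].grib";\n' > split.rules
  grib_filter split.rules ICMGGHIS2+185007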

@etiennesky
Contributor

In the end, what was the solution? It might be helpful for others in the same situation.

@jpkeskinen
Author

In the end, what was the solution? It might be helpful for others in the same situation.

The original problem is still there. I was just referring to the cdo test cdo -f nc -monmean -sp2gpl -selcode,130 <output-file> ta.nc requested by Gijs.

@Declan-FMI

I have the same problem as Jukka-Pekka, both on puhti and on cca at ECMWF. On cca, even when modifying only the paths in ece2cmor/scripts/submit_script_examples/submit-at-cca-ece2cmor-leg-job.sh, which Thomas says works for him, and specifying --skip-alevel-vars (for test purposes), I still get the same exponentially growing execution times in grib_filter.

@treerink
Collaborator

treerink commented Jul 2, 2020

@plesager you are cmorising similar ECE-AerChem results on cca with reasonable speed, right?

@goord
Collaborator

goord commented Jul 2, 2020

Ok, it seems I broke the performance of the grib filter during the previous release... I will have a look at it tomorrow.

@plesager
Contributor

plesager commented Jul 3, 2020

@plesager you are cmorising similar ECE-AerChem results on cca with reasonable speed, right?

Cmorizing one year of IFS (including model levels) with 1.4, I have:

  • on cca (ECMWF): 2.5 to 2.8 hours (18 threads, max in nf queue). One experiment, variability due to system load.
  • on rhino (KNMI Atos//bull): 3.45 to 4.25 hours (28 threads). Very steady, the difference stems from experiment data requests.

Because I cmorize leg N-2 when running leg N, I've not paid too much attention to the performance (as long as cmorization is faster than EC-Earth I'm happy), and I do not have the numbers from the previous set of runs I did with 1.3 (I vaguely remember it being closer to 3 hours for one leg of IFS).

@goord
Collaborator

goord commented Jul 3, 2020

Perhaps @Declan-FMI and @jpkeskinen you are using too few or too many threads? Can you post the submission script?

@jpkeskinen
Author

The 13-day run was done with a single process. This was because I was unable to get any clear advantage from using more. For this reason, I suspect I am doing something wrong regarding the parallelisation of ece2cmor.

I am not that familiar with threads but I am trying different things. The submission script I used for the currently running jobs (5 and ½ days gone and still filtering the grib files) is here:
ece2cmor-log-array-job-historical-ifs.sh.txt

@goord
Collaborator

goord commented Jul 3, 2020

Hi @jpkeskinen, what happens if you omit the '#SBATCH -n 1' line? This may confuse the workload manager; it can deduce from your -c option how many resources to allocate. You should give ece2cmor3 at least 2 processes, because that allows the grib filter to process the gridpoint file and the spectral file in parallel.

If the problem persists, you will need to profile the application on your system; it will be an interesting exercise!
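A minimal sketch of the batch header I have in mind (job name, core count and wall time are placeholders to adapt to your system):

  #!/bin/bash
  #SBATCH --job-name=cmor-ifs     # placeholder job name
  #SBATCH -c 4                    # at least 2 cores, so the gridpoint and spectral files can be filtered in parallel
  #SBATCH --time=24:00:00         # placeholder wall time
  # note: no "#SBATCH -n 1" line; the scheduler derives the allocation from -c

  source activate ece2cmor3       # or "conda activate ece2cmor3", depending on your conda setup
  # ... the usual ece2cmor3 invocation goes here ...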

@Declan-FMI

Here are the scripts I used on cca / puhti. I have tried cutting the number of cores, increasing the memory per core, using eccodes instead of grib-api, and reverting to ece2cmor 1.3.0 (yes, I know it is not compatible with EC-Earth 3.3.2.1), but none of those helped.
Module environment on puhti:

Currently Loaded Modules:
  1) intel/18.0.5         4) cdo/1.9.7.1     7) hdf5/1.10.4-mpi        10) grib-api/1.24.0
  2) intel-mkl/2018.0.5   5) nco/4.9.3       8) netcdf/4.7.0           11) hpcx-mpi/2.4.0
  3) hdf/4.2.13           6) ncview/2.1.7    9) netcdf-fortran/4.4.4

and on cca:

Currently Loaded Modulefiles:
  1) modules/3.2.10.5(default)           15) dmapp/7.0.1-1.0502.11080.8.76.ari
  2) eswrap/1.3.3-1.020200.1280.0        16) rca/1.0.0-2.0502.60530.1.62.ari
  3) switch/1.0-1.0502.60522.1.61.ari    17) atp/2.1.0(default)
  4) intel/17.0.3.053(default)           18) PrgEnv-intel/5.2.82
  5) craype-broadwell                    19) jasper/1.900.1(default)
  6) craype-network-aries                20) grib_api/1.24.0(nov17)
  7) craype/2.5.9(default)               21) nco/4.9.2(jun20:new:default)
  8) pbs/13.0.403.161593                 22) subversion/1.13.0(new:default)
  9) cray-mpich/7.5.3(default)           23) python/2.7.15-01(jun20:new:jun19:default)
 10) cdt/17.03                           24) git/2.26.0(jun20:new:default)
 11) cray-libsci/16.11.1(default)        25) cray-netcdf-hdf5parallel/4.4.1.1(default)
 12) udreg/2.3.2-1.0502.10518.2.17.ari   26) cray-hdf5-parallel/1.10.0.1(default)
 13) ugni/6.0-1.0502.10863.8.29.ari      27) cdo/1.9.8(jun20:new:default)
 14) pmi/5.0.11(default)

cca-ece2cmor.sh.txt
cmorize_hist_ens3.sh.txt

@jpkeskinen
Author

Hi @jpkeskinen, what happens if you omit the '#SBATCH -n 1' line? This may confuse the workload manager; it can deduce from your -c option how many resources to allocate. You should give ece2cmor3 at least 2 processes, because that allows the grib filter to process the gridpoint file and the spectral file in parallel.

I can try this next and I'll report back what happens. However, I'm starting my holidays today and will be back in August. Thank you for the help so far!

@goord
Collaborator

goord commented Jul 3, 2020

@plesager can you spot the difference with your setup? Are you sure you are actually running the 1.4 version?

@jpkeskinen I too am leaving for holidays (right now, actually). When I get back on the 10th I will have a look at the issue at ECMWF. @Declan-FMI I propose you make your output files (or at least one leg) at ECMWF readable to all so that I can run the cmorization myself and profile the application. You can also try to use the parallel queue and see if it is faster. I can start on this when I get back, i.e. Monday the 13th.

@Declan-FMI

Output files under
/scratch/ms/fi/fi2k/aerchemmip/HPI2/output
should be readable (in fact my whole $SCRATCH)

@plesager
Contributor

plesager commented Jul 3, 2020

I confirm that I'm using 1.4 on cca. From the log:

2020-07-03 09:15:13 INFO:ece2cmor3.ece2cmorlib: Metadata attribute history: processed by ece2cmor v1.4.0, git rev. local unknown branch

@plesager
Contributor

plesager commented Jul 3, 2020

That cca-ece2cmor.sh.txt is not easy to compare with my script. Could you point me to a script it created (one of its ${job_name})? I spotted one difference, though. I have this line:
export HDF5_USE_FILE_LOCKING=FALSE
in my script. Not sure if that makes any difference.

@Declan-FMI

I have overwritten the generated script a few times today, and it is currently running in an effort to profile the grib_filter part of ece2cmor3. So the script below differs a little from what was generated by the script I posted earlier (I just added some stuff to set up Extrae: module load papi extrae and the 6 lines below it).
cmor_HPI2-ifs-001.sh.txt

@plesager
Contributor

plesager commented Jul 3, 2020

Ok, so another difference, but again I doubt it changes anything. I do not have:

export OMP_NUM_THREADS=18

@Declan-FMI

No, that last one I have tried with and without; it makes no difference whatsoever. Unsurprisingly, since OpenMP does not enter into the equation at any point.

@Declan-FMI

Still no progress. On cca, Philippe has kindly shared his post-processing script, but despite using it with minimal adaptation, and installing the same conda version as Philippe is using (4.7.12), I still get these huge increases in grib_filter times. At the last attempt, the second ICMSH file took 1 hour.

@goord
Collaborator

goord commented Jul 13, 2020

OMP_NUM_THREADS should not make any difference; ece2cmor3 uses subprocesses, not OpenMP threads. I will have a look this afternoon, Declan.

@goord
Collaborator

goord commented Jul 15, 2020

I see the same bottleneck, Declan. I believe it is the model levels in the raw output; Philippe, do you have any of those? I am working on a solution to 'fast-forward' through model level blocks if they aren't requested.

@plesager
Contributor

Yes, you can check the output in:
/scratch/ms/nl/nm6/ECEARTH-RUNS/onep/output/ifs/NNN
with NNN > 113. The corresponding control output dir is:
/scratch/ms/nl/nm6/3.3.2.1-aerchemmip-RT/classic/ctrl/cmip6-output-control-files-pextra/CMIP/EC-EARTH-AerChem/cmip6-experiment-CMIP-1pctCO2

@goord
Collaborator

goord commented Jul 17, 2020

The main issue here seemed to be a memory cleanup bug in the calls to the grib api. I have also included a 'fast-forward' feature to skip model level blocks in the input grib files if they are not requested. I am doing the final testing now in the filter-fast-forward branch.

@goord
Collaborator

goord commented Jul 17, 2020

My tests seem to be ok. @jpkeskinen and @Declan-FMI, can you test the filter-fast-forward branch?
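Roughly, from a source checkout installed into the ece2cmor3 conda environment (the checkout path and the install step are assumptions that may differ on your setup):

  cd ece2cmor3                    # placeholder path to your ece2cmor3 source checkout
  git fetch origin
  git checkout filter-fast-forward
  pip install -e .                # re-install so the active environment picks up the branch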

@Declan-FMI

Now running on cca and puhti (the Finnish supercomputer centre's Bull machine). Let's see... If I understand rightly, this should speed up cmorization where model level data is not requested, but if it is requested we should not expect to see any difference. Or did I get something wrong?

@goord
Collaborator

goord commented Jul 17, 2020

The memory issue I solved should make everything faster. The fast-forward feature should help when model-level data is not requested (--skip-alevel-vars) but is present in your raw grib files. By the way, you can monitor the filtering a bit by just listing all file sizes in the temp directory.
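For example (with $TEMP_DIR standing in for the temporary directory you pass to ece2cmor3):

  # refresh a file-size listing of the filtered grib chunks every minute
  watch -n 60 ls -lh "$TEMP_DIR"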

@Declan-FMI

It is clear even from the tail of the log files that this is running much faster now. Both on puhti and cca.

@Declan-FMI

Cmorization appears to have completed successfully on both puhti and cca within a very acceptable time. I haven't made any detailed check of the results (which is why I wrote 'appears to'), but at least the present issue looks to be solved.

@treerink
Collaborator

I got identical results with the filter-fast-forward branch for my test-all case, see #542. I was about to merge this branch into the master, but saw that the latest Travis test of this branch failed, which held me back from merging.

@goord
Collaborator

goord commented Jul 22, 2020

Tests have been fixed, and the branch has been merged (squashed).

@goord goord closed this as completed Jul 22, 2020