Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in on-the-fly Swift hydro run (extra property output not fully implemented?) #13

Closed
jchelly opened this issue Aug 10, 2020 · 6 comments

Comments

@jchelly
Copy link

jchelly commented Aug 10, 2020

I'm running Swift with on the fly velociraptor on the EAGLE_low_z/EAGLE_12 example in the Swift repository using the vrconfig_3dfof_subhalos_SO_hydro.cfg parameter file. My VR configuration is

cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CXX_FLAGS_RELEASE="-O3 -xAVX -g" \
    -DCMAKE_C_FLAGS_RELEASE="-O3 -xAVX -g" \
    -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
    -DCMAKE_C_COMPILER=icc \
    -DCMAKE_CXX_COMPILER=icpc \
    -DVR_USE_SWIFT_INTERFACE=ON \
    -DVR_USE_HYDRO=ON

and for Swift I use

../configure \
    --enable-ipo \
    --enable-debug \
    --with-hdf5 \
    --with-fftw \
    --with-parmetis \
    --with-gsl \
    --with-tbbmalloc \
    --with-hydro=sphenix --with-kernel=wendland-C2 --with-subgrid=EAGLE-XL \
    --with-velociraptor=`pwd`/../../VELOCIraptor-STF/build/src

and I run it with

mpirun -np 2 ../../../build/examples/swift_mpi \
    --cosmology --eagle --velociraptor \
    --threads=16 eagle_12.yml

This crashes on the first VR invocation in substructureproperties.cxx line 6024:

        x = Pval[i].GetHydroProperties();

At this point Pval[i].hydro is a null pointer, which I think is what causes the crash because GetHydroProperties just does "return *hydro".

Looking at swiftinterface.cxx, Part.hydro is only set if the parameter swift_gas_parts was passed to InvokeVelociraptorHydro() and was not null:

#ifdef GASON
    if (swift_gas_parts != NULL)
    {
        for (auto i=0; i<num_hydro_parts; i++)
        {
            index = swift_gas_parts[i].index;
            parts[index].SetHydroProperties(hydro);
        }
        free(swift_gas_parts);
    }
#endif

It looks like output of extra properties has been only partially implemented for on the fly runs. Things that are missing:

  • InvokeVelociraptorHydro() is never called with with swift_[gas/bh/star]_parts set in Swift or VR
  • There's no code in Swift's velociraptor_interface.c to generate the contents of swift_[gas/star/bh]_parts
  • VR's swiftinterface.cxx calls Part.SetHydroProperties() with an uninitialised HydroProperties instance so the data wouldn't get copied to the velociraptor particles anyway

If I comment out the extra properties in the .cfg file then it survives a VR invocation without crashing.

@rtobar
Copy link

rtobar commented Aug 12, 2020

@jchelly thanks for the report. I'm trying to reproduce the error locally first, and got all the software compiled, downloaded the necessary data, and started running. I had to slightly tweak the EAGLE_low_z/EAGLE_12/eagle_12.yml configuration file to get things running though, and I'm wondering whether I am missing some other configuration bit needed to reproduce the issue. Could you attach both swift's and vr's config files here for reference? Thanks a lot.

I also notice that, even after 20 minutes, both ranks are still using only 100% of CPU, even though I'm running with --threads=10 (and also added OMP_NUM_THREADS=10 with no effect). Is this expected? This is my first time running swift so I'm not sure what to expect. Logs seem to suggest that everything should be running as normal:

[...]
 Config. options: '--enable-ipo --enable-debug --with-hdf5 --with-fftw --with-parmetis=/home/rtobar/local/parmetis/hyrmine/4.0.3/ --with-gsl=/opt/bldr/local/numerics/gsl/2.5 --with-tbbmalloc --with-hydro=sphenix --with-kernel=wendland-C2 --with-subgrid=EAGLE-XL --with-velociraptor=/home/rtobar/scm/git/swiftsim/build/../../VELOCIraptor-STF/builds/native/src --enable-compiler-warnings=yes MPI_CC=/opt/bldr/local/distributed/openmpi4/4.0.1/bin/mpicc'

 Compiler: GCC, Version: 4.8.5
 CFLAGS  : '-g -O0  -flto -gdwarf-2 -fvar-tracking-assignments -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=core-avx2 -mavx2 -pthread -fopenmp -Wall -Wextra -Wno-unused-parameter -Wshadow -Wstrict-prototypes'

 HDF5 library version: 1.10.5
 FFTW library version: 3.x (details not available)
 GSL  library version: 2.5
 MPI library: Open MPI v4.0.1 (MPI std v3.1)
 ParMETIS library version: 4.0.3
[...]
[0000] [00023.4] main: Running on 6387423 gas particles, 0 sink particles, 256548 stars particles, 521 black hole particles and 6644672 DM particles (13289164 gravity particles)
[0000] [00023.4] main: from t=1.276e-02 until t=1.413e-02 with 2 ranks, 10 threads / rank and 10 task queues / rank (dt_min=1.000e-10, dt_max=1.000e-02)...

Having said all of that, I do agree with your reading of the problem. I'm not 100% familiar with all the places where VR sets these extra properties, but there's clearly something missing that needs to be addressed. I'm hoping that I can first reproduce this locally to then actually try to tackle the problem.

@jchelly
Copy link
Author

jchelly commented Aug 12, 2020

Sorry, I forgot that I had to add some entries to the swift .yml file to make it call velociraptor. Here are the swift config file I used eagle_12.yml.gz and my velociraptor configuration vrconfig_3dfof_subhalos_SO_hydro.cfg.gz.

Swift should run for about 250 timesteps then try to run Velociraptor. The output at that point looks like this:

     247   1.277447e-02    0.9099957    0.0989063   2.045724e-08   40   42          497          497            0            0               127.712      0
     248   1.277449e-02    0.9099970    0.0989047   2.045724e-08   40   40           18           18            0            0                91.229      0
     249   1.277451e-02    0.9099983    0.0989031   2.045724e-08   40   41           93           93            0            0                93.190      0
VELOCIraptor/STF running with MPI. Number of mpi threads: 2
VELOCIraptor/STF running with OpenMP. Number of openmp threads: 16
Copying particle data...
Copying particle data...
1 Finished copying particle data.
1 took 3.64526 to copy 6307816 particles from SWIFT to a local format. Out of 13289738

Swift uses pthreads for multithreading so OMP_NUM_THREADS will only affect the Velociraptor call. The number of threads to use in swift is set by the command line --threads option.

@rtobar
Copy link

rtobar commented Aug 12, 2020

@jchelly thanks for the configuration file, I'll give it a try.

Regarding threads, I also understood that swift uses pthreads in a pool, but I also saw a couple of omp pragmas scattered around the code so I decided to give it a try. At the end it was just a matter of mpirun not giving me access to all CPUs in my code by default, so had to be explicit about it (i.e., --cpus-per-rank 10).

@rtobar
Copy link

rtobar commented Aug 13, 2020

@jchelly OK, much better now, I managed to reproduce the issue in about 2 or 3 minutes, thanks a lot again for those files. The line number in my version of the code is slightly different, but I'm assuming we are seeing only slightly different versions of the code (I'm using c1283dd).

If I understand correctly, there are missing bits both in Swift and in VR. In VR it would be:

  • When swift_{gas,bh,star}_parts are not NULL, their data has to be copied to the HydroProperties of the corresponding particle. A the moment an empty HydroProperties is passed instead.
  • When swift_{gas,bh,star}_parts are NULL the code should behave properly. At the moment this was not considered, and results in a crash.

I have already a patch for the second point (see #14) which I tested locally and gets me past the original crash. There was a second place were no check was done before using these optional extra properties, so I added checks to those as well, which finally lets VR finish its pass and swift continue. I'll seek some input into whether the changes make sense from a scientific point of view, but at least purely from the code perspective I'd say they are correct.

We can address the first point separately. Is there a corresponding set of changes in swift that address this?

@jchelly
Copy link
Author

jchelly commented Aug 13, 2020

I couldn't find any likely looking branches in the Swift repository so I suspect the changes just haven't been made. I'll ask on slack.

@rtobar
Copy link

rtobar commented Sep 15, 2020

The immediate problem reported in this ticket (a crash caused by a nullptr dereference) has been fixed, so this issue can be closed. Properly copying non-NULL swift_{gas,bh,star}_parts into VR can be tackled in a different issue once swift is ready to do this.

@rtobar rtobar closed this as completed Sep 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants