Errors on large(r)-scale runs #10

Closed
devreal opened this issue Dec 27, 2018 · 6 comments

Comments

@devreal
Contributor

devreal commented Dec 27, 2018

I am trying to run LULESH at larger scale on a Cray XC40 (2x12C Haswell, one process per node, 24 OpenMP threads) using the Intel 18.0.1 compiler, but I run into the following error at s=400 on >=512 processes:

mpirun -n 512 -N 1 -bind-to none /zhome/academic/HLRS/hlrs/hpcjschu/src/LULESH-2.0.3/lulesh_mpi -s 400 -i 100 -p -b 0

Num threads: 24
Total number of elements: 32768000000

To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options

cycle = 1, time = 3.298540e-11, dt=3.298540e-11
cycle = 2, time = 7.256788e-11, dt=3.958248e-11
cycle = 3, time = 8.613524e-11, dt=1.356736e-11
cycle = 4, time = 9.746505e-11, dt=1.132980e-11
cycle = 5, time = 1.075651e-10, dt=1.010008e-11
cycle = 6, time = 1.169435e-10, dt=9.378386e-12
cycle = 7, time = 1.258599e-10, dt=8.916427e-12
ERROR: domain.q(1) = 1026443320729.883911
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -2.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

The ERROR line was added by me to track down where the abort happens and why. I tried different combinations of Cray MPICH and Open MPI (3.1.2) and compiling with -O2 or -O3, all showing the same behavior. Interestingly, smaller runs succeed: s=400 on 256 processes works, as do runs with s=300 on 512 processes. The error occurs both with the latest git commit and with the 2.0.3 release. Any idea what might be causing this problem? I don't think I am running into an integer overflow (see #7), but I might be wrong (8*(400**3) = 512000000 is still well within the bounds of a 32-bit integer).
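
For reference, error code -2 corresponds to QStopError, and the check that fires is the artificial-viscosity stop test in CalcQForElems. The diagnostic print I added looks roughly like this (paraphrased sketch, not the exact patch):

   // existing q-stop check in CalcQForElems, with a print added before the abort
   Index_t idx = -1;
   for (Index_t i = 0; i < numElem; ++i) {
      if (domain.q(i) > domain.qstop()) {
         idx = i;
         break;
      }
   }
   if (idx >= 0) {
      fprintf(stderr, "ERROR: domain.q(%d) = %f\n", idx, domain.q(idx)); // added diagnostic
   #if USE_MPI
      MPI_Abort(MPI_COMM_WORLD, QStopError); // QStopError == -2
   #else
      exit(QStopError);
   #endif
   }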

I will try to run with 64-bit integers nevertheless, just to make sure (it just takes some time on that machine). In general, are such larger runs supported by LULESH?

@devreal
Contributor Author

devreal commented Jan 2, 2019

Update: the same error occurs when using 64-bit integers for Index_t and Int_t instead of the default 32-bit integers. Any other ideas on what could be going wrong here?
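
For completeness, the switch amounts to widening the integer typedefs in lulesh.h, roughly like this (a sketch of what I tested, not a verified patch):

   #include <cstdint>

   // lulesh.h: widen the default 32-bit integer types to 64 bit
   typedef int64_t Index_t ;   // array subscript and loop index (default: int)
   typedef int64_t Int_t ;     // integer representation (default: int)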

@ikarlin
Collaborator

ikarlin commented Jan 8, 2019

You are running a problem just a bit larger than the largest we have ever run, and you are triggering a stop condition. I'm not enough of a physicist to know what is going on, so I will have to find one if what I suspect is happening is the case.

My hunch is that the issue is related to the overall problem size, which makes the individual zones very small. You are running about 33 billion zones in total, and the largest we have run in the past that I know of is in the 4-8 billion range. The result is that you are just barely triggering the m_qstop(Real_t(1.0e+12)) condition.

To see if this is the problem, I would suggest increasing that value, which is set in LULESH_init. If this works, I'll go find out in detail what triggering this condition means. It could mean that the zones you are using are too small to be realistic, i.e. the approximations being made no longer hold. There are other ways around this, though, such as making the modeled geometry larger.
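
For reference, the value lives in the Domain constructor's initializer list (lulesh-init.cc); raising it would look roughly like this sketch:

   // lulesh-init.cc, Domain::Domain(...) initializer list (sketch)
   m_qstop(Real_t(1.0e+12)),   // default artificial-viscosity stop threshold;
                               // try e.g. Real_t(1.0e+14) to relax the stop condition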

Note that LULESH uses relatively simple physics and does not use much memory per zone, so you can run a large number of zones. Production simulations use more memory per zone or have time-to-solution bottlenecks that typically result in using fewer zones. That said, there could be a good reason for pushing this further, and I'm interested in hearing what you are using it for.

@devreal
Contributor Author

devreal commented Jan 8, 2019

@ikarlin Thanks for the clarifications. I'm not a physicist either, so I didn't think about the way problem scaling is done in LULESH. In fact, since the problem size definition is tailored towards weak scaling (controlling the number of elements per process), I was assuming that LULESH would just do the right thing. Is there a way to scale the modeled geometry along with the problem size? What are realistic configurations of LULESH to scale to 1k or 2k processes?

I am porting LULESH to a task-based programming model (the paper is currently under review) and wanted to use it to demonstrate scalability with a well-established proxy app. Larger problem sizes mean more work per task, which helps hide the task-management overhead (and exploit effects such as cache reuse). Another reason is that the memory requirement of s=200 is far below the capabilities of contemporary machines, so scaling the per-node problem size seemed natural, too. However, if the scales I chose are simply not realistic or representative, I will go back and see how I can deal with smaller scales. The larger scales were simply an easier pick :)

@ikarlin
Collaborator

ikarlin commented Jan 9, 2019

@devreal You're welcome. There are a few possible ways to get what you want done. First, the suggested problem size for LULESH is 1,000-100,000 elements per GB of memory. At the high end, for 2K processes this would be about 26 billion elements, so this might just work. That represents a "simple physics" example, but it will leave a significant amount of memory empty. Since LULESH is even simpler than simple real-world physics simulations, it uses less memory per zone, which is why you were seeing significant memory left over with s=200. The low end of that range is representative of multi-physics simulations, where another physics package might use most of the memory.
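
As a rough back-of-the-envelope check (assuming about 128 GB of memory per node, which you didn't state): 100,000 elements/GB x 128 GB/node x 2,048 nodes is roughly 2.6e10, i.e. about 26 billion elements.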

As an aside, I'm confident this is not a number-of-processes issue, as we have run LULESH on millions of processes; those runs were much smaller in terms of overall zone count, though, which is why I think zone size is the issue.

That said, if you need to run larger problems in terms of number of elements, you can either try changing the qstop, or you can make the mesh larger in Domain::BuildMesh by changing the 1.125 parameters to a larger number. The first change will work around the problem you are having, but it might break the code due to roundoff issues or other things. The second change will make the domain larger and each cell larger; this should solve the problem and should not break the code. The shock wave will still stop at 1, but that should be OK, and at that number of zones it is unlikely you would run long enough to get there anyway.
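
A sketch of the second option (paraphrasing Domain::BuildMesh in lulesh-init.cc; in the actual code the factor appears in several places and the coordinates are updated inside nested loops):

   // Domain::BuildMesh (sketch): the 1.125 literals set the physical edge length
   // of the global mesh; replacing them with a larger value enlarges the domain
   // and therefore every cell.
   const Real_t meshScale = Real_t(1.125) ;   // e.g. Real_t(2.25) doubles the edge length

   Real_t tz = meshScale * Real_t(m_planeLoc*nx) / Real_t(meshEdgeElems) ;
   Real_t ty = meshScale * Real_t(m_rowLoc*nx)   / Real_t(meshEdgeElems) ;
   Real_t tx = meshScale * Real_t(m_colLoc*nx)   / Real_t(meshEdgeElems) ;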

You are right, the code is set up to weak scale and should weak scale very well. Let me know what you decide on and whether I can help out more. I'm also interested in your paper once you get it published.

@devreal
Contributor Author

devreal commented Feb 7, 2019

@ikarlin Sorry for the delay; it took me a while to run the necessary test jobs. I tried increasing that magic number in Domain::BuildMesh (tested up to 100), but that did not yield any success. Increasing qstop to 1e+14 allowed me to complete a 512-node run with s=400, and I expect it will get me to 1k nodes. I did not see any other errors in that run, but I cannot judge the correctness of the physics (in case the larger qstop impairs it).

For now I will run with this and investigate further how I can cover smaller problem sizes. It certainly seems relevant for fully covering LULESH. An application with more complex physics (and thus larger tasks in my case) may be an easier target. I will send you the paper once it gets accepted. Thanks a lot for your help so far :)

@ikarlin
Collaborator

ikarlin commented Mar 27, 2019

@devreal Sorry for the slow reply. I will close this issue; if something new pops up, just let me know.

ikarlin closed this as completed Mar 27, 2019