Errors on large(r)-scale runs #10
Update: the same error occurs when using 64-bit integers for the indices.
You are running a problem just a bit larger than the largest we have ever run and are triggering a stop condition. I'm not enough of a physicist to know what is going on, so I'm going to have to find one if what I suspect is happening is the case. My hunch is that the issue is related to overall problem size, which is causing individual zones to be very small. You are running about 33 billion zones in total, and the largest we have run in the past is in the 4-8 billion range that I know of. The result is that you are barely triggering the `m_qstop` (`Real_t(1.0e+12)`) condition. To see if this is the problem, I would suggest increasing that value, which is set in LULESH_init. If this works, I'll go find out what triggering this means in detail. It could mean that the zones you are using are too small to be realistic, i.e. the approximations being made no longer hold. There are other ways around this, though, such as making the modeled geometry larger.

Note that LULESH uses relatively simple physics and does not use much memory per zone, so you can run a large number of zones. Production simulations use more memory per zone or have time-to-solution bottlenecks that typically result in using fewer zones. That said, there could be a good reason for pushing this further, and I'm interested in hearing what you are using it for.
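For readers unfamiliar with the stop condition being discussed, here is a minimal, self-contained sketch of that kind of check. It is an illustration only, not the actual LULESH source; the `Real_t`/`Index_t` typedefs and the per-zone `q` values are assumptions. The idea is that if any zone's artificial viscosity exceeds the `qstop` threshold the run aborts, and raising that threshold is the workaround suggested above.

```cpp
// Illustrative sketch only (not the LULESH source): abort the run if any
// zone's artificial viscosity q exceeds the qstop threshold set at init time.
#include <cstdio>
#include <cstdlib>
#include <vector>

typedef double Real_t;   // LULESH-style typedefs (assumed)
typedef int    Index_t;

int main() {
   const Real_t qstop = Real_t(1.0e+12);                 // stop threshold (default mentioned above)
   std::vector<Real_t> q = {1.0e+3, 5.0e+11, 2.0e+12};   // hypothetical per-zone q values

   for (Index_t i = 0; i < (Index_t)q.size(); ++i) {
      if (q[i] > qstop) {
         std::fprintf(stderr, "q stop condition triggered in zone %d\n", i);
         std::exit(EXIT_FAILURE);                        // the real code would abort the MPI job here
      }
   }
   std::puts("no zone exceeded qstop");
   return 0;
}
```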
@ikarlin Thanks for the clarifications. I'm not a physicist either, so I didn't think about the way problem scaling is done in Lulesh. In fact, since the problem size definition is tailored towards weak scaling (controlling the number of elements per process), I was assuming that Lulesh would just do the right thing. Is there a way to scale the modeled geometry while scaling the problem size? What are realistic configurations of Lulesh to scale to 1k or 2k processes?

I am porting Lulesh to a task-based programming model (the paper is currently under review) and wanted to use it to demonstrate scalability with a well-established proxy app. Larger problem sizes mean more work per task, which is better suited to hide the task management overhead (and exploit effects such as cache reuse). Another reason was that the memory requirement of smaller problem sizes leaves a large part of the node memory unused.
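As a quick sanity check on the sizes discussed above, assuming the usual LULESH convention that `-s` gives the number of elements per edge of each rank's sub-domain, the total zone count for this run works out as follows (standalone sketch, not part of Lulesh):

```cpp
// Back-of-the-envelope zone count: s = 400 elements per edge per rank,
// weak scaled over 512 MPI ranks.
#include <cstdint>
#include <cstdio>

int main() {
   const std::uint64_t s = 400, ranks = 512;
   const std::uint64_t zonesPerRank = s * s * s;            // 64,000,000
   const std::uint64_t totalZones   = zonesPerRank * ranks; // ~3.3e10, i.e. ~33 billion
   std::printf("zones/rank = %llu, total zones = %llu\n",
               (unsigned long long)zonesPerRank,
               (unsigned long long)totalZones);
   return 0;
}
```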
@devreal You're welcome. There are a few possible solutions to getting what you want done.

First, the suggested problem size for LULESH is 1,000 - 100,000 elements per GB of memory. At the high end, for 2K processes this would be about 26 billion elements, so this might just work. This represents a "simple physics" example, but will leave a significant amount of memory empty. Since LULESH is even simpler than simple real-world physics simulations, it uses less memory per zone, which is why you were seeing significant memory left over with S=200. At the low end of zones this is representative of multi-physics simulations, where another physics package might use most of the memory.

As an aside, I'm confident this is not a number-of-processes issue, as we have run LULESH on millions of processes, but those runs were much smaller in terms of overall zone count; that is why I think zone size is the issue.

That said, if you need to run larger problems in terms of number of elements, you can either try changing qstop, or in `Domain::BuildMesh` you can make the mesh larger by changing the 1.125 parameters to a larger number (see the sketch below). The first change will work around the problem you are having, but might break the code due to roundoff issues or other things. The second change will make the domain larger and each cell larger. This should solve the problem and should not break the code. The shock wave will still stop at 1, but that should be OK, and at that number of zones it is not likely you would run long enough to get there anyway.

You are right that the code is set up to weak scale and should weak scale very well. Let me know what you decide on and if I can help out more. I'm also interested in your paper when you get it published.
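A minimal sketch of the second suggestion, assuming the 1.125 value acts as the overall mesh extent used when `Domain::BuildMesh` places node coordinates; the helper function and its parameters here are hypothetical, for illustration only, not the verbatim LULESH code:

```cpp
// Illustrative sketch: a mesh-extent constant used when spacing node
// coordinates across the global mesh. Enlarging it makes the whole modeled
// geometry -- and therefore each zone -- bigger.
#include <cstdio>
#include <vector>

typedef double Real_t;
typedef int    Index_t;

// Hypothetical helper: fill 1D node coordinates for one rank's sub-domain.
void buildEdgeCoords(std::vector<Real_t>& coord, Index_t edgeNodes,
                     Index_t rankOffsetElems, Index_t meshEdgeElems)
{
   const Real_t extent = Real_t(1.125);   // enlarge this (e.g. to 2.25) to grow the geometry
   coord.resize(edgeNodes);
   for (Index_t i = 0; i < edgeNodes; ++i) {
      coord[i] = extent * Real_t(rankOffsetElems + i) / Real_t(meshEdgeElems);
   }
}

int main() {
   std::vector<Real_t> x;
   buildEdgeCoords(x, 5, 0, 4);           // tiny example: 5 nodes spanning 4 elements
   for (Real_t v : x) std::printf("%g ", v);
   std::printf("\n");
   return 0;
}
```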
@ikarlin Sorry for the delay, it took me a while to run the necessary test jobs. I tried increasing that magic number in LULESH_init and the error no longer occurs. For now I will run with this and investigate further how I can cover smaller problem sizes as well; that certainly seems relevant for fully covering Lulesh. An application with more complex physics (and thus larger tasks in my case) may be an easier target. I will send you the paper once it gets accepted. Thanks a lot for your help so far :)
@devreal Sorry for the slow reply. I will close this issue; if something new pops up, just let me know.
I am trying to run Lulesh on larger scales on a Cray XC40 (2x12C Haswell, one process per node, 24 OpenMP threads) using the Intel 18.0.1 compiler, but run into the following error at `s=400` on >=512 processes:

The `ERROR` line was added by me to track down where the abort happens and why. I tried different combinations of Cray MPICH and Open MPI (3.1.2) and compiling with `-O2` or `-O3`, all showing the same behavior. Interestingly, smaller runs (such as with `s=400` and 256 processes) succeed, as do runs with `s=300` at 512 processes. The error occurs both in the latest git commit and the 2.0.3 release.

Any idea what might be causing this problem? I don't think that I am running into an integer overflow (see #7), but I might be wrong (`8*(400**3) = 512000000` is still well within the bounds of 32-bit integers). I will try to run with 64-bit integers nevertheless, just to make sure (it just takes some time on that machine). In general, are such larger runs supported by Lulesh?
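For reference, a standalone check of the index arithmetic mentioned above (assuming the default 32-bit index type; this is just a sanity-check snippet, not Lulesh code):

```cpp
// Verify that 8 * 400^3 (the product mentioned above) still fits in a
// signed 32-bit integer, i.e. no overflow at s=400 within one domain.
#include <cstdint>
#include <cstdio>
#include <limits>

int main() {
   const std::int64_t s = 400;
   const std::int64_t value = 8 * s * s * s;   // 512,000,000
   const std::int64_t int32Max = std::numeric_limits<std::int32_t>::max();
   std::printf("8*s^3 = %lld, INT32_MAX = %lld, fits in 32-bit: %s\n",
               (long long)value, (long long)int32Max,
               value <= int32Max ? "yes" : "no");
   return 0;
}
```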