HPX performance degrades with time since execution begins #1753
Comments
How do I run the code with a different number of timesteps?
The code executes forever until it crashes. Do you want to set a maximum number? The main loop is in main.cpp. Have you been able to execute it at all? You can change the maximum refinement level with MAX_LEVEL in defs.hpp.
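If you just want to stop after a fixed number of steps, a minimal sketch of what a bounded main loop could look like (the names and loop structure here are illustrative, not octotiger's actual code):

```cpp
// Hypothetical sketch of bounding the evolution loop in main.cpp;
// names here are illustrative, not octotiger's actual code.
#include <iostream>

void advance_one_timestep(int step)  // stand-in for the real per-step work
{
    std::cout << "step " << step << "\n";
}

int main()
{
    int const max_steps = 1000;  // stop after this many timesteps
                                 // instead of running until a crash
    for (int step = 0; step < max_steps; ++step)
        advance_one_timestep(step);
}
```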
On 22.09.2015 at 5:14 PM, "Neutrino" notifications@github.com wrote:
Yes, I'm able to build and run the code. I was just wondering how you …
@dmarce1, how do I get output showing how long a single timestep took?
There should be a line of output for every timestep showing: step number, current simulation time, simulation timestep size, and wall clock time for the timestep. These same numbers will be in "step.dat".
I was able to successfully run your code with #1772 now. There were a bunch of problems in your code as well; I'll open a PR ASAP (see #1752 for more information). What you can see is that I can't reproduce your regression in the wall clock time of the timesteps. However, plotting dt gives a graph similar to the one you showed:
I think if you replot the wall clock time with the yrange set to something like [1.5:2.0], you'll get a picture similar to the one I got.
We're currently investigating this issue in another application, and we think we have some hints on what's causing it. I hope to have more information available soon.
This should be fixed once #1775 is merged. |
@dmarce1 Can this be closed now? From what I understand these issues should have been solved by now. |
I'm closing this now; please reopen if appropriate.
I am still getting performance degradation over time. Here is a plot: http://s17.postimg.org/6h8qkbvn3/performance.png This is with 64 nodes on SuperMIC, using a version of octotiger that doesn't rely on serialization of std::array: dmarce1/octotiger@7214efe
Here's a longer plot: The little spikes aren't a problem; those are steps where octotiger outputs to the file system. The wall clock time for non-output steps increases linearly to a point and then levels off. There are small changes in the total load over the run as AMR refinement grids are created and destroyed, but this is less than ±10% of the starting load, not nearly enough to account for this. Could this be a problem in octo-tiger itself, and if so, what would it look like? On a positive note, this run represents the longest I have ever run octo-tiger using distributed HPX without a crash or freeze-up.
Is this with a release build? |
This is with a debug build. Each data point is one timestep, with the y-axis being the wall clock time for that timestep in seconds and the x-axis being the number of the timestep since the start of the evolution. It's as if some resource is not being freed.
Sorry, I meant the output in octotiger...
My bad: wall clock time is the 4th column in step.dat, and the timestep number is the 1st column. Whatever is causing the issue doesn't appear to be taking up much memory. I started the run over and I've been tracking how much memory is used; it hasn't changed much, while the wall clock time per step has already gone up to ~10x its original value.
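For anyone reproducing the plots, a minimal parsing sketch (assuming whitespace-separated columns in step.dat, with the column meanings described above):

```cpp
// Sketch: extract timestep number (column 1) and wall clock time (column 4)
// from step.dat, emitting "step wallclock" pairs suitable for gnuplot.
// Assumes whitespace-separated columns as described in this thread.
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream in("step.dat");
    long step;
    double sim_time, dt, wallclock;
    while (in >> step >> sim_time >> dt >> wallclock)
        std::cout << step << ' ' << wallclock << '\n';
}
```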
Frankly, I wouldn't draw any conclusions from performance measurements done in debug builds.
I am running in release mode now and still getting the same problem.
Update: I have been looking at performance counters and how they change with execution time for one timestep. I am resetting the counters with each time-step. This is what I have found:

- Counters with a normalized increase over time similar to the normalized increase over time for wall clock per time-step: /threads/count/cumulative-phases
- Counters that increase over time, but at a slower rate
- Counters that do not change much over time
- Counters that increase a little bit at first, then come back down, then go back up again

I have not tested any other counters. Can anyone suggest which might be the most useful?
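A minimal sketch of this kind of per-step sampling, assuming HPX's performance_counter client API (not octotiger's actual code; the reset-on-read overload and counter name syntax are as I understand them and may differ between HPX versions):

```cpp
// Sketch: read a counter once per timestep and reset it, so each sample
// covers only that step's work. Assumes the performance_counter client
// class and its get_value(hpx::launch::sync, reset) overload.
#include <hpx/hpx.hpp>
#include <hpx/include/performance_counters.hpp>
#include <cstdint>
#include <iostream>

void sample_phases_per_step(long step)
{
    hpx::performance_counters::performance_counter phases(
        "/threads{locality#0/total}/count/cumulative-phases");

    std::int64_t const value =
        phases.get_value<std::int64_t>(hpx::launch::sync, /*reset=*/true);
    std::cout << "step " << step << ": cumulative-phases = " << value << '\n';
}
```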
When using the plot script I linked above, there wouldn't be any need to reset counters (resetting might be tricky, as it is a 'global' operation which does not take computational phases into account). For other counters, I'd suggest looking at the action invocation counts.
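For example (counter names as I understand them from the HPX documentation; exact syntax may differ between versions):

```
/runtime{locality#*/total}/count/action-invocation@<action-name>
/runtime{locality#*/total}/count/remote-action-invocation@<action-name>
```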
These count all (local and remote) action invocations individually; note that remote action invocations are counted as local as well (on the locality where they are run). Also, can you provide a link to those graphs you described above? The increase in the per-step increments of those counters makes sense, as all of them are related to thread execution. Everything hints at an overall increase of some operation counts; the action counters above might give us a better (higher-level) insight.
I have found evidence of the culprit. When I turn off AGAS caching, the initial performance is of course a little worse, but it does not degrade with time. BTW, I had some trouble getting the Python script to run at first, so above I was just looking at the output CSV with gnuplot. I'm able to get the script to run partially now, except that for some reason it does not print all of the counters that are output.
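For reference, a sketch of how AGAS caching can be turned off, assuming the hpx.agas.use_caching configuration key and the hpx::init overload taking a configuration vector (both may differ between HPX versions; the same setting can also be passed on the command line via --hpx:ini):

```cpp
// Sketch: disable the AGAS address cache via an HPX ini setting at startup.
// Assumes the hpx.agas.use_caching key; names are as I understand them
// and may vary between HPX versions.
#include <hpx/hpx_init.hpp>
#include <string>
#include <vector>

int hpx_main()
{
    // ... application work would go here ...
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    std::vector<std::string> const cfg = {
        "hpx.agas.use_caching=0"  // slower lookups, but no cache to grow
    };
    return hpx::init(argc, argv, cfg);
}
```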
The first plot makes total sense, as the counter …
Yes, I am resetting the counters every time-step. The plot should be roughly horizontal in that case, no?
Could you rerun without resetting counters, please? I'd like to exclude the possibility of problems related to that. |
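For instance, sampling a counter at a fixed interval without resetting it, using HPX's counter command-line options (flag spellings as I recall them; the interval is in milliseconds):

```
./octotiger --hpx:print-counter=/threads{locality#*/total}/count/cumulative-phases \
            --hpx:print-counter-interval=1000
```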
I think I know what's going on. @dmarce1 How many overall localities/cores were you running on for the latest graph above? |
64 nodes, 20 cores per node, 1 locality per node, 20 threads per locality.
Is someone working on this? |
Yes, I think I have a solution, working on it. |
@dmarce1 reported today that he believes the slowdown has been fixed.
In the runs where octotiger runs for a bit before crashing or locking up, I am noticing that the performance gets worse as time progresses.
Example: http://s7.postimg.org/5x1kah1wb/degrade.png
This seems to be the case for every run I have looked at, each with 100 or 128 processors on SuperMIC. There is no computational reason this should be happening: the number of floating point operations required for each time-step is nearly constant.