HPX performance degrades with time since execution begins #1753

Closed
dmarce1 opened this issue Sep 17, 2015 · 34 comments

@dmarce1
Member

dmarce1 commented Sep 17, 2015

In the runs where octotiger runs for a bit before crashing or locking up, I am noticing that the performance gets worse as time progresses.

Example: http://s7.postimg.org/5x1kah1wb/degrade.png

This seems to be the case for every run I have looked at, each with 100 or 128 processors on SuperMIC. There is no computational reason this should be happening - the number of floating point operations required for each time-step is nearly constant.

@sithhell
Member

How do I run the code with a different number of timesteps?

@neutrino504

The code executes forever until it crashes. Do you want to set a maximum number? The main loop is in main.cpp. Have you been able to execute it at all? You can change the maximum refinement level with MAX_LEVEL in defs.hpp

@sithhell
Member

On Sep 22, 2015, at 5:14 PM, "Neutrino" notifications@github.com wrote:

> The code executes forever until it crashes. Do you want to set a maximum
> number? The main loop is in main.cpp. Have you been able to execute it at
> all? You can change the maximum refinement level with MAX_LEVEL in defs.hpp

Yes, I'm able to build and run the code. I was just wondering how you obtained the numbers for the different time steps.

@sithhell
Member

@dmarce1, how do I get the output showing how long a single timestep took?

@dmarce1
Member Author

dmarce1 commented Sep 24, 2015

There should be a line of output for every timestep showing

step number - current simulation time - simulation timestep size - wall clock time for the timestep

The same numbers are written to "step.dat".
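A quick way to visualize the degradation is to plot the wall clock time (fourth column) against the step number (first column). A minimal sketch using gnuplot, assuming step.dat is whitespace-delimited in the column order listed above:

    # plot per-step wall clock time (column 4) against step number (column 1)
    gnuplot -p -e "set xlabel 'timestep'; set ylabel 'wall clock [s]'; plot 'step.dat' using 1:4 with lines title 'wall clock per step'"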

@sithhell
Member

I was able to successfully run your code with #1772 now. There were a bunch of problems in your code as well. I'll open a PR ASAP (see #1752 for more information). What you can see is that I can't reproduce your regression in the wall clock time of the timesteps:

[plot: wall clock time per step]

However, plotting dt gives a similar graph to what you showed:

[plot: dt]

@dmarce1
Member Author

dmarce1 commented Sep 29, 2015

I think if you replot the wall clock time plot with the yrange set to something like [1.5:2.0] you'll get a similar picture to what I got.
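For example, with gnuplot (assuming the step.dat layout described above, with the wall clock time in column 4):

    gnuplot -p -e "set yrange [1.5:2.0]; plot 'step.dat' using 1:4 with lines title 'wall clock per step'"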

@sithhell
Member

You are most certainly correct; the y-axis scale gave a misleading impression. So we see a slowdown of a factor of roughly 1.4. To give a better impression, I plotted the relative slowdown (omitting the peaks where the checkpoint data is being written):

[plot: relative slowdown]

So we are getting down to about 70% of the initial performance... Does this have to be caused by HPX, or could it also be caused by the algorithm used (work imbalances, etc.)?
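Such a relative-slowdown curve can be produced by normalizing each step's wall clock time by that of an early step. A minimal sketch, assuming the whitespace-delimited step.dat layout from above (the checkpoint peaks would still have to be filtered out separately):

    # print step number and wall clock time relative to the first step
    awk 'NR == 1 { base = $4 } { print $1, $4 / base }' step.dat > relative.dat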

@hkaiser
Member

hkaiser commented Sep 29, 2015

We're currently investigating this issue in another application and we think we have some hints about what's causing it. I hope to have more information available soon.

@hkaiser
Member

hkaiser commented Oct 1, 2015

This should be fixed once #1775 is merged.

@hkaiser
Member

hkaiser commented Oct 24, 2015

@dmarce1 #1811 fixed another set of problems causing slowdowns. Please report back whether this issue can be closed now.

@hkaiser
Member

hkaiser commented Nov 5, 2015

@dmarce1 Can this be closed now? From what I understand these issues should have been solved by now.

@hkaiser
Member

hkaiser commented Nov 7, 2015

I'm closing this now; please reopen if appropriate.

@dmarce1
Member Author

dmarce1 commented Dec 21, 2015

I am still getting performance degradation over time. Here is a plot:

http://s17.postimg.org/6h8qkbvn3/performance.png

This is with 64 nodes on SuperMIC, using a version of octotiger that doesn't rely on serialization of std::array: dmarce1/octotiger@7214efe

@dmarce1
Member Author

dmarce1 commented Dec 22, 2015

Here's a longer plot:

[plot: wall clock time per step over the longer run]

The little spikes aren't a problem; those are steps where octotiger writes output to the file system. The wall clock time for non-output steps increases linearly up to a point and then levels off. There are small changes in the total load over the run as AMR refinement grids are created and destroyed, but this is less than +/- 10% of the starting load - not nearly enough to account for this.

Could this be a problem in octo-tiger itself, and if so, what would it look like?

On a positive note, this run represents the longest I have ever run octo-tiger using distributed HPX without a crash or freeze-up.

@sithhell
Member

Is this with a release build?
Which column represents the wall clock time of a step?

@dmarce1
Member Author

dmarce1 commented Dec 22, 2015

This is with a debug build. Each data point is one timestep, with the y-axis being the wall clock time for that timestep in seconds, and the x-axis being the number of the timestep since the start of the evolution.

It's as if there is some resource not being freed.

@sithhell
Member

> This is with a debug build. Each data point is one timestep, with the y-axis being the wall clock time for that timestep in seconds, and the x-axis being the number of the timestep since the start of the evolution.

Sorry, I meant the output in octotiger...

@dmarce1
Member Author

dmarce1 commented Dec 22, 2015

My bad - wall clock time is the 4th column in step.dat; the timestep number is the 1st column.

Whatever is causing the issue doesn't appear to be taking up much memory. I started the run over and have been tracking how much memory is used; it hasn't changed much, while the wall clock time per step has already gone up to roughly 10x its original value.

@hkaiser
Member

hkaiser commented Dec 22, 2015

Frankly, I wouldn't draw any conclusions from performance measurements done in debug-builds.
HPX is doing a lot of book-keeping and other additional things when compiled in debug-mode which might cause the slowdown you're seeing. Could you please verify whether the slowdown is still happening when run in release-mode?

@dmarce1
Member Author

dmarce1 commented Dec 22, 2015

I am running in release mode now and still getting the same problem.

@dmarce1
Member Author

dmarce1 commented Jan 4, 2016

Update:

I have been looking at performance counters and how their per-timestep values change as the run progresses; I am resetting the counters at each time-step. This is what I have found:

Counters with a normalized increase over time similar to the normalized increase over time for wall clock per time-step:

/threads/count/cumulative-phases
/threads/count/pending-accesses
/threads/count/pending-misses

Counters that increase over time, but at a slower rate:
/threads/count/cumulative
/threads/count/stack-recycles
/threads/count/stack-unbinds
/threads/count/stolen-from-pending

Counters that do not change much over time:
/threads/count/instantaneous/
/threads/count/objects

Counters that increase a little bit at first, then come back down, then go back up again:
/threads/count/stolen-from-staged
/threads/count/stolen-to-staged

I have not tested any other counters. Can anyone suggest which might be the most useful?
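For reference, this is roughly how per-interval samples of these counters can be collected from the command line. A minimal sketch, assuming a recent HPX (the exact option names may differ between versions) and a hypothetical output file counters.csv:

    # sample the listed counters every 1000 ms and append the values to counters.csv
    ./octotiger <application arguments> --hpx:threads=20 \
        --hpx:print-counter=/threads/count/cumulative-phases \
        --hpx:print-counter=/threads/count/pending-accesses \
        --hpx:print-counter=/threads/count/pending-misses \
        --hpx:print-counter-interval=1000 \
        --hpx:print-counter-destination=counters.csv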

@hkaiser
Member

hkaiser commented Jan 4, 2016

When using the plot script I linked above there wouldn't be any need to reset counters (resetting might be tricky, as it is a 'global' operation which does not take computational phases into account).

As for other counters you might want to look at, I'd suggest the action invocation counts:

/runtime/count/action_invocation@*
/runtime/count/remote_action_invocation@*

which count all (local and remote) action invocations individually. Note that remote action invocations are counted as local as well (on the locality where they are run).

Also, can you provide a link to those graphs you described above? The increases for those counters make sense, as all of them are related to thread execution. Everything hints at an overall increase of some operation counts. The action counters above might give us a better (higher-level) insight.
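Building on the hypothetical command line sketched earlier, these counters could be requested like this (the wildcard should be quoted so the shell does not expand it):

    ./octotiger <application arguments> \
        --hpx:print-counter='/runtime/count/action_invocation@*' \
        --hpx:print-counter='/runtime/count/remote_action_invocation@*' \
        --hpx:print-counter-interval=1000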

@dmarce1
Member Author

dmarce1 commented Jan 5, 2016

I have found evidence of the culprit. When I turn off AGAS caching, the initial performance is of course a little worse - but it does not degrade with time.

BTW - I had some trouble getting the python script to run at first, so above I was just looking at the output csv with gnuplot. I'm able to get the script to run partially now, except for some reason it does not print all of the counters that are output.
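For reference, AGAS caching can be toggled through the runtime configuration. A minimal sketch, assuming the hpx.agas.use_caching configuration key (the exact key may differ between HPX versions):

    # disable the local AGAS address cache for this run
    ./octotiger <application arguments> --hpx:ini=hpx.agas.use_caching=0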

@dmarce1
Member Author

dmarce1 commented Jan 5, 2016

Attached is a zip file with results from the agas cache performance counters when agas caching is turned "on".

plots.zip

Here is what appears to me to be the only plot of note:

[plot: /agas/time/cache/get_entry]

The relative increase in time it takes for "get_entry" to run is roughly proportional to the increase in wall clock per timestep over that same time period.
[plot: wall clock time per step]

@hkaiser
Member

hkaiser commented Jan 6, 2016

The first plot makes total sense, as the counter /agas/time/cache/get_entry is cumulative; I'd expect it to go up linearly. Were you resetting counters while gathering that data?
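Since the counter is cumulative, per-step costs can also be recovered offline by differencing consecutive samples instead of resetting. A minimal sketch, assuming the hypothetical counters.csv from above with the counter value in column 5 (adjust to the actual layout):

    # print the sample index and the increase of the cumulative counter since the previous sample
    awk -F, '{ if (prev != "") print NR, $5 - prev; prev = $5 }' counters.csv > get_entry_diff.dat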

@dmarce1
Member Author

dmarce1 commented Jan 6, 2016

Yes - I am resetting the counters every time-step. The plot should be roughly horizontal in that case, no?

@hkaiser
Member

hkaiser commented Jan 6, 2016

Could you rerun without resetting counters, please? I'd like to exclude the possibility of problems related to that.

@dmarce1
Member Author

dmarce1 commented Jan 7, 2016

Here is the re-run without resetting. Note I am using 'large' stacks, and output is turned on (it was turned off above); this is why there are spikes in the 'diff' plot.

[plot: per-step differences of /agas/time/cache/get_entry]
[plot: /agas/time/cache/get_entry, cumulative]

@hkaiser
Member

hkaiser commented Jan 9, 2016

I think I know what's going on. @dmarce1 How many overall localities/cores were you running on for the latest graph above?

@dmarce1
Member Author

dmarce1 commented Jan 11, 2016

64 nodes, 20 cores per node, 1 locality per node, 20 threads per locality.
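For context, this is one way that configuration could be expressed at launch time; a hypothetical sketch, assuming an MPI-style launcher (the actual launcher and flags depend on the MPI stack and batch system on SuperMIC):

    # 64 localities, one per node, 20 worker threads per locality
    mpirun -np 64 -ppn 1 ./octotiger --hpx:threads=20 <application arguments>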

@dmarce1
Member Author

dmarce1 commented Jan 22, 2016

Is someone working on this?

@hkaiser
Member

hkaiser commented Jan 23, 2016

Yes, I think I have a solution, working on it.

@hkaiser
Member

hkaiser commented Feb 22, 2016

@dmarce1 reported today that he believes the slowdown has been fixed.

hkaiser closed this as completed on Feb 22, 2016