
Concurrent memory allocation is slowed by memTrack #929

Closed
ronawho opened this issue Sep 27, 2021 · 6 comments
Labels
performance Performance needs improving


ronawho commented Sep 27, 2021

Discussed a little in #154, but --memTrack can seriously hurt the performance of concurrent allocations. Historically, most of Arkouda's operations allocate a few large arrays and relatively little beyond that. However, there were a few times recently where allocations were really hurting performance. The allocations were short-lived so a performance hit isn't surprising, but I was surprised by the magnitude of the hit, and I'm only now putting together that a lot of the overhead is from --memTrack serializing concurrent allocations.

Here's a concurrent allocation micro-benchmark that demonstrates the overhead. Results are from a 128-core AMD Rome CPU:

```chapel
use Time;
config const trials = 1_000_000;

var t: Timer; t.start();
coforall 1..here.maxTaskPar do
  for i in 1..trials do
    var s = i:string;
writeln(t.elapsed());
```
| config | Time |
| --- | --- |
| w/o memTrack | 0.19s |
| w/ memTrack | 144.50s |

In most cases we'd like to optimize out as much allocation as possible (#266, #917), but in other cases getting rid of the concurrent allocation is hard (#675) and in general I think we want to avoid this large performance pitfall.

My suggestion is to run with --memThreshold to limit which allocations are tracked, with the thought that highly concurrent allocations will tend to be small. In trying this I found that --memThreshold did not get rid of as much overhead as I was expecting, but I believe I have addressed that with chapel-lang/chapel#18465. Here are updated performance results with --memThreshold before and after that PR:

| config | Time |
| --- | --- |
| w/o memTrack | 0.19s |
| w/ memTrack | 144.50s |
| w/ threshold before | 33.06s |
| w/ threshold now | 0.22s |
@ronawho ronawho added the performance Performance needs improving label Sep 27, 2021
@ronawho ronawho self-assigned this Sep 27, 2021

ronawho commented Sep 27, 2021

FYI @reuster986 @mhmerrill @pierce314159. This is a large part of the reason operations that were doing lots of short-lived concurrent allocations were so slow (initial SipHash, current string regex, cast-to-str). I'm not really sure how this didn't occur to me before. In hindsight it's pretty obvious, and I kept being surprised how much overhead allocation added, but I guess I kept chalking it up to the short-lived nature of the operations.

There is still a non-trivial overhead to these allocations, so optimizing them out is beneficial, but using --memThreshold should significantly reduce the overhead for the cases we can't. If it's ok with you, my plan is to set the threshold at around 1MB or so. Anything above 1K would probably be enough, but I figure 1MB is large enough that aggregation buffers won't be impacted either -- those are long-lived allocations so it shouldn't matter, but it can't hurt.

reuster986 (Collaborator) commented:

@ronawho Good find! I am fine with a 1MB threshold, since the intent is really to track allocations on the scale of GB or larger anyway.


ronawho commented Oct 4, 2021

Using a threshold has significantly sped up operations with a lot of concurrent allocations. Casting to str, for example, is 10x faster with Chapel 1.25.0 and 100x faster with the main Chapel development branch.

This has also helped a bunch of other operations: https://chapel-lang.org/perf/arkouda/16-node-cs-hdr/?startdate=2021/08/28&enddate=2021/10/04&configs=release,nightly&graphs=coargsortperformance,groupbyperformance,stringgroupbyperformance (note the jump on 10/01).

I'm wondering if the performance benefit on main is something we should consider a 1.25.1 release for. Point releases can be a bit of a pain, but the performance benefits here can be quite significant (and we could roll in other changes like the Chapel regex optimizations mentioned in #917 (comment)).

reuster986 (Collaborator) commented:

I think these performance gains are pretty significant, so I would be in favor of a 1.25.1, if it's not too much trouble. Are there any other changes coming up that it would make sense to wait for?


ronawho commented Oct 4, 2021

Yeah, reducing allocations in the Chapel regex code and possibly some scan scalability improvements are coming up.

I also think it would be worthwhile to test out 1.25.0 first to make sure there are no correctness bugs, and if there are, we could address them then too. I'll get the conversation started about a 1.25.1 internally.


ronawho commented Jan 19, 2022

Since #1027, Arkouda recommends Chapel 1.25.1, which includes the memory tracking optimization, and there's minimal overhead from --memTrack now.

@ronawho ronawho closed this as completed Jan 19, 2022