
Concurrent memory allocation is slowed by memTrack #929

Closed
ronawho opened this issue Sep 27, 2021 · 6 comments
Labels
performance Performance needs improving


ronawho commented Sep 27, 2021

Discussed a little in #154, but --memTrack can seriously hurt the performance of concurrent allocations. Historically, most of Arkouda's operations allocate a few large arrays and relatively little beyond that. However, there were a few times recently where allocations were really hurting performance. The allocations were short-lived so a performance hit isn't surprising, but I was surprised by the magnitude of the hit, and I'm only now putting together that a lot of the overhead is from --memTrack serializing concurrent allocations.

Here's a concurrent allocation micro-benchmark that demonstrates the overhead. Results are from a 128-core AMD Rome CPU:

```chapel
use Time;
config const trials = 1_000_000;

var t: Timer; t.start();
coforall 1..here.maxTaskPar do
  for i in 1..trials do
    var s = i:string;
writeln(t.elapsed());
```
| config | Time |
| --- | --- |
| w/o memTrack | 0.19s |
| w/ memTrack | 144.50s |

In most cases we'd like to optimize out as much allocation as possible (#266, #917), but in other cases getting rid of the concurrent allocation is hard (#675) and in general I think we want to avoid this large performance pitfall.

My suggestion is to run with --memThreshold to limit which allocations are tracked, with the thought that highly concurrent allocations will tend to be small. In trying this I found that --memThreshold did not get rid of as much overhead as I was expecting, but I believe I have addressed that with chapel-lang/chapel#18465. Here are updated performance results with --memThreshold before and after that PR:

| config | Time |
| --- | --- |
| w/o memTrack | 0.19s |
| w/ memTrack | 144.50s |
| w/ threshold before | 33.06s |
| w/ threshold now | 0.22s |
@ronawho ronawho added the performance Performance needs improving label Sep 27, 2021
@ronawho ronawho self-assigned this Sep 27, 2021

ronawho commented Sep 27, 2021

FYI @reuster986 @mhmerrill @pierce314159. This is a large part of the reason operations that were doing lots of short-lived concurrent allocations were so slow (initial SipHash, current string regex, cast-to-str). I'm not really sure how this didn't occur to me before. In hindsight it's pretty obvious, and I kept being surprised how much overhead allocation added, but I guess I kept chalking it up to the short-lived nature of the operations.

There is still a non-trivial overhead to these allocations, so optimizing them out is beneficial, but using --memThreshold should significantly reduce the overhead for the cases we can't. If it's ok with you, my plan is to set the threshold at around 1MB or so. Anything above 1K would probably be enough, but I figure 1MB is large enough that aggregation buffers won't be impacted either -- those are long-lived allocations so it shouldn't matter, but it can't hurt.

reuster986 (Collaborator) commented:

@ronawho Good find! I am fine with a 1MB threshold, since the intent is really to track allocations on the scale of GB or larger anyway.


ronawho commented Oct 4, 2021

Using a threshold has significantly sped up operations with a lot of concurrent allocations. Casting to str, for example, is 10x faster with Chapel 1.25.0 and 100x faster with the main Chapel development branch.

This has also helped a bunch of other operations: https://chapel-lang.org/perf/arkouda/16-node-cs-hdr/?startdate=2021/08/28&enddate=2021/10/04&configs=release,nightly&graphs=coargsortperformance,groupbyperformance,stringgroupbyperformance (note the jump on 10/01).

I'm wondering if the performance benefit on main is something we should consider a 1.25.1 release for. Point releases can be a bit of a pain, but the performance benefits here can be quite significant (and we could roll in other changes like the Chapel regex optimizations mentioned in #917 (comment)).

reuster986 (Collaborator) commented:

I think these performance gains are pretty significant, so I would be in favor of a 1.25.1, if it's not too much trouble. Are there any other changes coming up that it would make sense to wait for?


ronawho commented Oct 4, 2021

Yeah, reducing allocations in the Chapel regex code and possibly some scan scalability improvements are coming up.

I also think it would be worthwhile to test out 1.25.0 first to make sure there are no correctness bugs, and if there are, we could address them then too. I'll get the conversation started about a 1.25.1 internally.


ronawho commented Jan 19, 2022

Since #1027, Arkouda recommends Chapel 1.25.1, which includes the memory tracking optimization, and there's minimal overhead from --memTrack now.

@ronawho ronawho closed this as completed Jan 19, 2022