SipHash underperforming #266
@LattigLT can you share the test/benchmark you were using to compare performance?
@ronawho I believe the Chapel benchmark was https://github.com/mhmerrill/arkouda/blob/master/test/SipHashSpeedTest.chpl. I haven't personally seen the Python benchmark code, but from what he described, I believe it was something like:

    from time import time
    x = list(range(10**6))
    start = time()
    s = set(x)
    t = time() - start
@ronawho here is the GitHub project for SipHash: https://github.com/veorq/SipHash
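For readers cross-checking a port against that reference, SipHash-2-4 is compact enough to sketch in plain Python. This is an illustrative reimplementation of the published algorithm, not code from either repository, and `siphash24` is a name chosen here:

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def _rotl(x, b):
    return ((x << b) | (x >> (64 - b))) & MASK64

def siphash24(key: bytes, data: bytes) -> int:
    """SipHash-2-4 over `data` with a 16-byte key, returning a 64-bit int."""
    assert len(key) == 16
    k0 = int.from_bytes(key[:8], 'little')
    k1 = int.from_bytes(key[8:], 'little')
    # Initialization constants spell "somepseudorandomlygeneratedbytes".
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    def sipround():
        nonlocal v0, v1, v2, v3
        v0 = (v0 + v1) & MASK64; v1 = _rotl(v1, 13); v1 ^= v0; v0 = _rotl(v0, 32)
        v2 = (v2 + v3) & MASK64; v3 = _rotl(v3, 16); v3 ^= v2
        v0 = (v0 + v3) & MASK64; v3 = _rotl(v3, 21); v3 ^= v0
        v2 = (v2 + v1) & MASK64; v1 = _rotl(v1, 17); v1 ^= v2; v2 = _rotl(v2, 32)

    # Compression: two SipRounds per 8-byte little-endian word.
    nblocks = len(data) // 8
    for i in range(nblocks):
        m = int.from_bytes(data[8 * i:8 * i + 8], 'little')
        v3 ^= m
        sipround(); sipround()
        v0 ^= m

    # Final block: remaining bytes plus the message length in the top byte.
    m = ((len(data) & 0xFF) << 56) | int.from_bytes(data[8 * nblocks:], 'little')
    v3 ^= m
    sipround(); sipround()
    v0 ^= m

    # Finalization: four more rounds after xoring 0xFF into v2.
    v2 ^= 0xFF
    for _ in range(4):
        sipround()
    return v0 ^ v1 ^ v2 ^ v3

# Worked example from the SipHash paper: 15-byte message 00..0e, key 00..0f.
print(hex(siphash24(bytes(range(16)), bytes(range(15)))))  # 0xa129ca6149be45e5
```

Matching the paper's worked example is a quick sanity check when porting the compression loop to another language.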
@reuster986 Thanks, I'm using SipHashSpeedTest.chpl to start investigating. Do you have a rate you're expecting/hoping for? I see ~10 MB/s per core, but I have no sense for whether that's good or bad.

edit: Oh, and do you have a reference code that you ported the Chapel code from?
See here for a python test. Here's what I get on my laptop (1.87GHz Haswell):

    $ tests/python_hash_rate.py --NINPUTS=1_000_000 --INPUTSIZE=8
    Hashed 1000000 blocks (8000000 bytes) in 0.136 seconds
    Rate = 55.93 MB/s

As opposed to the Chapel version (using a single core for comparison):

    $ test-bin/SipHashSpeedTest -nl 1 --NINPUTS=1_000_000 --INPUTSIZE=8 --dataParTasksPerLocale=1
    Hashed 1000000 blocks (8000000 bytes) in 1.389930e+00 seconds
    Rate = 5.489049e+00 MB/s

So the Python is about 10x faster, even with the overhead of objects and list traversal. From the back-of-the-envelope calculation by @LattigLT, I would expect in the neighborhood of 50-100 MB/s per core, depending on the number of clock cycles required by the various operations. And I did port the siphash implementation that @mhmerrill found here.
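For comparison on other machines, a self-contained version of the Python side of this measurement might look like the following. This is a sketch, not the actual tests/python_hash_rate.py script; it times CPython's builtin hash(), which uses a SipHash variant for bytes objects:

```python
import time

NINPUTS = 100_000      # smaller than the thread's 1_000_000, for a quick run
INPUTSIZE = 8

# Pre-build the inputs so that only hashing is timed.
blocks = [i.to_bytes(INPUTSIZE, 'little') for i in range(NINPUTS)]

start = time.time()
for b in blocks:
    hash(b)            # CPython hashes bytes objects with a SipHash variant
elapsed = time.time() - start

nbytes = NINPUTS * INPUTSIZE
rate = nbytes / elapsed / 1e6 if elapsed > 0 else float('inf')
print(f"Hashed {NINPUTS} blocks ({nbytes} bytes) in {elapsed:.3f} seconds")
print(f"Rate = {rate:.2f} MB/s")
```

Note that each bytes object is hashed exactly once here, so CPython's per-object hash caching does not skew the timing.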
Thanks -- that was really helpful. I have a branch that gets serial performance from ~8 MB/s to ~150 MB/s. It's getting around 3,200 MB/s on a 24-core machine, so core scalability looks pretty good.

The main change is to avoid domain creation and slices, both of which involve memory allocation. These operations are cheap in the context of large parallel operations, but there's a non-trivial overhead for creating them, which becomes a lot more apparent when they're only used once and the operations around them are very fast/short scalar ops; we have seen a few cases where the overhead becomes too high for small kernels. chapel-lang/chapel#8203 has some more discussion on this.

I'll work on getting that branch into a PR and see if there are any other micro-optimizations to be made. I'll probably open a new Chapel issue for this too.
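The effect isn't Chapel-specific. As a loose analogy (this Python sketch is illustrative only, and says nothing about how Chapel implements slices), building a tiny temporary slice on every iteration costs far more than the arithmetic it wraps:

```python
import timeit

data = list(range(10_000))

def with_slices():
    # A fresh one-element list is allocated on every iteration.
    return sum(sum(data[i:i + 1]) for i in range(len(data)))

def with_indexing():
    # Plain indexing: same arithmetic, no per-iteration allocation.
    return sum(data[i] for i in range(len(data)))

assert with_slices() == with_indexing()
slice_t = timeit.timeit(with_slices, number=20)
index_t = timeit.timeit(with_indexing, number=20)
print(f"slice: {slice_t:.3f}s  index: {index_t:.3f}s")
```

The allocation cost is invisible when the slice is large and reused, and dominant when it's tiny and used once, which mirrors the situation in the hash inner loop.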
And extern'ing out to the C reference you ported, I see around 225 MB/s. So there's either some more performance we could squeeze out of the Chapel port, or we could call the C version (it looks like it's public domain). Let me know if you have any preference for which direction you want to go.
@ronawho this is amazing! I didn't realize domain creation and slicing were so debilitating in an inner loop like that. Makes me wonder where else I've been using them.

Regarding whether to use the externed C or the native Chapel: does the C version require Chapel built with LLVM? Because I would rather avoid that dependency, if possible... I would also be curious to know what modifications would bring the Chapel code up to C performance, but that's not a sufficient reason to spend significant time on it. If we can use the C without LLVM, I'd say let's go for that, unless you can easily amend the Chapel code. Thanks again!
Extern'ing to C would not require LLVM. LLVM is only required for the extern-block feature, and here we'd just be using Chapel's normal interop features. Here's my prototype branch for that: ronawho@86bb20e

I can play with the Chapel version for another hour or so and see if I can close the gap. If we can get the performance closer (say within 10%), do you have a preference between the extern and pure-Chapel approaches? I personally don't have a preference. I would probably just use the C version, but I also like being able to showcase that Chapel can perform well for large distributed computations and small local ones (though sometimes you give up some productivity features for the latter).
@ronawho I'd like to see the Chapel version perform as close to the C as possible, unless this causes a large amount of work on your part.
Ok, sounds good. I'll push on the Chapel version.
@reuster986 pulled in my initial optimizations and made some others in #269. I now see ~190 MB/s serial performance on master (bumping the problem size by 100x to get more stable results):

    $ git rev-parse --short HEAD
    $ chpl test/SipHashSpeedTest.chpl -M src/ --fast
    $ ./SipHashSpeedTest -nl 1 --NINPUTS=100_000_000 --INPUTSIZE=8 --dataParTasksPerLocale=24
I have some more experiments in master...ronawho:more-opt-sip-hash. A lot of this is really getting into the weeds, and these are not typically optimizations you have to think about for HPC kernels or longer function calls. To some degree, optimizing a hash function is about figuring out how to get the compiler to generate the exact assembly you want. All that to say, don't take these types of optimizations to heart or focus on them too much when writing other parts of Arkouda. Individual commits have more details, but here's a summary:

With the non-gross hacks in that PR, I see around 200 MB/s, and with them I see 310 MB/s. The gross hacks (using DefaultRectangular and using c_ptr) will completely break if the portion of msg you're operating on isn't all local, and I'm not sure what the usage is yet. I would imagine that's usually true, otherwise your comm overhead would probably dominate your execution time and the hash speed wouldn't matter as much, but I'm not sure if it's always true.

I was kinda surprised that I was able to get performance better than the C reference: I saw 225 MB/s with it, compared to 310 MB/s for the grossest Chapel version. This makes me wonder a little if the bar for how fast SipHash can be is higher. I saw HighwayHash achieves 10 GB/s and claims to be ~5x faster than SipHash, so it's possible the ideal performance is higher than I was thinking.

edit: Oh, I missed that those results were with 1,024-byte inputs. If I bump the input size, I see around 2.2 GB/s for the most optimized Chapel version, so that fits with being 5x slower than HighwayHash:

    $ ./SipHashSpeedTest -nl 1 --NINPUTS=10_000_000 --INPUTSIZE=1024 --dataParTasksPerLocale=1

How fast do you need your hash to be? Is it fast enough now, or is this an area where it needs to be as fast as possible?
@ronawho impressive! Our current use case requires us to hash roughly 500 GB of data at "interactive speed" using ~1200 cores. If everything were blocked uniformly as in ... Actually, maybe a better thing to do is craft a test that cuts a ...
Yeah, it would be great to have a realistic benchmark, but the sizes you mentioned also make me think some more optimization of hashing is likely to be worthwhile. A single chunk spanning a locale boundary does make optimization trickier, since we need some of the locality checks that make array access more expensive, but we should be able to do something still. There are some Chapel 2.0 tasks I need to work on, so I'm fine waiting. I'm also happy to help with a benchmark, but I don't think I'll beat you to it.
I have a cleaner version that should be correct even when some msg's are spread across 2 nodes: master...ronawho:more-opt-sip-hash-3. It effectively checks if D.low and D.high are local, and if so uses c_ptr; otherwise it falls back to the current implementation. I think it's still worth tuning on a more realistic benchmark, but the results for SipHashSpeedTest look promising.

Serial performance (MB/s):

And 32-node (896-core) XC, using odd sizes to ensure some hashes span 2 nodes (GB/s):
@ronawho I just committed a more realistic test to the sip-hash-rate branch. With

    chpl --fast --print-passes -M src test/SipHashSpeedTest.chpl -o test-bin/SipHashSpeedTest
    test-bin/SipHashSpeedTest --NINPUTS=1_000_000 --INPUTSIZE=64 --dataParTasksPerLocale=1 --LENGTHS=variable

I get:

    Hashed 1000000 blocks (65064916 bytes) in 0.25 seconds
    Rate = 248.87 MB/s

So far, on my laptop, I'm only seeing a 10-15% slowdown from fixed length to variable length at INPUTSIZE=64.
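For anyone reproducing the variable-length distribution outside Chapel, lengths with the same logMean/logStd parameterization used elsewhere in this thread can be generated in Python. `lognormal_lengths` is a name invented here; the real generator is newRandStringsLogNormalLength in Arkouda:

```python
import math
import random

def lognormal_lengths(n, inputsize, seed=0):
    # Same parameterization as the Chapel test: logMean = log(INPUTSIZE)/2 and
    # logStd = sqrt(2*logMean), so the distribution's mean works out to INPUTSIZE
    # (exp(logMean + logStd**2 / 2) = exp(2*logMean) = INPUTSIZE).
    log_mean = math.log(inputsize) / 2
    log_std = math.sqrt(2 * log_mean)
    rng = random.Random(seed)
    return [max(1, int(rng.lognormvariate(log_mean, log_std)))
            for _ in range(n)]

lengths = lognormal_lengths(1_000, 64, seed=1)
print(min(lengths), max(lengths))
```

The heavy right tail of this distribution is what makes the variable-length case interesting: a few chunks are far longer than the mean, so uniform blocking assumptions break down.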
Hmm. Single-node performance looks good to me, but I'm seeing 2 issues for multi-locale:
To the first question, the length code is computing the length of each chunk:

    forall (l, s, i) in zip(lengths, segs, D) {
      if i == D.high {
        l = vals.size - s;
      } else {
        l = segs[i+1] - s;
      }
    }

For the second point, I would say there is significant non-local data, but I would be surprised if most of it is non-local. Would something like this work better?

    forall (h, i, l) in zip(hashes, segs, lengths) {
      on vals[i] {
        h = sipHash128(vals, i..#l);
      }
    }

This would restore near-perfect locality, at the cost of lots of remote executions.
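The layout being computed here (start offsets into one flat values array) is easy to model outside Chapel. A minimal Python analogue of the length calculation, with invented names, is:

```python
def segment_lengths(segs, total_size):
    """Length of each segment, given its start offsets and the total value count.

    Every segment runs from its own start offset to the next segment's start;
    the last segment runs to the end of the values array.
    """
    return [(segs[i + 1] if i + 1 < len(segs) else total_size) - segs[i]
            for i in range(len(segs))]

# Three segments starting at offsets 0, 3, and 5 in a 9-byte values array.
print(segment_lengths([0, 3, 5], 9))  # [3, 2, 4]
```

This makes explicit why the last index needs special handling in the Chapel loop: there is no following offset to subtract from.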
Curiously, the loop-based length calculation is much, much faster: 0.035s vs 25.585s for ...

For 4 nodes I see ~70 GB/s for fixed length and ~1 GB/s for variable length (and much worse scaling at 32 nodes). I see around 7 GB/s for the on-stmt version. Compared to the speed of hashing, the on-stmt and PUT'ing the hash value back will have overhead:

I would guess we'll need to do a coforall. Something like:

    coforall loc in Locales do on loc {
      var myVals = vals[vals.localSubdomain()];
      // TODO scan and compute local start / lengths
      var agg = newDstAggregator(2*uint);
      for (h, i, l) in zip(myHashes, mySegs, myLengths) {
        agg.copy(h, sipHash128(myVals, i..#l));
      }
      agg.flush();
    }
A different option might be to pad so that all chunks have a uniform size, but that is likely to have an unacceptably high memory overhead.
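A quick calculation shows why padding is unattractive for skewed length distributions. This is a sketch with made-up lengths, not measurements from the thread:

```python
def padding_blowup(lengths):
    """Ratio of padded storage (every chunk padded to the max length) to actual."""
    return len(lengths) * max(lengths) / sum(lengths)

# Mostly-short chunks with one long outlier: padding multiplies memory ~4x.
lengths = [8] * 3 + [1000]
print(round(padding_blowup(lengths), 2))  # 3.91
```

With heavy-tailed (e.g. lognormal) length distributions, a single long chunk sets the pad width for every chunk, so the blowup grows with the tail rather than staying bounded.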
master...ronawho:more-opt-sip-hash-4 uses bulk comm to GET remote messages instead of GET'ing them an element at a time. This has better performance than the on-stmt version, but is still 3.5x slower than the fixed version at 32 nodes:
This might be ok as a stepping stone if you need better performance soon, but I think the manual coforall will still probably be needed for best performance. The bulk-GET version works ok for this test because, like you said, most of the data is still local, but I think that only happens in this case because the size per locale is still relatively uniform. If you have real-world data where 90% of the strings are small and fit on a single node and the rest are spread out across the other nodes, I would expect this to have poor locality.

I'm timing out on this for the next few days. Let me know if you'd like me to try and PR this improvement, or hold off until I have time to look at the manual coforall version.
@ronawho I think it's worth making a PR with what you have now. I will then be able to take some measurements and talk to users to see if diving into the manual coforall is worthwhile.
Sounds good. I will clean up that branch a bit and PR.

The length calculation being slow is a bit subtle, so I wanted to call out the reason. The problem is that in SipHashSpeedTest, the calculation uses local domains: https://github.com/mhmerrill/arkouda/blob/5e1581a48e7a1eadb85059d437c276b51e2e5b2c/test/SipHashSpeedTest.chpl#L38-L40

Since chapel-lang/chapel#12931, slice expressions are governed by the slicing domain. Since the domains used to slice are local, your length calculation is local, which means everything is running on locale 0 using GETs/PUTs. I believe the actual length calculation in arkouda is slicing with ranges. Today, this gives you distributed slices, but I'm not sure that will be true longer term. @bradcray will know more about that, though.

Here are a couple of variations for computing the length (not positive they're all correct, but I think they are):

    use RandArray;
    use Time;

    config const NINPUTS = 100_000;
    config const INPUTSIZE = 8;

    enum operation {loop, domainSlice, rangeSlice, interior};

    const logMean:real = log(INPUTSIZE:real)/2;
    const logStd:real = sqrt(2*logMean);
    var (segs, vals) = newRandStringsLogNormalLength(NINPUTS, logMean, logStd);
    const D = segs.domain;
    var lengths: [D] int;

    for op in operation {
      resetCommDiagnostics(); startCommDiagnostics();
      var t = new Timer(); t.start();
      if op == operation.loop {
        forall (l, s, i) in zip(lengths, segs, D) {
          if i == D.high then l = vals.size - s;
          else l = segs[i+1] - s;
        }
      } else if op == operation.domainSlice {
        lengths[{D.low..D.high-1}] = segs[{D.low+1..D.high}] - segs[{D.low..D.high-1}];
        lengths[D.high] = vals.size - segs[D.high];
      } else if op == operation.rangeSlice {
        lengths[D.low..D.high-1] = segs[D.low+1..D.high] - segs[D.low..D.high-1];
        lengths[D.high] = vals.size - segs[D.high];
      } else if op == operation.interior {
        const s = D.size-1;
        lengths[D.interior(-s)] = segs[D.interior(s)] - segs[D.interior(-s)];
        lengths[D.high] = vals.size - segs[D.high];
      }
      t.stop();
      stopCommDiagnostics();
      writef("%-12s: %.3dr first-%s last-%s\n", op:string, t.elapsed(), getCommDiagnostics()[0]:string, getCommDiagnostics()[numLocales-1]:string);
    }

Run on 16 nodes of an InfiniBand cluster:
The local domain slice is really bad, since everything is running on locale 0. The others are all good. The loop is fastest, since there's no distributed domain creation, but I would expect the performance difference to be negligible compared to the time to hash. I'd probably avoid the rangeSlice, since I believe we expect that to behave like a local domain in the future.
Improve comm diags and speed up length calculation (Bears-R-Us#266 (comment))
Wow, that domain slicing is a real doozie. We use it all over the arkouda codebase (although I'm not sure how much of it is non-local), and I had no idea it was so bad. Thanks for pointing this out!
This makes me want to cry (the state of things, not what you said :) ). I was on a tear to make Chapel slicing much cheaper a while back, but then had to put it aside to work on Chapel 2.0 issues and other "distractions." Speaking of that effort, @ronawho, did you try getting timings with the chpl_serializeSlices flag (which you may remember made many cases much better, but a few a bit worse, so we didn't turn it on by default)?
For a case like

    lengths[{D.low..D.high-1}] = segs[{D.low+1..D.high}] - segs[{D.low..D.high-1}];

you really want that iteration to be distributed.
Got it, thanks for clarifying (I've only given this the most cursory look so far).
With #287 in, here is my current understanding/summary:

I have some concern that highly variable message sizes where most data is remote will hurt performance a lot and may require the manual coforall version. I think we're waiting for results on real data at this point to gauge whether that's needed, so I'm going to un-assign myself, but definitely let me know what you find when operating on real data.
On real data, it's looking like the hashing step is only taking about 5% of the overall sorting time, so I'm going to call this good for now, with the option to revisit later.
Just for archival purposes: @reuster986 switched to the manual coforall version in #1060.
The current implementation of SipHash found in arkouda takes ~6.5 seconds to hash 1,000,000 bytes on a 16-core, 2.5 GHz Haswell processor. By way of comparison, hashing a million integers with Python3 (which uses SipHash) on the same hardware takes ~0.23 seconds.

Note: I specified integers because I haven't taken a good look at how Python works under the hood. The integers all ranged from 0-256, but I don't necessarily expect that to have much bearing on runtime in a Python context.

Having taken a conservative tally of ~400 cycles per block of SipHash, the expected time to hash 1,000,000 bytes @2.5GHz is ~0.16 seconds. Any help in figuring out what could be going wrong would be greatly appreciated.
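The back-of-the-envelope estimate above works out as follows, treating each of the million inputs as a single SipHash block:

```python
blocks = 1_000_000          # one SipHash block per input
cycles_per_block = 400      # the conservative tally quoted above
clock_hz = 2.5e9            # 2.5 GHz Haswell
expected_seconds = blocks * cycles_per_block / clock_hz
print(expected_seconds)     # 0.16
```

So the observed ~6.5 seconds is roughly 40x the estimated lower bound, which is what prompted the investigation.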