SipHash underperforming #266

Closed

LattigLT opened this issue Jan 23, 2020 · 31 comments

Comments

@LattigLT
Collaborator

The current implementation of SipHash in arkouda takes ~6.5 seconds to hash 1,000,000 bytes on a 16-core, 2.5 GHz Haswell processor. By way of comparison, hashing a million integers with Python 3 (which uses SipHash) on the same hardware takes ~0.23 seconds.

Note: I specified integers because I haven't taken a close look at how Python works under the hood. The integers all ranged from 0-256, but I don't expect that to have much bearing on runtime in a Python context.

Using a conservative estimate of ~400 cycles per block of SipHash, the expected time to hash 1,000,000 bytes at 2.5 GHz is ~0.16 seconds. Any help in figuring out what could be going wrong would be greatly appreciated.
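For reference, the arithmetic behind that estimate (treating each block as one ~400-cycle hash):

1,000,000 blocks * ~400 cycles/block ≈ 4.0e8 cycles
4.0e8 cycles / 2.5e9 cycles/s ≈ 0.16 seconds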

@ronawho ronawho self-assigned this Jan 23, 2020
@ronawho
Contributor

ronawho commented Jan 23, 2020

@LattigLT can you share the test/benchmark you were using to compare performance?

@reuster986
Collaborator

@ronawho I believe the Chapel benchmark was https://github.com/mhmerrill/arkouda/blob/master/test/SipHashSpeedTest.chpl

I haven't personally seen the python benchmark code, but from what he described, I believe it was something like:

from time import time

x = list(range(10**6))
start = time()
s = set(x)   # constructing the set hashes every element
t = time() - start

@mhmerrill
Contributor

mhmerrill commented Jan 23, 2020

@ronawho Here is the GitHub project for SipHash: https://github.com/veorq/SipHash
There are some performance charts in the SipHash paper: https://131002.net/siphash/siphash.pdf
Here is a page on SipHash: https://131002.net/siphash/
Here is the OpenSSL implementation: https://github.com/openssl/openssl/tree/master/crypto/siphash
Is it worth trying to link in the OpenSSL version?

@ronawho
Contributor

ronawho commented Jan 23, 2020

@reuster986 Thanks, I'm using SipHashSpeedTest.chpl to start investigating. Do you have a rate you're expecting/hoping for? I see ~10 MB/s per core, but I have no sense for whether that's good or bad.

edit:

Oh, and do you have a reference code that you ported the Chapel code from?

@reuster986
Collaborator

See here for a Python test. Here's what I get on my laptop (1.87 GHz Haswell):

$ tests/python_hash_rate.py --NINPUTS=1_000_000 --INPUTSIZE=8
Hashed 1000000 blocks (8000000 bytes) in 0.136 seconds
Rate = 55.93 MB/s

As opposed to the Chapel version (using single core for comparison):

$ test-bin/SipHashSpeedTest -nl 1 --NINPUTS=1_000_000 --INPUTSIZE=8 --dataParTasksPerLocale=1
Hashed 1000000 blocks (8000000 bytes) in 1.389930e+00 seconds
Rate = 5.489049e+00 MB/s

So the Python is about 10x faster, even with the overhead of objects and list traversal. From the back-of-the-envelope calculation by @LattigLT, I would expect something in the neighborhood of 50-100 MB/s per core, depending on the number of clock cycles required by the various operations.
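As a rough check on that range (at 2.5 GHz, one 8-byte block per hash):

2.5e9 cycles/s / 400 cycles per block * 8 bytes ≈ 50 MB/s
2.5e9 cycles/s / 200 cycles per block * 8 bytes ≈ 100 MB/s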

And I did port the siphash implementation that @mhmerrill found here.

@ronawho
Contributor

ronawho commented Jan 25, 2020

Thanks -- that was really helpful. I have a branch that gets serial performance from ~8 MB/s to ~150 MB/s. It's getting around 3,200 MB/s on a 24 core machine, so core scalability looks pretty good.

The main change is to avoid domain creation and slices, both of which involve memory allocation. These operations are cheap in absolute terms, but the overhead of creating them becomes a lot more apparent when they're only used once and the surrounding operations are very fast/short scalar ops. chapel-lang/chapel#8203 has some more discussion on this.
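To make the kind of change concrete, here's a small self-contained sketch (illustrative only, not code from the branch) that assembles 8-byte little-endian words from a byte array either through a per-word slice or through plain scalar indexing:

use Time;

config const n = 1_000_000;                // number of 8-byte words
var msg: [0..#(8*n)] uint(8);
msg = 1;

// Per-word slice: every call creates a domain and an array view.
proc wordBySlice(i: int): uint(64) {
  var w: uint(64);
  for (b, j) in zip(msg[i*8..#8], 0..7) do
    w |= (b: uint(64)) << (8*j);
  return w;
}

// Plain indexing: only scalar ops, nothing allocated per word.
proc wordByIndex(i: int): uint(64) {
  var w: uint(64);
  for j in 0..7 do
    w |= (msg[i*8 + j]: uint(64)) << (8*j);
  return w;
}

var t = new Timer();
var sum: uint(64);

t.start();
for i in 0..#n do sum ^= wordBySlice(i);
t.stop();
writeln("slice: ", t.elapsed(), " s");

t.clear();
t.start();
for i in 0..#n do sum ^= wordByIndex(i);
t.stop();
writeln("index: ", t.elapsed(), " s (checksum ", sum, ")");

Compiled with --fast, the indexing version should come out well ahead for the reasons above.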

I'll work on getting that branch into a PR and see if there's any other micro-optimizations to be made.

I'll probably open a new Chapel issue for this too. Domain/slice creation is cheap in the context of large parallel operations, but we have seen a few cases where the overhead becomes too high for small kernels.

@ronawho
Contributor

ronawho commented Jan 26, 2020

And extern'ing out to the C reference you ported, I see around 225 MB/s. So there's either some more performance we could squeeze out of the Chapel port, or we could call the C version (it looks like it's public domain). Let me know if you have any preferences for which direction you want to go.

@reuster986
Collaborator

@ronawho this is amazing! I didn't realize domain creation and slicing were so debilitating in an inner loop like that. Makes me wonder where else I've been using them.

Regarding whether to use the externed C or the native Chapel: does the C version require Chapel built with LLVM? Because I would rather avoid that dependency, if possible... I would also be curious to know what modifications would bring the Chapel code up to C performance, but that's not a sufficient reason to spend significant time on it. If we can use the C without LLVM, I'd say let's go for that, unless you can easily amend the Chapel code.

Thanks again!

@ronawho
Contributor

ronawho commented Jan 27, 2020

Extern'ing to C would not require LLVM; LLVM is only required for the extern-block feature, and here we'd just be using Chapel's normal interop features. Here's my prototype branch for that: ronawho@86bb20e

I can play with the Chapel version for another hour or so and see if I can close the gap. If we can get the performance closer (say within 10%), do you have a preference between the extern and pure Chapel approaches? I personally don't have a preference. I would probably just use the C version, but I also like being able to showcase that Chapel can perform well for both large distributed computations and small local ones (though sometimes you give up some productivity features for the latter).

@mhmerrill
Contributor

mhmerrill commented Jan 27, 2020

@ronawho I'd like to see the Chapel version perform as close to the C as possible, unless that causes a large amount of work on your part.

@ronawho
Contributor

ronawho commented Jan 27, 2020

Ok, sounds good. I'll push on the Chapel version.

@ronawho
Contributor

ronawho commented Jan 29, 2020

@reuster986 pulled in my initial optimizations and made some others in #269. I now see ~190 MB/s serial performance on master (bumping the problem size by 100x to get more stable results):

$ git rev-parse --short HEAD
63e2428

$ chpl test/SipHashSpeedTest.chpl -M src/ --fast
$ ./SipHashSpeedTest -nl 1 --NINPUTS=100_000_000 --INPUTSIZE=8 --dataParTasksPerLocale=1

Rate = 189.00 MB/s

$ ./SipHashSpeedTest -nl 1 --NINPUTS=100_000_000 --INPUTSIZE=8 --dataParTasksPerLocale=24

Rate = 4036.16 MB/s

@ronawho
Contributor

ronawho commented Jan 29, 2020

I have some more experiments in master...ronawho:more-opt-sip-hash. A lot of this is really getting into the weeds, and these are not typically optimizations you have to think about for HPC kernels or longer function calls. To some degree, optimizing a hash function is about figuring out how to get the compiler to generate the exact assembly you want. All that to say: don't take these types of optimizations to heart or focus on them too much when writing other parts of Arkouda.


Individual commits have more details, but here's a summary:

  • There's a change to remove throws, which has a minor perf overhead (Is lower-overhead error handling possible? chapel-lang/chapel#9380)

  • There are also a few micro-optimizations from unrolling loops and making things param, but I'm not sure they had any impact.

  • The key thing I'm seeing is that there is overhead for array access. Accessing a block-distributed array involves a locality check, which has a non-zero amount of overhead. Something like msg[i] will turn into something roughly like:

    if (msg.localArray != nil && i >= msg.localArray.domain.low && i <= msg.localArray.domain.high) {
      return msg.localArray[i];
    } else {
      remoteRef = msg.slowIshCodeToFindRemoteAddress(i);
      ret = GET(remoteRef);
      return ret;
    }

    I see a decent speedup if I switch the benchmark from BlockDist to DefaultRectangular (a local array).

  • I can get an even further speedup by getting a c_ptr reference to the start of the array. I think this is because Chapel arrays have a slight overhead for random indexing since our arrays aren't 0-based. We store an offset pointer, which adds some extra indirection. I'm not sure if there's actually overhead in this, or if the code we generate just obfuscates things enough that the backend can't fully optimize. In this case I think what we really want is an 8-byte load followed by a bswap, so we might have to play some games to try and get that.

With the non-gross hacks in that PR, I see around 200 MB/s, and with them I see 310 MB/s. The gross hacks (using DefaultRectangular and using c_ptr) will completely break if the portion of msg you're operating on isn't all local, and I'm not sure what the usage is yet. I would imagine that's usually true, otherwise your comm overhead would probably dominate your execution time and the hash speed wouldn't matter as much, but I'm not sure if it's always true.
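For reference, here's a rough sketch of the c_ptr trick (illustrative only, and only valid when the bytes being read are local to the current locale; CTypes is the module name in current Chapel, older releases exposed c_ptrTo elsewhere):

use CTypes;

// Assemble a little-endian 64-bit word starting at msg[off] through a raw
// pointer, skipping the per-access locality/bounds logic. Ideally this boils
// down to a single 8-byte load (plus a bswap on big-endian targets).
proc load64(ref msg: [] uint(8), off: int): uint(64) {
  const p = c_ptrTo(msg[off]);
  var w: uint(64);
  for j in 0..7 do
    w |= (p[j]: uint(64)) << (8*j);
  return w;
}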


I was kinda surprised that I was able to get performance better than the C reference: I saw 225 MB/s with it, compared to 310 MB/s for the grossest Chapel version. This makes me wonder whether the bar for how fast SipHash can be is higher than I assumed.

I saw HighwayHash achieves 10 GB/s and claims to be ~5x faster than SipHash, so it's possible the ideal performance is higher than I was thinking.

edit: Oh, I missed that those results were with 1,024-byte inputs. If I bump the input size, I see around 2.2 GB/s for the most optimized Chapel version, so that fits with being ~5x slower than HighwayHash.

$ ./SipHashSpeedTest -nl 1 --NINPUTS=10_000_000 --INPUTSIZE=1024 --dataParTasksPerLocale=1

Rate = 2247.73 MB/s

How fast do you need your hash to be? Is it fast enough now, or is this an area where it needs to be as fast as possible?

@reuster986
Collaborator

@ronawho impressive!

Our current use case requires us to hash roughly 500 GB of data at "interactive speed" using ~1200 cores. If everything were blocked uniformly as in SipHashSpeedTest.chpl, and assuming perfect multi-locale scaling at the rates you're seeing, the hashing would take about 3 seconds. This would be sufficiently fast if we could achieve it, but the chunks of data we're hashing vary in size with some heavy-tailed distribution (average is about 100 bytes, max length is in the thousands). Also, the chunks are stored in a flat BlockDist array, so up to numLocales - 1 chunks will span a locale boundary. Given these two considerations, I would think tuning the work distribution would start to matter a lot, maybe more than tuning hash speed. The real answer is: I need to test the performance on real data, but unfortunately I won't be able to until next week.

Actually, maybe a better thing to do is craft a test that cuts a BlockDist array of random bytes into variable-sized chunks (ideally with chunk sizes drawn from something like a log-normal distribution) and hashes those chunks in a multi-locale setting. Then we can explore scaling and speed on something more like the real data. Unfortunately, for the same reason as above, I won't be able to make such a test until next week. Would you want to take a crack at it? If Chapel has a way to generate standard normal random numbers, I can help turn that into the appropriate log-normal distribution.
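For the length generation, here's a minimal sketch of one way to get log-normally distributed sizes from uniform random draws (Box-Muller); the parameterization makes the mean length equal to a configurable INPUTSIZE, but the names are illustrative rather than taken from any existing test:

use Random, Math;

config const NINPUTS = 1_000_000;
config const INPUTSIZE = 64;            // desired mean block length

// Pick mu/sigma so the log-normal mean exp(mu + sigma^2/2) equals INPUTSIZE.
const logMean = log(INPUTSIZE: real) / 2;
const logStd  = sqrt(2 * logMean);

var u1, u2: [0..#NINPUTS] real;
fillRandom(u1);
fillRandom(u2);

var lengths: [0..#NINPUTS] int;
forall (l, a, b) in zip(lengths, u1, u2) {
  // Box-Muller: two uniform(0,1) draws -> one standard normal draw.
  const z = sqrt(-2 * log(max(a, 1e-300))) * cos(2 * pi * b);
  l = max(1, round(exp(logMean + logStd * z)): int);   // at least 1 byte
}

writeln("total bytes: ", + reduce lengths);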

@ronawho
Contributor

ronawho commented Jan 30, 2020

Yeah, it would be great to have a realistic benchmark, but the sizes you mentioned also make me think some more optimization of hashing is likely to be worthwhile. A single chunk spanning a locale boundary does make optimization trickier since we need some of the locality checks that make array access more expensive, but we should be able to do something still.

There are some Chapel 2.0 tasks I need to work on, so I'm fine waiting. I'm also happy to help with a benchmark, but I don't think I'll beat you to it.

@ronawho
Contributor

ronawho commented Feb 4, 2020

I have a cleaner version that should be correct even when some msgs are spread across 2 nodes: master...ronawho:more-opt-sip-hash-3. It effectively checks whether D.low and D.high are local and, if so, uses c_ptr; otherwise it falls back to the current implementation.
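Roughly, the shape of that check (just a sketch: it assumes arkouda's existing sipHash128(msg, range) returning a 2*uint, and sipHash128Local is a hypothetical name for the c_ptr fast path):

proc hashRange(msg: [] uint(8), r: range): 2*uint {
  const myPart = msg.localSubdomain();
  if myPart.contains(r.low) && myPart.contains(r.high) {
    // The whole range lives on this locale, so the c_ptr-based path is safe.
    return sipHash128Local(msg, r);
  } else {
    // The range spans locales; fall back to the generic implementation.
    return sipHash128(msg, r);
  }
}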

I think it's still worth tuning on a more realistic benchmark, but the results for SipHashSpeedTest look promising. Serial performance (MB/s):

$ ./SipHashSpeedTest -nl 1 --NINPUTS=10_000_000 --INPUTSIZE=$SIZE --dataParTasksPerLocale=1
SIZE   Before      After
10     198 MB/s    266 MB/s
100    482 MB/s    1339 MB/s
1000   564 MB/s    1911 MB/s

And 32 node (896 core) XC using odd sizes to ensure some hashes span 2 nodes (GB/s):

./SipHashSpeedTest -nl 32 --NINPUTS=100_000_000_000 --INPUTSIZE=11
./SipHashSpeedTest -nl 32 --NINPUTS=10_000_000_000 --INPUTSIZE=101
./SipHashSpeedTest -nl 32 --NINPUTS=1_000_000_000 --INPUTSIZE=1001
SIZE   Before      After
11     104 GB/s    112 GB/s
101    203 GB/s    676 GB/s
1001   231 GB/s    1267 GB/s

@reuster986
Collaborator

@ronawho I just committed a more realistic test to the sip-hash-rate branch. With --LENGTHS=variable, SipHashSpeedTest now generates blocks with log-normally distributed sizes, which should better reflect real use cases. In this mode, it interprets INPUTSIZE as the mean block length. Example run on my laptop:

chpl --fast --print-passes -M src test/SipHashSpeedTest.chpl -o test-bin/SipHashSpeedTest

test-bin/SipHashSpeedTest --NINPUTS=1_000_000 --INPUTSIZE=64 --dataParTasksPerLocale=1 --LENGTHS=variable
Hashed 1000000 blocks (65064916 bytes) in 0.25 seconds
Rate = 248.87 MB/s

So far, on my laptop, I'm only seeing a 10-15% slowdown from fixed length to variable length at --INPUTSIZE=64, so that's encouraging. But I'll hand it over to you now :)

@ronawho
Contributor

ronawho commented Feb 4, 2020

Hmm. Single-node performance looks good to me, but I'm seeing two issues for multi-locale:

  • The length calculation seems pretty slow. What computation is that code doing? (What would the loop-based version of it look like?)
  • There is a ton of comm for the sipHash128 call itself, which indicates that most of the data is remote and not local.

@reuster986
Collaborator

To the first question: the length code is computing the length of each chunk. segs holds the starting index of each chunk, so the code computes each length by taking the difference between the next index and the current index (handling the edge case for the last index). The loop code looks like:

forall (l, s, i) in zip(lengths, segs, D) {
  if i == D.high {
    l = vals.size - s;
  } else {
    l = segs[i+1] - s;
  }
}

For the second point, I would say there is significant non-local data, but I would be surprised if most of it is non-local. vals holds the variable-length chunks to be hashed, and segs tells where each chunk's data begins and ends. vals and segs are both block-distributed arrays, but their locality might not line up. The current code ties locality of execution to segs, effectively assigning an equal number of chunks to each locale. If the chunks were fixed-length, this would result in near-perfect locality (at most numLocales - 1 chunks would span a locale boundary). However, with variable-length chunks, locality is only approximate, i.e. a locale might get assigned a bunch of long chunks, and the data for those chunks would start to run over onto the next locale's part of vals.

The right way to do this loop might be with a coforall, where each locale grabs its local part of vals and seeks to the correct (possibly non-local) offset in segs. This would also involve some communication, though, plus the time spent seeking. I don't have a sense for whether it would be better.

Would something like this work better?

forall (h, i, l) in zip(hashes, segs, lengths) {
  on vals[i] {
    h = sipHash128(vals, i..#l);
  }
}

This would restore near-perfect locality, at the cost of lots of remote executions.

@ronawho
Contributor

ronawho commented Feb 4, 2020

Curiously, the loop-based length calculation is much, much faster: 0.035s vs 25.585s for -nl 32 --NINPUTS=100_000_000 --INPUTSIZE=64. Something to look at later, but for now I'm just switching to the loop-based version.


For 4 nodes I see ~70 GB/s for fixed length, ~1 GB/s for variable length (and much worse scaling at 32 nodes).

I see around 7 GB/s for the on-stmt version. Relative to the speed of hashing, the on-stmt and the PUT of the hash value back add noticeable overhead:

$ ./SipHashSpeedTest -nl 4 --NINPUTS=1_000_000_001 --INPUTSIZE=64 --LENGTHS=fixed --COMMDEBUG
(get = 55, execute_on_nb = 3) (get = 39) (get = 23) (<no communication>)
Hashed 1000000001 blocks (64000000064 bytes) in 0.86 seconds
Rate = 71091.97 MB/s

$ ./SipHashSpeedTest -nl 4 --NINPUTS=100_000_001 --INPUTSIZE=64 --LENGTHS=variable --COMMDEBUG
(get = 2626579, execute_on_nb = 3) (get = 2235647) (get = 1919529) (<no communication>)
Hashed 100000001 blocks (6455814272 bytes) in 5.48 seconds
Rate = 1123.24 MB/s

$ ./SipHashSpeedTest-on -nl 4 --NINPUTS=100_000_001 --INPUTSIZE=64 --LENGTHS=variable --COMMDEBUG
(get = 19636, put = 9770, execute_on_nb = 3) (get = 93115, put = 46452, execute_on = 9770) (get = 87778, put = 43847, execute_on = 46452) (get = 7, execute_on = 43847)
Hashed 100000001 blocks (6451327520 bytes) in 0.95 seconds
Rate = 6485.60 MB/s

I would guess we'll need to do a coforall. Something like:

coforall loc in Locales do on loc {
  var myVals = vals[vals.localSubdomain()];
  // TODO scan and compute local start / lengths 
  var agg = newDstAggregator(2*uint);
  for (h, i, l) in zip (myHashes, mySegs, myLengths) {
    agg.copy(h, sipHash128(myVals, i..#l));
  }
  agg.flush();
}

@ronawho
Contributor

ronawho commented Feb 4, 2020

A different option might be to pad so that all chunks have uniform size, but that is likely to have an unacceptably high memory overhead.

@ronawho
Contributor

ronawho commented Feb 4, 2020

master...ronawho:more-opt-sip-hash-4 uses bulk comm to GET remote messages instead of GET'ing them one element at a time. This has better performance than the on-stmt version, but is still ~3.5x slower than the fixed version at 32 nodes:
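The bulk-GET idea boils down to something like the following sketch (assuming arkouda's sipHash128(arr, range) returning a 2*uint, as used earlier in this thread; the helper name is made up):

proc hashOneMessage(vals: [] uint(8), r: range): 2*uint {
  // Copy the (possibly remote) message bytes into a local buffer with a single
  // array assignment -- one bulk GET (or a local memcpy) instead of one GET per
  // element -- then hash the local copy.
  var localBuf: [0..#r.size] uint(8);
  localBuf = vals[r];
  return sipHash128(localBuf, 0..#r.size);
}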

$ ./SipHashSpeedTest -nl 32 --NINPUTS=10_000_000_001 --INPUTSIZE=64 --LENGTHS=fixed
Hashed 10000000001 blocks (640000000064 bytes) in 0.95 seconds
Rate = 642431.37 MB/s

$ ./SipHashSpeedTest -nl 32 --NINPUTS=10_000_000_001 --INPUTSIZE=64 --LENGTHS=variable
Hashed 10000000001 blocks (645160996021 bytes) in 3.09 seconds
Rate = 198835.72 MB/s

This might be ok as a stepping stone if you need better performance soon, but I think the manual coforall will probably still be needed for best performance. The bulk-GET version works ok for this test because, like you said, most of the data is still local, but I think that only happens here because the size per locale is still relatively uniform. If you have real-world data where 90% of the strings are small and fit on a single node and the rest are spread out across the other nodes, I would expect this to have poor locality.

I'm timing out on this for the next few days. Let me know if you'd like me to try and PR this improvement or hold off until I have time to look at the manual coforall version.

@reuster986
Collaborator

@ronawho I think it's worth making a PR with what you have now. I will then be able to take some measurements and talk to users to see if diving into the manual coforall is worthwhile.

@ronawho
Contributor

ronawho commented Feb 12, 2020

Sounds good. I will clean up that branch a bit and PR.


The length calculation being slow is a bit subtle, so I wanted to call out the reason. The problem is that in SipHashSpeedTest, the calculation uses local domains: https://github.com/mhmerrill/arkouda/blob/5e1581a48e7a1eadb85059d437c276b51e2e5b2c/test/SipHashSpeedTest.chpl#L38-L40

Since chapel-lang/chapel#12931, slice expressions are governed by the slicing domain. Because the domains used to slice here are local, your length calculation is local, which means everything runs on locale 0 using GETs/PUTs.

I believe the actual length calculation in arkouda is slicing with ranges:
https://github.com/mhmerrill/arkouda/blob/5e1581a48e7a1eadb85059d437c276b51e2e5b2c/src/SegmentedArray.chpl#L337-L348

Today, this gives you distributed slices, but I'm not sure that will be true longer term. @bradcray will know more about that though.


Here are a couple variations for computing the length (not positive they're all correct, but I think they are):

use RandArray;
use Time;

config const NINPUTS = 100_000;
config const INPUTSIZE = 8;

enum operation {loop, domainSlice, rangeSlice, interior};

const logMean:real = log(INPUTSIZE:real)/2;
const logStd:real = sqrt(2*logMean);
var (segs, vals) = newRandStringsLogNormalLength(NINPUTS, logMean, logStd);
const D = segs.domain;
var lengths: [D] int;

for op in operation {
  resetCommDiagnostics(); startCommDiagnostics();
  var t = new Timer(); t.start();

  if op == operation.loop {
    forall (l, s, i) in zip(lengths, segs, D) {
      if i == D.high then l = vals.size - s;
                     else l = segs[i+1] - s;
    }
  } else if op == operation.domainSlice {
    lengths[{D.low..D.high-1}] = segs[{D.low+1..D.high}] - segs[{D.low..D.high-1}];
    lengths[D.high] = vals.size - segs[D.high];
  } else if op == operation.rangeSlice {
    lengths[D.low..D.high-1] = segs[D.low+1..D.high] - segs[D.low..D.high-1];
    lengths[D.high] = vals.size - segs[D.high];
  } else if op == operation.interior {
    const s = D.size-1;
    lengths[D.interior(-s)] = segs[D.interior(s)] - segs[D.interior(-s)];
    lengths[D.high] = vals.size - segs[D.high];
  }

  t.stop();
  stopCommDiagnostics();

  writef("%-12s: %.3dr first-%s last-%s\n", op:string, t.elapsed(), getCommDiagnostics()[0]:string, getCommDiagnostics()[numLocales-1]:string);
}

Run on 16 nodes of an InfiniBand cluster:

$ ./SipHashSpeedTest -nl 16
loop        : 0.001 first-(get = 8, execute_on_nb = 15) last-(execute_on_fast = 1)
domainSlice : 1.263 first-(get = 187703, put = 93750, execute_on = 12) last-(<no communication>)
rangeSlice  : 0.010 first-(get = 2, put = 1, execute_on = 18, execute_on_nb = 105) last-(get = 177, put = 3, execute_on_fast = 7)
interior    : 0.009 first-(get = 2, put = 1, execute_on = 24, execute_on_nb = 105) last-(get = 177, put = 3, execute_on_fast = 7)

The local domain slice is really bad since everything is running on locale 0. The others are all good. The loop is fastest since there's no distributed domain creation, but I would expect the performance difference to be negligible compared to the time to hash. I'd probably avoid the rangeSlice, since I believe we expect that to behave like a local domain in the future.

ronawho added a commit to ronawho/arkouda that referenced this issue Feb 12, 2020
Improve comm diags and speed up length calculation (Bears-R-Us#266 (comment))
@reuster986
Collaborator

Wow, that domain slicing is a real doozie. We use it all over the arkouda codebase (although I'm not sure how much of it is non-local), and I had no idea it was so bad. Thanks for pointing this out!

@bradcray
Contributor

Wow, that domain slicing is a real doozie.

This makes me want to cry (the state of things, not what you said :) ). I was on a tear to make Chapel slicing much cheaper a while back, but then had to put it aside to work on Chapel 2.0 issues and other "distractions."

Speaking of that effort, @ronawho, did you try getting timings with the chpl_serializeSlices flag (which you may remember made many cases much better, but a few a bit worse, so we didn't turn it on by default)?

@ronawho
Contributor

ronawho commented Feb 12, 2020

-schpl_serializeSlices doesn't help in this case, because it still results in a local domain driving the main loop and you really want it to be distributed. I.e., for

lengths[{D.low..D.high-1}] = segs[{D.low+1..D.high}] - segs[{D.low..D.high-1}];

you really want that iteration to be distributed.

@bradcray
Contributor

Got it, thanks for clarifying (I've only given this the most cursory look so far).

@ronawho
Contributor

ronawho commented Feb 12, 2020

With #287 in, here is my current understanding/summary:

  • Initially, local serial performance was ~8 MB/s; it is now ~250 MB/s for 8-byte messages and ~2 GB/s for 1024-byte messages. This is on par with the C reference you ported from, so I'm happy with that.
  • Initially, distributed performance was pretty bad. I think it is now near optimal for fixed-size 64-byte messages (~500 GB/s at 32 locales), and within ~3x of that for variable-size ~64-byte messages (~200 GB/s at 32 locales). For the variable-size test, generally speaking each locale has roughly the same number of segments to operate on, so only a small-ish amount of data is remote.

I have some concern that highly variable message sizes where most data is remote will hurt performance a lot and may require the manual coforall version. I think we're waiting for results on real data at this point to gauge if that's needed, so I'm going to un-assign myself, but definitely let me know what you find for operating on real data.

@ronawho ronawho removed their assignment Feb 12, 2020
@reuster986
Collaborator

On real data, it's looking like the hashing step is only taking about 5% of the overall sorting time, so I'm going to call this good for now, with the option to revisit later.

@ronawho
Contributor

ronawho commented Feb 9, 2022

Just for archival purposes: @reuster986 switched to the manual coforall version in #1060.
