Feature request: external memory construction / partitioned hash #1
Comments
Yeah, it would for sure be good to support this, but it will take time. Most likely I'll be excited about this for a few weeks and then it will slowly turn into a chore, but at that point pull requests will also be welcome of course :) Anyway, feel free to make all the feature requests you can think of -- having new things to work on is always good, and if I don't get to them, documenting that fact also doesn't hurt :P
Alright, so as said, this is now ready for testing. Re. this specific issue: as you wrote, there are two options here:
The current architecture is partitioned by default into ~250k sized chunks, so both options could work. To me option 2 sounds nicer, as it avoids writing a lot of data to disk that's only read once, and especially when keys are simply <=64-bit kmers, streaming files and hashing kmers will probably be fast anyway. Specifically, for option 2 you could stream all keys and store only those whose hash belongs in one of the first K parts, where K is such that they are expected to occupy ~32GB of memory (i.e. ~4*10^9 elements). Do you have any insight into the tradeoffs between the options? Option 2 should be straightforward to support by creating a function that constructs from a cloneable iterator, while option 1 sounds a bit more involved.
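The multi-pass idea above (option 2) could be sketched roughly as follows. This is an illustration, not PTRHash's actual API: `hash_key`, `shard_of`, and `collect_shard` are hypothetical names, and `DefaultHasher` stands in for whatever hash function the library really uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a key to a 64-bit value (DefaultHasher is a stand-in).
fn hash_key<K: Hash>(key: &K) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

// Map the hash to a shard via a multiply-shift on the high bits, so each
// shard owns a contiguous range of the hash space.
fn shard_of(hash: u64, num_shards: u64) -> u64 {
    ((hash as u128 * num_shards as u128) >> 64) as u64
}

// One pass over the key iterator, keeping only the keys of shard `k`.
fn collect_shard<K: Hash>(keys: impl Iterator<Item = K>, num_shards: u64, k: u64) -> Vec<K> {
    keys.filter(|key| shard_of(hash_key(key), num_shards) == k)
        .collect()
}

fn main() {
    let num_shards = 4;
    let keys = 0u64..1000;
    // The cloneable iterator is re-run once per shard, i.e. the input is
    // streamed `num_shards` times in total.
    let total: usize = (0..num_shards)
        .map(|k| collect_shard(keys.clone(), num_shards, k).len())
        .sum();
    assert_eq!(total, 1000); // every key lands in exactly one shard
}
```

Because each shard is a contiguous hash range, only one shard's keys are ever resident in memory at a time, at the cost of re-reading the input once per shard.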
So I think from the perspective of a user of this library, either of the options is likely to work well. I'd probably ping @jermp as to the relevant tradeoffs between the two approaches. I think you covered the main qualities of the second approach quite well. In general, I can think of only one strong benefit of the first approach: in that case, the actual input to the hash function construction can be streamed. For example, you could imagine streaming a large file over the network and reading and sharding the keys directly into on-disk buckets. In that case, you don't even need to look at the input again (it could be on the network, or streamed in via a pipe or process substitution, etc.). However, for constructing a really large hash, I don't know how realistic a use case this is, or how desirable this property would be.
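The single-pass variant (option 1) could look something like this sketch: hash each key exactly once while the input streams by, and append the hash to the writer of its shard. The names are hypothetical, and in-memory buffers stand in for the per-shard files that would be used in practice.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::io::{self, Write};

// Stream the keys once, hash each key a single time, and append the
// 8-byte hash to the writer of its shard. In practice each writer would
// be a buffered file on disk; the input is never revisited.
fn shard_stream<K: Hash, W: Write>(
    keys: impl Iterator<Item = K>,
    shards: &mut [W],
) -> io::Result<()> {
    let n = shards.len() as u64;
    for key in keys {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let hash = h.finish();
        // High bits select the shard: contiguous ranges of hash space.
        let s = ((hash as u128 * n as u128) >> 64) as usize;
        shards[s].write_all(&hash.to_le_bytes())?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // In-memory buffers stand in for per-shard files on disk here.
    let mut shards: Vec<Vec<u8>> = vec![Vec::new(); 4];
    shard_stream(0u64..1000, &mut shards)?;
    let total: usize = shards.iter().map(|s| s.len() / 8).sum();
    assert_eq!(total, 1000); // one 8-byte hash per key, written once
    Ok(())
}
```

The tradeoff is exactly as described: the input can be a one-shot stream (network, pipe), but the full set of hashes is written to disk even though each shard is read back only once.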
SigStore for sux-rs does exactly this, and it's ready to use. I regularly build static functions with 10^11 keys in less than three hours and with little memory usage beyond the function itself. The store keeps track of 128-bit keys, but I guess that should be enough for PTRHash.
Ok, I'll try to implement option 2. Indeed, memory usage for this should be limited to the size of the output data structure plus the size of a shard. @vigna with 'this' you also meant option 2, right?
You can do both with SigStore. |
With SigStore you mean this, right: It would be helpful if you could publish a version to crates.io; I don't think crates.io accepts git dependencies, in case I want to publish there myself.
I know. sux should be published real soon... |
I think option 1 is better (and more elegant) for real-world key sets of, say, billions of keys. PTHash-HEM implements this strategy and it works very well: keys are hashed once and for all, and the hashes are written to disk in partitions. Then we load into memory as many partitions as we can without exceeding the user's specified RAM limit. Say we can load T partitions: we process each partition in parallel, so we use T threads for building. Otherwise we'd have to go for a monolithic MPHF with a much more involved parallel construction algorithm.
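The loading loop described above can be sketched as a greedy batching step. This is an illustration of the idea, not the actual PTHash-HEM code; `batch_partitions` and the size units are made up for the example.

```rust
// Greedily group partitions into batches that fit under `ram_limit`:
// each batch is then built with one thread per partition (T threads for
// a batch of T partitions).
fn batch_partitions(sizes: &[u64], ram_limit: u64) -> Vec<Vec<usize>> {
    let mut batches = Vec::new();
    let mut current = Vec::new();
    let mut used = 0u64;
    for (i, &size) in sizes.iter().enumerate() {
        // Flush the current batch if this partition would exceed the
        // limit (a single oversized partition still gets its own batch).
        if used + size > ram_limit && !current.is_empty() {
            batches.push(std::mem::take(&mut current));
            used = 0;
        }
        current.push(i);
        used += size;
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

fn main() {
    // Four 10 GB partitions under a 25 GB limit: two batches of two,
    // i.e. T = 2 building threads per batch.
    let batches = batch_partitions(&[10, 10, 10, 10], 25);
    assert_eq!(batches, vec![vec![0, 1], vec![2, 3]]);
}
```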
I think we should be precise about exactly how many keys we are all thinking of :) Each shard should be able to handle something like 4 billion keys. That means that for <=10-40 billion keys, option 2 of reading multiple times sounds pretty neat to me, as the MPHF itself will fit in memory anyway. But indeed, once the number of keys becomes much larger (>=40 billion) this sounds infeasible and writing hashes to disk becomes nicer. At this point one would probably also want to put the MPHF itself on disk. Note that PTRHash uses relatively small partitions of ~200k keys already, so each shard would contain many parts, and parallelization should be independent of the chosen sharding method.
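Concretely, under option 2 the number of shards equals the number of streaming passes over the input, which at ~4 billion keys per shard is just a ceiling division (the constants here are the rough figures from the discussion, not exact library values):

```rust
// Passes over the input = number of shards = ceil(total / per_shard).
fn num_passes(total_keys: u64, keys_per_shard: u64) -> u64 {
    (total_keys + keys_per_shard - 1) / keys_per_shard
}

fn main() {
    let per_shard = 4_000_000_000u64;
    // 10 billion keys: 3 passes over the input; 40 billion: 10 passes,
    // which is roughly where re-reading stops being attractive.
    assert_eq!(num_passes(10_000_000_000, per_shard), 3);
    assert_eq!(num_passes(40_000_000_000, per_shard), 10);
}
```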
I've pushed support for both sharding options now. By default it uses 2^33 keys/shard, corresponding to roughly 32GB of memory. So I think anything up to 2^40 keys should actually work now (apart from 128-bit hashing, #4), as long as you have enough space to store the ~2.5 bits/key needed for the main data structure itself, with enough buffer left to process each shard.
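For a sense of scale, the ~2.5 bits/key figure above pins down the size of the final structure (the helper name is made up, and the constant is the rough figure quoted, not an exact guarantee):

```rust
// Bytes needed for the final structure at a given bits/key rate.
fn mphf_bytes(keys: u64, bits_per_key: f64) -> u64 {
    (keys as f64 * bits_per_key / 8.0).ceil() as u64
}

fn main() {
    // 2^40 keys at 2.5 bits/key = 5 * 2^36 bytes = 320 GiB, which is why
    // storage for the structure itself becomes the binding constraint.
    let bytes = mphf_bytes(1u64 << 40, 2.5);
    assert_eq!(bytes, 5 * (1u64 << 36));
}
```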
I don't have a meaningful comment here other than bravo! That was super quick ;P. We can test the two strategies now. Any thoughts on when 128-bit hash support might land? Then we can index all the things!
Thanks! 128-bit hashes will come tomorrow. It's basically trivial to add once I find a good default hash function.
Not sure if you are considering feature requests, so if not please just let me know and I won't bug you here until you're ready ;P.
One huge use case for MPHFs is when you cannot fit all of the keys in RAM in their naive representation (e.g. kmers from a compacted dBG). In this case, it's an essential feature of MPHF construction that keys can reside on disk and be visited in multiple passes during construction. Alternatively (or additionally), I know the C++ version of pthash also allows partitioned MPHFs, where a top-level function maps the key space to separate, distinct MPHFs (this need not be explicit and could be e.g. the minimizer of a kmer). This allows working with independent subsets of the keys at a time.
It would be great to have such a feature in the rust implementation of pthash!
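A toy illustration of the two-level partitioned scheme requested above: an outer function (here `key % num_parts`, standing in for e.g. a k-mer minimizer) routes each key to an independent per-part index, and per-part offsets keep the global indices disjoint. The per-part "MPHF" is simulated by a sorted Vec with binary search, not a real MPHF, and all names are hypothetical.

```rust
// Outer partitioner: a stand-in for a minimizer-based mapping.
fn partition(key: u64, num_parts: usize) -> usize {
    (key % num_parts as u64) as usize
}

struct PartitionedIndex {
    parts: Vec<Vec<u64>>, // sorted keys per part (stand-in for sub-MPHFs)
    offsets: Vec<usize>,  // prefix sums of part sizes
}

impl PartitionedIndex {
    fn build(keys: &[u64], num_parts: usize) -> Self {
        let mut parts = vec![Vec::new(); num_parts];
        for &k in keys {
            parts[partition(k, num_parts)].push(k);
        }
        let mut offsets = Vec::with_capacity(num_parts);
        let mut off = 0;
        for p in &mut parts {
            p.sort_unstable();
            offsets.push(off);
            off += p.len();
        }
        Self { parts, offsets }
    }

    // Global index = offset of the key's part + rank within the part,
    // so each part can be built (and stored) independently.
    fn index(&self, key: u64) -> usize {
        let p = partition(key, self.parts.len());
        self.offsets[p] + self.parts[p].binary_search(&key).unwrap()
    }
}

fn main() {
    let keys: Vec<u64> = (0..100).collect();
    let idx = PartitionedIndex::build(&keys, 4);
    // The composite mapping is a bijection onto 0..100.
    let mut out: Vec<usize> = keys.iter().map(|&k| idx.index(k)).collect();
    out.sort_unstable();
    assert_eq!(out, (0..100).collect::<Vec<usize>>());
}
```

The point of the offsets is that the parts never interact: each subset of keys can be processed, built, and even stored on disk on its own.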