
Feature request : external memory construction / partitioned hash #1

Closed
rob-p opened this issue Sep 22, 2023 · 13 comments

@rob-p

rob-p commented Sep 22, 2023

I'm not sure if you are considering feature requests yet; if not, please just let me know and I won't bug you here until you're ready ;P.

One huge use case for MPHFs is when you cannot fit all of the keys in RAM in their naive representation (e.g. kmers from a compacted dBG). In this case, it's essential that keys can reside on disk and be visited in multiple passes during construction. Alternatively (or additionally), I know the C++ version of pthash also allows partitioned MPHFs, where a top-level function maps the key space to separate, distinct MPHFs (this need not be explicit and could be e.g. the minimizer of a kmer). This allows working on independent subsets of the keys, one subset at a time.

It would be great to have such a feature in the Rust implementation of pthash!

@RagnarGrootKoerkamp
Owner

Yeah, it would for sure be good to support this, but it will take time.

Most likely I'll be excited about this for a few weeks and then it'll slowly turn into a chore, but at that point pull requests will also be welcome, of course :)

Anyway, feel free to make all the feature requests you can think of -- having new things to work on is always good, and if I don't get to them, having them documented doesn't hurt either :P

@RagnarGrootKoerkamp
Owner

RagnarGrootKoerkamp commented Nov 11, 2023

Alright, so as said, this is now ready for testing. Regarding this specific issue:

As you wrote, there are two options here:

  1. Read keys once, shard hashes into multiple partitions, where each partition is written to its own file.
  2. Read keys multiple times, and in each pass only process a subset of keys.

The current architecture is already partitioned by default into chunks of ~250k keys, so both options could work. To me, option 2 sounds nicer, as it avoids writing a lot of data to disk that is only read once, and especially when keys are simply <=64-bit kmers, streaming the files and hashing the kmers will probably be fast anyway.

Specifically, for option 2 you could stream all keys and store only those whose hash belongs in one of the first K parts, where K is such that they are expected to occupy ~32GB of memory (i.e. ~4*10^9 elements).
As long as the total number of keys is not orders of magnitude more than that, streaming the keys ~10 times sounds reasonable.

Do you have any insight into the tradeoffs between the options? Option 2 should be straightforward to support by adding a function that constructs from a cloneable iterator, while option 1 sounds a bit more involved.
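
A minimal sketch of what option 2 could look like, for illustration only: `hash_key`, `NUM_SHARDS`, and `build_shard` are hypothetical placeholders, not the ptr-hash API. Each pass streams the full key set but only keeps in memory the hashes that fall into the current shard.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Number of shards; in practice chosen so that one shard of hashes fits in RAM.
const NUM_SHARDS: u64 = 4;

fn hash_key<K: Hash>(key: &K) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

/// Placeholder for the per-shard construction (bucket assignment, pilot search, ...).
fn build_shard(shard: u64, hashes: &[u64]) {
    println!("shard {shard}: {} hashes", hashes.len());
}

fn build_multi_pass<K, I>(keys: I)
where
    K: Hash,
    I: Iterator<Item = K> + Clone,
{
    for shard in 0..NUM_SHARDS {
        // One full pass over the keys per shard; only hashes belonging to the
        // current shard are kept. (A real implementation would shard by hash
        // range rather than by modulus.)
        let hashes: Vec<u64> = keys
            .clone()
            .map(|k| hash_key(&k))
            .filter(|h| h % NUM_SHARDS == shard)
            .collect();
        build_shard(shard, &hashes);
    }
}

fn main() {
    // Any cheaply cloneable iterator works, e.g. one that re-reads a file per pass.
    build_multi_pass((0u64..1_000_000).map(|i| i.wrapping_mul(0x9E37_79B9_7F4A_7C15)));
}
```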

@rob-p
Author

rob-p commented Nov 12, 2023

So I think from the perspective of a user of this library, either of the options is likely to work well. I'd probably ping @jermp as to the relevant tradeoffs between the two approaches.

I think you covered the main qualities of the second approach quite well. In general, I can think of only one strong benefit of the first approach: in that case, the actual input to the hash function construction can itself be a stream. For example, you could imagine streaming a large file over the network and reading and sharding the keys directly into on-disk buckets. In that case, you don't even need to look at the input again (it could be on the network, or streamed in via a pipe or process substitution, etc.). However, for constructing a really large hash, I don't know how realistic a use case this is, or how desirable this property would be.
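
A correspondingly minimal sketch of option 1, again with hypothetical names (`hash_key`, `NUM_SHARDS`, `shard_to_disk`) rather than the ptr-hash API: the keys are consumed exactly once, so they can come from a network stream or a pipe, and only the per-shard hash files are read back during construction.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::{Hash, Hasher};
use std::io::{self, BufWriter, Write};
use std::path::Path;

const NUM_SHARDS: u64 = 4;

fn hash_key<K: Hash>(key: &K) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

/// Hash each key once and append the hash to its shard's file on disk.
fn shard_to_disk<K: Hash>(keys: impl Iterator<Item = K>, dir: &Path) -> io::Result<()> {
    let mut writers = Vec::with_capacity(NUM_SHARDS as usize);
    for s in 0..NUM_SHARDS {
        let file = File::create(dir.join(format!("shard_{s}.bin")))?;
        writers.push(BufWriter::new(file));
    }
    for key in keys {
        let h = hash_key(&key);
        let shard = (h % NUM_SHARDS) as usize; // real code would shard by hash range
        writers[shard].write_all(&h.to_le_bytes())?;
    }
    for w in &mut writers {
        w.flush()?;
    }
    // Each shard_{s}.bin can now be loaded (or memory-mapped) independently and
    // fed to the per-shard construction.
    Ok(())
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir();
    shard_to_disk((0u64..1_000_000).map(|i| i.wrapping_mul(0x9E37_79B9_7F4A_7C15)), &dir)
}
```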

@vigna
Contributor

vigna commented Nov 12, 2023

SigStore in sux-rs does exactly this, and it's ready to use. I regularly build static functions with 10^11 keys in less than three hours and with little memory usage beyond the function itself. The store keeps track of 128-bit keys, but I guess that should be enough for PTRHash.

@RagnarGrootKoerkamp
Owner

Ok, I'll try to implement option 2.

Indeed, memory usage for this should be limited to the size of the output data structure plus the size of a shard.

@vigna, with 'this' you also meant option 2, right?

@vigna
Contributor

vigna commented Nov 12, 2023

You can do both with SigStore.

@RagnarGrootKoerkamp
Owner

RagnarGrootKoerkamp commented Nov 12, 2023

With SigStore you mean this, right:
https://github.com/vigna/sux-rs/blob/main/src/utils/sig_store.rs
(and not https://www.sigstore.dev/).

It would be helpful if you could publish a version to crates.io; I don't think crates.io accepts git dependencies, in case I want to publish there myself.

@vigna
Contributor

vigna commented Nov 12, 2023

I know. sux should be published real soon...

@jermp

jermp commented Nov 12, 2023

  • Read keys once, shard hashes into multiple partitions, where each partition is written to its own file.
  • Read keys multiple times, and in each pass only process a subset of keys.

I think Option 1 is better (and more elegant) for real-world key sets of, say, billions of keys. PTHash-HEM implements this strategy and it works very well: keys are hashed once and for all, and the hashes are written to disk in partitions. Then we load into memory as many partitions as we can without exceeding the user-specified RAM limit. Say we can load T partitions: we process those partitions in parallel, so we use T threads for building. Otherwise we would have to go for a monolithic MPHF with much more involved parallel construction algorithms.
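
A rough sketch of the batching step described above, with placeholders only (`load_partition` and `build_partition` are hypothetical, and this is neither PTHash-HEM nor the ptr-hash API): load as many partitions as fit in the RAM budget, then build them in parallel with one thread per loaded partition.

```rust
use std::thread;

// User-specified RAM budget for in-flight partitions, e.g. 32 GiB.
const RAM_BUDGET_BYTES: usize = 32 << 30;

/// Placeholder: read the hashes of partition `id` back from its file on disk.
fn load_partition(id: usize) -> Vec<u64> {
    vec![id as u64; 1_000_000]
}

/// Placeholder: run the per-partition MPHF construction.
fn build_partition(id: usize, hashes: &[u64]) {
    println!("built partition {id} with {} hashes", hashes.len());
}

fn build_in_batches(partition_sizes: &[usize]) {
    let mut next = 0;
    while next < partition_sizes.len() {
        // Greedily load the next T partitions whose hashes fit in the RAM budget.
        let mut batch = Vec::new();
        let mut bytes = 0;
        while next < partition_sizes.len() {
            let need = partition_sizes[next] * std::mem::size_of::<u64>();
            if !batch.is_empty() && bytes + need > RAM_BUDGET_BYTES {
                break;
            }
            bytes += need;
            batch.push((next, load_partition(next)));
            next += 1;
        }
        // Build the T loaded partitions in parallel, one thread each.
        thread::scope(|s| {
            for (id, hashes) in &batch {
                s.spawn(move || build_partition(*id, hashes));
            }
        });
    }
}

fn main() {
    build_in_batches(&[4_000_000, 6_000_000, 5_000_000, 3_000_000]);
}
```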

@RagnarGrootKoerkamp
Owner

RagnarGrootKoerkamp commented Nov 12, 2023

I think we should be precise about exactly how many keys we are all thinking of :)

Each shard should be able to handle something like 4 billion keys. That means that for <=10-40 billion keys, option 2 of reading multiple times sounds pretty neat to me, as the MPHF itself will fit in memory anyway.

But indeed, once the number of keys becomes much larger (>=40 billion), this sounds infeasible and writing hashes to disk becomes nicer. At that point one would probably also want to store the MPHF itself on disk.
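
(Back-of-the-envelope, for scale: at ~4·10^9 keys per shard, 40·10^9 keys means roughly 10 passes over the input under option 2, and at the ~2.5 bits/key mentioned below the resulting MPHF is only about 12.5 GB, so it does indeed fit in memory.)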

Note that PTRHash already uses relatively small partitions of ~200k keys, so each shard would contain many parts, and parallelization should be independent of the chosen sharding method.

@RagnarGrootKoerkamp
Owner

RagnarGrootKoerkamp commented Nov 13, 2023

I've pushed support for both sharding options now. By default it uses 2^33 keys/shard corresponding to roughly 32GB of memory.

Set PTRHashParams::shard_to_disk to false for multiple passes in memory, or to true to write each shard to a file in /tmp/ (or inside the directory given by the TMPDIR environment variable, if set).
See here for tests that use it.

So I think anything up to 2^40 keys should actually work now (apart from 128-bit hashing, #4), as long as you have enough space to store the ~2.5 bits/key needed for the main data structure itself, with enough buffer left to process each shard.
I've only tested up to 2^33 so far though -- turns out my hard drive is basically O(1) * main memory size 😅.
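
(Again back-of-the-envelope: 2^40 keys at ~2.5 bits/key is about 2.75·10^12 bits, i.e. roughly 344 GB (~320 GiB) for the data structure itself, before counting the working memory for the shard currently being built.)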

@rob-p
Author

rob-p commented Nov 13, 2023

I don't have a meaningful comment here other than: bravo! That was super quick ;P. We can test the two strategies now.

Any thoughts on when 128-bit hash support might hit? Then we can index all the things!

@RagnarGrootKoerkamp
Owner

Thanks!

128-bit hashes will come tomorrow. It's basically trivial to add once I find a good default hash function.
Mostly I have to think about whether I want to keep supporting 64-bit hashes in some way as well.
