
Feature request : external memory construction / partitioned hash #1

Closed
rob-p opened this issue Sep 22, 2023 · 13 comments

@rob-p

rob-p commented Sep 22, 2023

I'm not sure if you are considering feature requests yet; if not, please just let me know and I won't bug you here until you're ready ;P.

One huge use case for MPHFs is when you cannot fit all of the keys in RAM in their naive representation (e.g. kmers from a compacted dBG). In this case, it's essential that keys can reside on disk and be visited in multiple passes during construction. Alternatively (or additionally), I know the C++ version of pthash also allows partitioned MPHFs, where a top-level function maps the key space to separate, distinct MPHFs (this need not be explicit and could be e.g. the minimizer of a kmer). This allows working on independent subsets of the keys, one subset at a time.

It would be great to have such a feature in the Rust implementation of pthash!

@RagnarGrootKoerkamp
Owner

Yeah, it would for sure be good to support this, but it will take time.

Most likely I'll be excited about this for a few weeks and then it'll slowly turn into a chore, but at that point pull requests will also be welcome, of course :)

Anyway, feel free to make all the feature requests you can think of -- having new things to work on is always good, and if I don't get to them, having them documented doesn't hurt either :P

@RagnarGrootKoerkamp
Owner

RagnarGrootKoerkamp commented Nov 11, 2023

Alright, so as said, this is now ready for testing. Regarding this specific issue:

As you wrote, there are two options here:

  1. Read keys once, shard hashes into multiple partitions, where each partition is written to its own file.
  2. Read keys multiple times, and in each pass only process a subset of keys.

The current architecture is already partitioned by default into chunks of ~250k keys, so both options could work. To me, option 2 sounds nicer, as it avoids writing a lot of data to disk that is only read once, and especially when keys are simply <=64-bit kmers, streaming the files and hashing the kmers will probably be fast anyway.

Specifically, for option 2 you could stream all keys and store only those whose hash belongs in one of the first K parts, where K is such that they are expected to occupy ~32GB of memory (i.e. ~4*10^9 elements).
As long as the total number of keys is not orders of magnitude more than that, streaming the keys ~10 times sounds reasonable.

Do you have any insight into the tradeoffs between the options? Option 2 should be straightforward to support by adding a function that constructs from a cloneable iterator, while option 1 sounds a bit more involved.
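
A minimal sketch of what option 2 could look like, for illustration only: `hash_key`, `NUM_SHARDS`, and `build_shard` are hypothetical placeholders, not the ptr-hash API. Each pass streams the full key set but only keeps in memory the hashes that fall into the current shard.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Number of shards; in practice chosen so that one shard of hashes fits in RAM.
const NUM_SHARDS: u64 = 4;

fn hash_key<K: Hash>(key: &K) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

/// Placeholder for the per-shard construction (bucket assignment, pilot search, ...).
fn build_shard(shard: u64, hashes: &[u64]) {
    println!("shard {shard}: {} hashes", hashes.len());
}

fn build_multi_pass<K, I>(keys: I)
where
    K: Hash,
    I: Iterator<Item = K> + Clone,
{
    for shard in 0..NUM_SHARDS {
        // One full pass over the keys per shard; only hashes belonging to the
        // current shard are kept. (A real implementation would shard by hash
        // range rather than by modulus.)
        let hashes: Vec<u64> = keys
            .clone()
            .map(|k| hash_key(&k))
            .filter(|h| h % NUM_SHARDS == shard)
            .collect();
        build_shard(shard, &hashes);
    }
}

fn main() {
    // Any cheaply cloneable iterator works, e.g. one that re-reads a file per pass.
    build_multi_pass((0u64..1_000_000).map(|i| i.wrapping_mul(0x9E37_79B9_7F4A_7C15)));
}
```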

@rob-p
Author

rob-p commented Nov 12, 2023

So I think from the perspective of a user of this library, either of the options is likely to work well. I'd probably ping @jermp as to the relevant tradeoffs between the two approaches.

I think you covered the main qualities of the second approach quite well. In general, I can think of only one strong benefit of the first approach: in that case, the actual input to the hash function construction can itself be a stream. For example, you could imagine streaming a large file over the network and reading and sharding the keys directly into on-disk buckets. In that case, you don't even need to look at the input again (it could be on the network, or streamed in via a pipe or process substitution, etc.). However, for constructing a really large hash, I don't know how realistic a use case this is, or how desirable this property would be.
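
A correspondingly minimal sketch of option 1, again with hypothetical names (`hash_key`, `NUM_SHARDS`, `shard_to_disk`) rather than the ptr-hash API: the keys are consumed exactly once, so they can come from a network stream or a pipe, and only the per-shard hash files are read back during construction.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::{Hash, Hasher};
use std::io::{self, BufWriter, Write};
use std::path::Path;

const NUM_SHARDS: u64 = 4;

fn hash_key<K: Hash>(key: &K) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

/// Hash each key once and append the hash to its shard's file on disk.
fn shard_to_disk<K: Hash>(keys: impl Iterator<Item = K>, dir: &Path) -> io::Result<()> {
    let mut writers = Vec::with_capacity(NUM_SHARDS as usize);
    for s in 0..NUM_SHARDS {
        let file = File::create(dir.join(format!("shard_{s}.bin")))?;
        writers.push(BufWriter::new(file));
    }
    for key in keys {
        let h = hash_key(&key);
        let shard = (h % NUM_SHARDS) as usize; // real code would shard by hash range
        writers[shard].write_all(&h.to_le_bytes())?;
    }
    for w in &mut writers {
        w.flush()?;
    }
    // Each shard_{s}.bin can now be loaded (or memory-mapped) independently and
    // fed to the per-shard construction.
    Ok(())
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir();
    shard_to_disk((0u64..1_000_000).map(|i| i.wrapping_mul(0x9E37_79B9_7F4A_7C15)), &dir)
}
```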

@vigna
Contributor

vigna commented Nov 12, 2023

SigStore in sux-rs does exactly this, and it's ready to use. I regularly build static functions with 10^11 keys in less than three hours and with little memory usage beyond the function itself. The store keeps track of 128-bit keys, but I guess that should be enough for PTRHash.

@RagnarGrootKoerkamp
Owner

Ok, I'll try to implement option 2.

Indeed, memory usage for this should be limited to the size of the output data structure plus the size of a shard.

@vigna, with 'this' you also meant option 2, right?

@vigna
Contributor

vigna commented Nov 12, 2023

You can do both with SigStore.

@RagnarGrootKoerkamp
Owner

RagnarGrootKoerkamp commented Nov 12, 2023

With SigStore you mean this, right:
https://github.com/vigna/sux-rs/blob/main/src/utils/sig_store.rs
(and not https://www.sigstore.dev/).

It would be helpful if you could publish a version to crates.io; I don't think crates.io accepts git dependencies, in case I want to publish there myself.

@vigna
Contributor

vigna commented Nov 12, 2023

I know. sux should be published real soon...

@jermp

jermp commented Nov 12, 2023

  • Read keys once, shard hashes into multiple partitions, where each partition is written to its own file.
  • Read keys multiple times, and in each pass only process a subset of keys.

I think Option 1 is better (and more elegant) for real-world key sets of, say, billions of keys. PTHash-HEM implements this strategy and it works very well: keys are hashed once and for all, and the hashes are written to disk in partitions. Then we load into memory as many partitions as we can without exceeding the user-specified RAM limit. Say we can load T partitions: we process those partitions in parallel, so we use T threads for building. Otherwise we would have to go for a monolithic MPHF with much more involved parallel construction algorithms.
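
A rough sketch of the batching step described above, with placeholders only (`load_partition` and `build_partition` are hypothetical, and this is neither PTHash-HEM nor the ptr-hash API): load as many partitions as fit in the RAM budget, then build them in parallel with one thread per loaded partition.

```rust
use std::thread;

// User-specified RAM budget for in-flight partitions, e.g. 32 GiB.
const RAM_BUDGET_BYTES: usize = 32 << 30;

/// Placeholder: read the hashes of partition `id` back from its file on disk.
fn load_partition(id: usize) -> Vec<u64> {
    vec![id as u64; 1_000_000]
}

/// Placeholder: run the per-partition MPHF construction.
fn build_partition(id: usize, hashes: &[u64]) {
    println!("built partition {id} with {} hashes", hashes.len());
}

fn build_in_batches(partition_sizes: &[usize]) {
    let mut next = 0;
    while next < partition_sizes.len() {
        // Greedily load the next T partitions whose hashes fit in the RAM budget.
        let mut batch = Vec::new();
        let mut bytes = 0;
        while next < partition_sizes.len() {
            let need = partition_sizes[next] * std::mem::size_of::<u64>();
            if !batch.is_empty() && bytes + need > RAM_BUDGET_BYTES {
                break;
            }
            bytes += need;
            batch.push((next, load_partition(next)));
            next += 1;
        }
        // Build the T loaded partitions in parallel, one thread each.
        thread::scope(|s| {
            for (id, hashes) in &batch {
                s.spawn(move || build_partition(*id, hashes));
            }
        });
    }
}

fn main() {
    build_in_batches(&[4_000_000, 6_000_000, 5_000_000, 3_000_000]);
}
```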

@RagnarGrootKoerkamp
Owner

RagnarGrootKoerkamp commented Nov 12, 2023

I think we should be precise about exactly how many keys we are all thinking of :)

Each shard should be able to handle something like 4 billion keys. That means that for <=10-40 billion keys, option 2 of reading multiple times sounds pretty neat to me, as the MPHF itself will fit in memory anyway.

But indeed, once the number of keys becomes much larger (>=40 billion), this sounds infeasible and writing hashes to disk becomes nicer. At that point one would probably also want to store the MPHF itself on disk.
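
(Back-of-the-envelope, for scale: at ~4·10^9 keys per shard, 40·10^9 keys means roughly 10 passes over the input under option 2, and at the ~2.5 bits/key mentioned below the resulting MPHF is only about 12.5 GB, so it does indeed fit in memory.)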

Note that PTRHash already uses relatively small partitions of ~200k keys, so each shard would contain many parts, and parallelization should be independent of the chosen sharding method.

@RagnarGrootKoerkamp
Owner

RagnarGrootKoerkamp commented Nov 13, 2023

I've pushed support for both sharding options now. By default it uses 2^33 keys/shard corresponding to roughly 32GB of memory.

Set PTRHashParams::shard_to_disk to false for multiple passes in memory, or to true to write each shard to a file in /tmp/ (or inside the directory given by the TMPDIR environment variable, if set).
See here for tests that use it.

So I think anything up to 2^40 keys should actually work now (apart from 128-bit hashing, #4), as long as you have enough space to store the ~2.5 bits/key needed for the main data structure itself, with enough buffer left to process each shard.
I've only tested up to 2^33 so far though -- turns out my hard drive is basically O(1) * main memory size 😅.
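
(Again back-of-the-envelope: 2^40 keys at ~2.5 bits/key is about 2.75·10^12 bits, i.e. roughly 344 GB (~320 GiB) for the data structure itself, before counting the working memory for the shard currently being built.)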

@rob-p
Author

rob-p commented Nov 13, 2023

I don't have a meaningful comment here other than: bravo! That was super quick ;P. We can test the two strategies now.

Any thoughts on when 128-bit hash support might hit? Then we can index all the things!

@RagnarGrootKoerkamp
Owner

Thanks!

128-bit hashes will come tomorrow. It's basically trivial to add once I find a good default hash function.
Mostly I have to think about whether I want to keep supporting 64-bit hashes in some way as well.
