Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: ε-serde support #2

Closed
vigna opened this issue Oct 1, 2023 · 17 comments
Closed

Feature request: ε-serde support #2

vigna opened this issue Oct 1, 2023 · 17 comments

Comments

@vigna
Copy link
Contributor

vigna commented Oct 1, 2023

It would be fantastic if the implementation could support ε-serde as that would make a breeze to map into memory large PTHash instances (or load them with transparent huge pages). For that however you would need a supporting version of EF, which you can find in sux-rs, albeit the library is kinda in progress.

In any case, once there's a working version we will try to pull together a PR.

I spoke with Giulio in July about a Rust port of PTHash and I'm really happy someone is working on this. There's presently nothing better in the MPHF field. We're trying to move the Software Heritage infrastructure to Rust and the missing piece is a good MPHF implementation for very large key sets.

@RagnarGrootKoerkamp
Copy link
Owner

Hey! I saw read your r/rust post on epserde yesterday :) Good stuff.

I read the readme more closely now and if I understand correctly the idea is that most big chunks of data live in mapped memory, while metadata lives in actual memory?
I'm happy to give this a try once the library is more mature, but also happy to work together on it or to accept a PR. (Anyway I expect my motivation to drop a bit once the low level algorithmic part is done and it moves more to infrastructure, so at that point some external input would probably be helpful ;)

@vigna
Copy link
Contributor Author

vigna commented Oct 1, 2023

Yes, exactly. We are a bit obsessed with performance so existing frameworks wouldn't do it for us.

Anyway, all you have to do is to use our EF and have it as a parameter. In fact, it is super easy to piggyback ε-serde once you're satisfied with the results.

@vigna
Copy link
Contributor Author

vigna commented Oct 1, 2023

And yes, we're absolutely available for collaboration. We even had a now dead pthash branch where @zommiommy was starting a port.

@RagnarGrootKoerkamp
Copy link
Owner

I think the code is now almost ready for usage (or at least testing) by others!

When you say 'very large key sets', exactly how large do you mean?
At <3bits/key, 64GB ram is sufficient for 20*10^9 keys, which sounds plenty to me, but maybe you're looking at even larger datasets?
In that case, my storage most relies on two vectors:

  1. A Vec<u8> of pilots; this should be simple to mmap (and I assume you already support this with ep-serde). Do you already support some kind of prefetching for this? (Or is that implicit via mmap?) Especially when reads are from (slow) disks, prefetching will be important for good throughput.
  2. A Vec<TinyEf> encoding 44 consecutive values into a cacheline. TineyEf is simply a size 64byte struct, so this should also be easy to support I suppose.

Specifically:

  • What are your keys? 64bit ints, or strings of >8 chars?
  • Up to how many keys do you have?

@vigna
Copy link
Contributor Author

vigna commented Nov 12, 2023

When you say 'very large key sets', exactly how large do you mean?

Well, as a test a trillion integer keys wouldn't be bad! :)

  1. A Vec<u8> of pilots; this should be simple to mmap (and I assume you already support this with ep-serde). Do you already support some kind of prefetching for this? (Or is that implicit via mmap?) Especially when reads are from (slow) disks, prefetching will be important for good throughput.

Do you mean at construction time or after deserialization?

  1. A Vec<TinyEf> encoding 44 consecutive values into a cacheline. TineyEf is simply a size 64byte struct, so this should also be easy to support I suppose.

It looks like a ZeroCopy struct, so that should be trivial.

  • What are your keys? 64bit ints, or strings of >8 chars?

It really depends—but I guess the only thing you need is a trait for things that can be hashed, right? Like ToSig in sux-rs.

  • Up to how many keys do you have?

Software Heritage has a few dozen billions, but I routinely test with hundreds of billions and for an MPH a trillion is reachable with 1TB of RAM.

@vigna
Copy link
Contributor Author

vigna commented Nov 12, 2023

Actually, our main problem is nightly. How difficult do you think it would be to make it work with stable?

@vigna
Copy link
Contributor Author

vigna commented Nov 12, 2023

Well, also, it does not compile on Apple Silicon. :)

@vigna
Copy link
Contributor Author

vigna commented Nov 12, 2023

Ouch. 2^32 max keys. I just read now. :(

@RagnarGrootKoerkamp
Copy link
Owner

Thanks for trying it out!

  1. The vec<u8> is the main datastructure resulting from construction that needs to be deserialized for queries.
  2. Yes, TinyEf is just simple data, so mmapping that should be simple indeed.
  3. So currently I only support in-ram construction, but see Feature request : external memory construction / partitioned hash #1 for that.
  4. Also, I only do 64-bit hashes currently. As long as your keys are <=64bit, this hash is invertible and collision free, and things will probably work for >2^32 keys once I add construction using external memory. But if you use longer string keys, I expect 128bit hashes will be very much needed. Could you be a bit more specific than 'it depends'? ;)
  5. Yeah TinyEf uses this cool PDEP instruction to find the kth 1-bit in a mask to implement select() efficiently. I'll add a fallback implementation for other platforms.

@zommiommy
Copy link

we also use pdep in our elias-fano and for portability we wrote https://github.com/zommiommy/common_traits/blob/main/src/select_in_word.rs

@RagnarGrootKoerkamp
Copy link
Owner

@zommiommy Yes I was just looking at it and trying to use it as well. But I'm still affected by zommiommy/common_traits#1 so I can't.

(Side note: I think common_traits is a bit too generic of a name, and not very descriptive of it's usage either. But might be too late to change now.)

@zommiommy
Copy link

ok, I'll fix it asap, yeah naming things is still the hardest problem in computer science lol

@vigna
Copy link
Contributor Author

vigna commented Nov 12, 2023

Probably the actual traits (LimitedSizeInteger etc.) should be in common_traits, but the rest elsewhere—it's really algorithmic code, not traits in the sense of "trait design".

@vigna
Copy link
Contributor Author

vigna commented Nov 12, 2023

  1. The vec<u8> is the main datastructure resulting from construction that needs to be deserialized for queries.

Yep, I'm already able to serialize and deserialize (I'll update you later) but now I must implement Packed for &[u8] or the deserialized structure won't work.

  1. So currently I only support in-ram construction, but see Feature request : external memory construction / partitioned hash #1 for that.

Yep, I commented there that probably SigStore is what you want.

  1. Also, I only do 64-bit hashes currently. As long as your keys are <=64bit, this hash is invertible and collision free, and things will probably work for >2^32 keys once I add construction using external memory. But if you use longer string keys, I expect 128bit hashes will be very much needed. Could you be a bit more specific than 'it depends'? ;)

It depends on the application—hashes should work on any kind of data. SWH uses strings thou. In fact minimal perfect hashes of integers are a rare use case in my experience.

If I understand correctly, you have a bijection u64 <-> u64 that you use "as if" the output was random. For strings, etc. you'll need to start from 128-bit hashes—that's how VFunc in Rust and all other various minimal perfect hashes in Sux4J are implemented.

@vigna
Copy link
Contributor Author

vigna commented Nov 12, 2023

  1. Yeah TinyEf uses this cool PDEP instruction to find the kth 1-bit in a mask to implement select() efficiently. I'll add a fallback implementation for other platforms.

You are invited to copy the code in common_traits :).

@RagnarGrootKoerkamp
Copy link
Owner

RagnarGrootKoerkamp commented Nov 12, 2023

  • I removed all dependencies on nightly features, so it should now build on stable.
  • I copied over select_in_word, so compilation should work on macbook as well now.

Re keys:

  • So as long as keys fit in 64 bits the default hash function (and probably most hash functions) are invertible. This is sufficient when working with up to 32-long kmers.
  • But yes, I'll need to add an 128-bit hash variant as well. Will work on it.

@RagnarGrootKoerkamp
Copy link
Owner

Closed by #3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants