Feature request: ε-serde support #2
Comments
Hey! I read your r/rust post on epserde yesterday :) Good stuff. I've now read the README more closely, and if I understand correctly, the idea is that most big chunks of data live in mapped memory, while metadata lives in actual memory?
Yes, exactly. We are a bit obsessed with performance, so existing frameworks wouldn't do it for us. Anyway, all you have to do is use our EF and have it as a parameter. In fact, it is super easy to piggyback ε-serde once you're satisfied with the results.
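To make the "bulk data in mapped memory, metadata in actual memory" idea concrete, here is a minimal sketch in plain `std` Rust. It is not the ε-serde API: `Header`, `View`, `view` and `as_bytes` are hypothetical names, the byte layout (one native-endian `u64` length followed by the payload) is an assumption for illustration, and the backing buffer would normally come from an mmap rather than an in-memory array.

```rust
use std::mem::{align_of, size_of, size_of_val};
use std::slice;

/// Small metadata, copied into ordinary memory when the structure is loaded.
pub struct Header {
    pub len: usize,
}

/// A view whose bulk data is borrowed from the backing buffer, not copied.
pub struct View<'a> {
    pub header: Header,
    pub data: &'a [u64],
}

/// Reinterprets `bytes` as `[len, payload...]` without copying the payload.
/// Returns `None` if the buffer is misaligned, truncated, or inconsistent.
pub fn view(bytes: &[u8]) -> Option<View<'_>> {
    let misaligned = bytes.as_ptr() as usize % align_of::<u64>() != 0;
    if misaligned || bytes.len() % size_of::<u64>() != 0 {
        return None;
    }
    // SAFETY: alignment and length were checked above, every bit pattern is a
    // valid u64, and the result borrows from `bytes` for the same lifetime.
    let words: &[u64] = unsafe {
        slice::from_raw_parts(bytes.as_ptr() as *const u64, bytes.len() / size_of::<u64>())
    };
    let len = *words.first()? as usize;
    let data = words.get(1..)?.get(..len)?;
    Some(View { header: Header { len }, data })
}

/// Helper for the example: views a `u64` slice as raw bytes.
pub fn as_bytes(words: &[u64]) -> &[u8] {
    // SAFETY: u8 has alignment 1 and the byte length covers exactly `words`.
    unsafe { slice::from_raw_parts(words.as_ptr() as *const u8, size_of_val(words)) }
}
```

Only the header is copied; `data` points straight into the backing buffer, which is what makes loading a very large structure cheap.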
And yes, we're absolutely available for collaboration. We even had a now-dead pthash branch where @zommiommy was starting a port.
I think the code is now almost ready for usage (or at least testing) by others! When you say 'very large key sets', exactly how large do you mean?
Specifically:
Well, as a test a trillion integer keys wouldn't be bad! :)
Do you mean at construction time or after deserialization?
It looks like a ZeroCopy struct, so that should be trivial.
It really depends—but I guess the only thing you need is a trait for things that can be hashed, right? Like ToSig in sux-rs.
Software Heritage has a few dozen billion, but I routinely test with hundreds of billions, and for an MPH a trillion is reachable with 1TB of RAM.
Actually, our main problem is nightly. How difficult do you think it would be to make it work with stable?
Well, also, it does not compile on Apple Silicon. :)
Ouch. 2^32 max keys. I just read that now. :(
Thanks for trying it out!
We also use pdep in our Elias-Fano, and for portability we wrote https://github.com/zommiommy/common_traits/blob/main/src/select_in_word.rs
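For readers unfamiliar with the operation: select-in-word returns the position of the k-th set bit of a machine word. On x86-64 with BMI2 it can be computed with `pdep` (deposit a single bit at index `rank` into the set bits of the word, then count trailing zeros). The following is a minimal portable sketch of the same operation, not the code in `common_traits`; the function name and signature are assumptions, and the loop is far slower than either `pdep` or a broadword implementation.

```rust
/// Returns the position of the `rank`-th (0-based) set bit of `word`,
/// or `None` if `word` has at most `rank` set bits.
pub fn select_in_word(mut word: u64, rank: u32) -> Option<u32> {
    if rank >= word.count_ones() {
        return None;
    }
    // Clear the `rank` lowest set bits; at least one set bit remains,
    // so `word - 1` never underflows.
    for _ in 0..rank {
        word &= word - 1;
    }
    Some(word.trailing_zeros())
}
```

For example, the set bits of `0b1011_0100` are at positions 2, 4, 5 and 7, so `select_in_word(0b1011_0100, 1)` is `Some(4)`.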
@zommiommy Yes I was just looking at it and trying to use it as well. But I'm still affected by zommiommy/common_traits#1 so I can't. (Side note: I think |
OK, I'll fix it ASAP. Yeah, naming things is still the hardest problem in computer science lol
Probably the actual traits (LimitedSizeInteger etc.) should be in common_traits, but the rest elsewhere; it's really algorithmic code, not traits in the sense of "trait design".
Yep, I'm already able to serialize and deserialize (I'll update you later) but now I must implement Packed for &[u8] or the deserialized structure won't work.
Yep, I commented there that probably SigStore is what you want.
It depends on the application: hashes should work on any kind of data. SWH uses strings, though. In fact, minimal perfect hashes of integers are a rare use case in my experience. If I understand correctly, you have a bijection u64 <-> u64 that you use "as if" the output was random. For strings etc., you'll need to start from 128-bit hashes; that's how VFunc in Rust and all the other minimal perfect hashes in Sux4J are implemented.
You are invited to copy the code in common_traits :).
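A rough sketch of what "start from 128-bit hashes" could look like as a trait over arbitrary keys. This is not the `ToSig` trait from sux-rs: `ToSignature` and `to_signature` are hypothetical names, and the two 64-bit halves are produced with `std`'s `DefaultHasher` purely as a stand-in for a real 128-bit hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical trait: anything that can be reduced to a seeded
/// 128-bit signature, represented here as two 64-bit words.
pub trait ToSignature {
    fn to_signature(&self, seed: u64) -> [u64; 2];
}

impl<T: Hash + ?Sized> ToSignature for T {
    fn to_signature(&self, seed: u64) -> [u64; 2] {
        // Stand-in only: two independently tweaked 64-bit hashes.
        // `DefaultHasher`'s algorithm is unspecified and may change between
        // Rust releases, so it is unsuitable for structures stored on disk.
        let half = |tweak: u64| {
            let mut hasher = DefaultHasher::new();
            seed.hash(&mut hasher);
            tweak.hash(&mut hasher);
            self.hash(&mut hasher);
            hasher.finish()
        };
        [half(0), half(1)]
    }
}
```

With a blanket implementation like this, strings and integers go through the same path (`"some key".to_signature(seed)` or `42u64.to_signature(seed)`), and the construction only ever sees signatures rather than the original keys.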
Re keys:
Closed by #3
It would be fantastic if the implementation could support ε-serde, as that would make it a breeze to map large PTHash instances into memory (or load them with transparent huge pages). For that, however, you would need a supporting version of EF, which you can find in sux-rs, albeit the library is kinda in progress.
In any case, once there's a working version we will try to pull together a PR.
I spoke with Giulio in July about a Rust port of PTHash and I'm really happy someone is working on this. There's presently nothing better in the MPHF field. We're trying to move the Software Heritage infrastructure to Rust and the missing piece is a good MPHF implementation for very large key sets.