Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request: looser lifetime for loading from byte slice #92

Closed
rw opened this issue Jan 13, 2020 · 10 comments
Closed

Request: looser lifetime for loading from byte slice #92

rw opened this issue Jan 13, 2020 · 10 comments

Comments

@rw
Copy link

rw commented Jan 13, 2020

Currently, the primary way to load an fst object (Map/Set/Raw) from a byte slice is to use from_static_slice. While that solves the use case of loading data that lives for the lifetime of the program, it appears to be overly-restrictive for other use cases.

For example, I have fst data that I've written into a network buffer:

  • I do not want to give ownership of that buffer to fst, so using from_bytes won't work.
  • The fst data is not 'static, so from_static_slice won't work.
  • The fst data is a subslice of a greater slice (the network buffer), so to make it work with from_shared_bytes I would need to change my internal APIs to use Arc<Vec<u8>> instead of &[u8] (which may add bugs and performance regressions) .

While I could try to use the existing mmap API, I believe I would need to create a file-like object that wraps my byte slice, and then memory map it, which is unergonomic and error-prone.

My recommendation is to add a from_slice(bytes: &[u8]) constructor to Map, Set, and Raw. My understanding is that this is unfortunately not trivial, because the FstData type does not take a lifetime parameter, which would be required while it holds a reference to the byte slice.

Nevertheless, is there interest in adding this feature?

Thanks!

@BurntSushi
Copy link
Owner

My recommendation is to add a from_slice(bytes: &[u8]) constructor to Map, Set, and Raw. My understanding is that this is unfortunately not trivial, because the FstData type does not take a lifetime parameter, which would be required while it holds a reference to the byte slice.

Right. This is basically why it's not provided. That lifetime parameter will then infect everything in the public APIs of this crate. IIRC, when I was i initially designing the API, I could not get it to work. This was years ago, so I don't remember precisely why, but I wouldn't be surprised if there was some poor interaction between said lifetime parameter and streaming iterators.

On the other hand, if you could get it to work, then I might be willing to entertain that, since I recognize that the current API is a bit too restrictive in that sense. With that said, if that extra lifetime parameter leads to a lot of additional API complexity, then I'm not sure if it's worth it.

Another, more lofty, idea I've had for a while is to perhaps abandon the formulation of a streaming iterator in this crate as it exists now, and perhaps use something more like a cursor API. But I haven't tried this yet, so I don't know how well it would work. And it's a much bigger change.

@BurntSushi
Copy link
Owner

I do not want to give ownership of that buffer to fst, so using from_bytes won't work.

Could you say more about why? Certainly, you could "just" clone the bytes and give fst ownership over that. You'd suffer the additional allocation, but it's not clear why that's a deal breaker here. I could envision some scenarios where this would use more memory than necessary (e.g., if the original byte slice needs to continue living). In terms of time though, I'd imagine that FST construction itself would dwarf the time it takes to copy the bytes.

@rw
Copy link
Author

rw commented Jan 13, 2020

Could you say more about why? Certainly, you could "just" clone the bytes and give fst ownership over that.

@BurntSushi I have many fsts and they are short-lived. Their lifecycles are short enough that we would find it onerous to allocate a copy of the fst data each time we want to read them. They aren't stored in files.

@BurntSushi
Copy link
Owner

I'm just trying to brainstorm work arounds for you. Doing that requires understanding more about what you're doing. Because even if there is a fix for this, I don't see it happening any time soon.

How big are your fsts?

@rw
Copy link
Author

rw commented Jan 14, 2020

@BurntSushi Thanks for trying to help me find a quick solution! My fsts have single-digit millions of elements, and the size of the underlying data is between 10MiB and 1GiB (with a median around 100MiB). I haven't benchmarked how much compression I get from fst generation. I see two paths to an immediate solution, neither of which require changes to the fst library:

  1. Use Vecs. This is safe but causes allocator pressure.
  2. Unsafely modify the type of my byte slice to have the 'static lifetime, and then provide that to the from_static_slice method. By using a wrapper struct that holds a reference to the original byte slice, as well as to the fst object, I think the correctness argument is straightforward.

I think I like (2), because it's simple and high-performance. WDYT?

@BurntSushi
Copy link
Owner

Yeah, I would probably go with (2) if trying (1) did indeed produce a suboptimal result.

@rw
Copy link
Author

rw commented Jan 14, 2020

Thanks!

@Kerollmops
Copy link
Contributor

Just to help a little this conversation to move forward, I have tried to clone an FST of 400MiB each time I needed to search in it and it was around 200-400ms. I then made an unsafe transmute to make my slice appear like a static one and it was 50us to load it from LMDB. This is unsafe but it work great.

@BurntSushi
Copy link
Owner

I think for now, the best solution to this problem is probably to unfortunately use unsafe. Transmuting a &'a [u8] to a &'static [u8] should be safe so long as you ensure that the original source of the slice doesn't outlive the &'static [u8] slice.

I'll continue to brainstorm about this, but I think both of the solutions discussed above are major changes to the library that are unlikely to happened, especially given that I'm not sure if they would work or not. In other words, I think something this is only going to get fixed if someone builds a new library or if i happen to get motivated to do some lengthy experimentation. So I'm going to close this for now.

@BurntSushi
Copy link
Owner

OK, I'm going to try fixing this by making FSTs generic over AsRef<[u8]>. Tantivy's fork of fst does this, so now I know it's possible. That would fix this issue and also partially address the reason for Tantivy's fork in the first place.

BurntSushi added a commit that referenced this issue Feb 24, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
BurntSushi added a commit that referenced this issue Feb 25, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
BurntSushi added a commit that referenced this issue Mar 6, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
BurntSushi added a commit that referenced this issue Mar 6, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
BurntSushi added a commit that referenced this issue Mar 7, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
BurntSushi added a commit that referenced this issue Mar 7, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
BurntSushi added a commit that referenced this issue Mar 7, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants