Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make string interner and backends compatible with more types #42

Closed
wants to merge 5 commits into from

Conversation

jedel1043
Copy link

@jedel1043 jedel1043 commented Feb 21, 2022

This PR rewrites the interner to be generic, making it compatible with many more types.

  • Now the interner explicitly needs a backend parameter, because the type checker cannot resolve &str as the input argument to str based backends. (There are more than one impl AsRef for the &str type). One solution would be to restrict T to Borrow, with the downside of making it less ergonomic to pass a &[String] to the extend method of the interner. I'd like to hear your ideas :)
  • Had to add a couple of custom trait bounds, one of them being unsafe by the nature of the BucketBackend.
  • Some bounds are overly restrictive, like using <S as ToOwned>::Owned as the buffer type of the StringBackend, but that made the API a bit more cleaner with less generic parameters. Would it be preferrable as it is or do you prefer more general types?

Comments are very much appreciated :)

@Robbepop
Copy link
Owner

Robbepop commented Feb 21, 2022

Hi and thanks for your PR! Looks interesting!
Usually for bigger changes it is nice to have an issue up before a PR emerges to allow for discussion beforehand.

I have yet to grasp what it actually provides users of this library in essence. So will it be possible to use C-style strings or OsString with StringInterner or is this even more generic? Some motivation where you describe the state of the string-interner crate, its limitations and problems and what this PR fixes (and evtl. how) would be very much approeciated.

I also conducted the compiled cargo doc of this PR and saw some new types. Most notably Span in backend::string and backend::bucket and the FixedString and FixedVec in backend::bucket. Would be nice if you could describe why this PR requires us to expand the API of the crate.

I also conducted some benchmarks using cargo bench and unfortunately found that this PR worsens the performance of the bucket backend and string backend significantly in some cases.

@jedel1043
Copy link
Author

jedel1043 commented Feb 21, 2022

Thank you for the quick response!

Usually for bigger changes it is nice to have an issue up before a PR emerges to allow for discussion beforehand.

Noted for the next time!

I have yet to grasp what it actually provides users of this library in essence. So will it be possible to use C-style strings or OsString with StringInterner or is this even more generic? Some motivation where you describe the state of the string-interner crate, its limitations and problems and what this PR fixes (and evtl. how) would be very much appreciated.

In Boa, a Javascript interpreter, we use string-interner to intern string literals embedded in the code. However, we're transitioning to an UTF-16 based string and this interner is currently unable to intern arbitrary [T] slices. (Specifically we need support for [u16] slices)
I thought of just providing a [u16] implementation PR and calling it a day, but since I saw that in #39 you had a discussion about allowing another string type, I just thought that it would be ideal to make the interner and the backends generic to any custom string.
This PR provides full capabilities for using the interner with any custom type, provided that the user implements the correct traits for its custom type. This also makes it very easy to extend the interner to be compatible with OsStr, CStr and some others, you just implement the required traits and that's it!.

I also conducted the compiled cargo doc of this PR and saw some new types. Most notably Span in backend::string and backend::bucket and the FixedString and FixedVec in backend::bucket. Would be nice if you could describe why this PR requires us to expand the API of the crate.

IIRC Span was already in backend::String, I just reordered a bit the exports because of the new trait associated with the BufferBackend, and to have symmetry with each backend.

The most notable additions to the API are FixedContainer, FixedVec and Sliced.

FixedContainer is a generalization of FixedString, a container fixed in capacity that doesn't reallocate on each push. Had to make it unsafe because there's currently no way to know whether a reallocation of the container occurred, so it must be responsibility of the user to guarantee that invariant.

FixedString already existed, just exposed it to be explicitly passable as a generic parameter for BufferBackend.

FixedVec is the one that allows the user to utilize the backend with any [T] slice, you just need to provide the type of the slice. It's pretty similar to FixedString, but without the cast from [u8] to str. Of course, the user could have implemented FixedVec by its own, but I think providing common containers for some std types would be better for ergonomics.
Also, I realized the trait could be implemented for the raw Vec and String types, but I left it as it is until I heard your thoughts about it.

Finally, Sliced is just a trait to represent a type T that is bidirectionally convertible to a [U] slice. This bound was needed to generalize the StringBackend and the BufferBackend. It isn't marked as unsafe because the implementor can assume that the provided input &[U] is a valid representation of T, since the backends only convert full strings between each representation. Implemented it for str and [T] and should be very easy to implement for more std string types.

I also conducted some benchmarks using cargo bench and unfortunately found that this PR worsens the performance of the bucket backend and string backend significantly in some cases.

That's... weird. Maybe the generics are adding additional indirection to the trait methods? Probably inlining should fix that.
May I know specifically which benchmarks have regressed?

@Robbepop
Copy link
Owner

Thanks for your elaborate explanations!

That's... weird. Maybe the generics are adding additional indirection to the trait methods? Probably inlining should fix that.
May I know specifically which benchmarks have regressed?

I think it is best if you just run the benchmarks on your own computer to get a feeling. Just checkout master branch, run cargo bench, then checkout your branch rerun cargo bench and see the difference immediate in your console.
The regressions on my system were quite significant in the range of 50%-100%. Maybe you are right that it can be fixed with inline.

With respect to the suggested API I am always looking for the simplest design. Yet, I think having all this flexibility with Sliced, FixedVec and exposing internal types such as Span won't get us there. The ideal API was to just introduce a single generic parameter to the Backend trait with a default to String (or str depending on the design) and maybe some trait that defines the behavior that must be fulfilled by such a String type that can be used by the string-interner crate. That's at least how I imagine the simplest API for this feature.

@jedel1043
Copy link
Author

jedel1043 commented Feb 21, 2022

I think it is best if you just run the benchmarks on your own computer to get a feeling. Just checkout master branch, run cargo bench, then checkout your branch rerun cargo bench and see the difference immediate in your console. The regressions on my system were quite significant in the range of 50%-100%. Maybe you are right that it can be fixed with inline.

Will do!

... and exposing internal types such as Span ...

Sorry! I Exposed Span by accident when I marked the whole module as pub. Will remove the keyword from the declaration.

With respect to the suggested API I am always looking for the simplest design. Yet, I think having all this flexibility with Sliced, FixedVec and exposing internal types such as Span won't get us there. The ideal API was to just introduce a single generic parameter to the Backend trait with a default to String (or str depending on the design) and maybe some trait that defines the behavior that must be fulfilled by such a String type that can be used by the string-interner crate. That's at least how I imagine the simplest API for this feature.

Ah, so unify the requirements of all backends into a single trait? I could do that for most of the backends.
However, buffer is another beast, because it requires handling specific invariants for its containers. I CAN make it require <S as ToOwned>::Owned: FixedContainer, which will make it able to pass str as a parameter, but I'd need to delete the safer FixedString for a direct implementation of the trait for String. I think that's okay, but I'd like to hear what you think.

If that's the plan, then the exposed types are a lot less; only FixedContainer and a new Internable trait encompassing all the requirements for all the backends. Is that okay?
Also, should I add the <S as ToOwned>::Owned: FixedContainer bound for the new Internable trait? or should I bound it only in the BufferBackend? I think the separation would be good, because implementing Internable is safe, while implementing FixedContainer is inherently unsafe.

@jedel1043
Copy link
Author

jedel1043 commented Feb 22, 2022

On another note, I had to adjust the allocation benchmarks for the BucketBackend, because calling shrink_to_fit results in undefined behaviour. The cause was the definition of shrink_to_fit:

fn shrink_to_fit(&mut self) {
self.spans.shrink_to_fit();
self.head.shrink_to_fit();
self.full.shrink_to_fit();
}

Calling self.head.shrink_to_fit was reallocating head and invalidating previously created NonNulls, so I removed that call:

fn shrink_to_fit(&mut self) {
self.spans.shrink_to_fit();
self.full.shrink_to_fit();
}

And obviously, fixing the bug increased the memory consumption, so I had to adjust the expected min and max memory used.

@Razican
Copy link

Razican commented Mar 18, 2022

@Robbepop did you have the chance to check the latest changes?

@Robbepop
Copy link
Owner

Robbepop commented Mar 19, 2022

@Robbepop did you have the chance to check the latest changes?

Sure but what does @jedel1043 think about the state of this PR?

If that's the plan, then the exposed types are a lot less; only FixedContainer and a new Internable trait encompassing all the requirements for all the backends. Is that okay?

I'd like for FixedContainer not to be exposed. What is the deal with it anyways?
The Internable trait should actually be called String since I suppose that it is implemented by string types in order to fit into the scheme.
I hope you can understand that I won't merge this if it introduces performance or memory usage regressions since I believe it is very well possible to not introduce those while making the StringInterner work for more string types.

On another note, I had to adjust the allocation benchmarks for the BucketBackend, because calling shrink_to_fit results in undefined behaviour. The cause was the definition of shrink_to_fit:

Can you elaborate on this? I don't see how this results in undefined behavior atm. The memory consumption regressions for the BucketBackend are extreme.

@jedel1043
Copy link
Author

Can you elaborate on this? I don't see how this results in undefined behavior atm. The memory consumption regressions for the BucketBackend are extreme.

Sure :)
The definition of the BucketBackend is as follows:

pub struct BucketBackend<S = DefaultSymbol> {
spans: Vec<InternedStr>,
head: FixedString,
full: Vec<String>,
marker: PhantomData<fn() -> S>,
}

This backend works by creating several FixedString buffers of a fixed size, storing them in a Vec<String> and creating NonNull<str> references (abstracted by InternedString) to each of them in order to obtain the interned strs.

However, FixedString must always ensure that it won't be moved to another memory location, otherwise the previously created NonNull<str> references will point to unallocated memory (use after free).

Unfortunately, when we call the shrink_to_fit method from the backend, it internally calls the shrink_to_fit method from FixedString:

/// Shrink capacity to fit the contents exactly.
pub fn shrink_to_fit(&mut self) {
self.contents.shrink_to_fit();
}

And calling shrink_to_fit on a String CAN move its contents to another memory location to reduce its allocated space. From the definition of Vec::shrink_to_fit:

    pub fn shrink_to_fit(&mut self) {
        // The capacity is never less than the length, and there's nothing to do when
        // they are equal, so we can avoid the panic case in `RawVec::shrink_to_fit`
        // by only calling it with a greater capacity.
        if self.capacity() > self.len {
            self.buf.shrink_to_fit(self.len);
        }
    }

This calls RawVec::shrink_to_fit:

    pub fn shrink_to_fit(&mut self, cap: usize) {
        handle_reserve(self.shrink(cap));
    }

Which calls RawVec::shrink:

    fn shrink(&mut self, cap: usize) -> Result<(), TryReserveError> {
        assert!(cap <= self.capacity(), "Tried to shrink to a larger capacity");


        let (ptr, layout) = if let Some(mem) = self.current_memory() { mem } else { return Ok(()) };


        let ptr = unsafe {
            // `Layout::array` cannot overflow here because it would have
            // overflowed earlier when capacity was larger.
            let new_layout = Layout::array::<T>(cap).unwrap_unchecked();
            self.alloc
                .shrink(ptr, layout, new_layout)
                .map_err(|_| AllocError { layout: new_layout, non_exhaustive: () })?
        };
        self.set_ptr_and_cap(ptr, cap);
        Ok(())
    }
}

Which calls Allocator::shrink, and in the implementation of shrink for Global, the default global allocator, in the branch corresponding to the case where new_layout.size() < old_layout.size():

            new_size => unsafe {
                let new_ptr = self.allocate(new_layout)?;
                ptr::copy_nonoverlapping(ptr.as_ptr(), new_ptr.as_mut_ptr(), new_size);
                self.deallocate(ptr, old_layout);
                Ok(new_ptr)
            },

Oops, the old memory is deallocated! This causes a use after free and results in undefined behaviour.

@jedel1043
Copy link
Author

I'd like for FixedContainer not to be exposed.

I don't think that's possible. You cannot require that a generic type ought to be an implementor of a certain trait without exposing said trait. The alternative would be to maintain a different definition of each backend for each string type, which is not ideal.

The Internable trait should actually be called String since I suppose that it is implemented by string types in order to fit into the scheme.

I don't think is a good idea to name a trait as a specific type of the standard library. Aside from the name conflict, it would be very confusing if a trait, which is implemented by types, has the same name as a specific type. What about InternableString?

@jedel1043
Copy link
Author

jedel1043 commented Apr 27, 2022

@Robbepop Did you have the chance to read my explanation and comments?
Also, if you're still unsure about this PR, I could reduce its complexity by completely removing the ability to use the interner with custom user types, making it only able to pass a certain list of types (mainly str, [T], CStr, Path, OStr).

bors bot pushed a commit to boa-dev/boa that referenced this pull request Jun 30, 2022
…rner` (#2147)

So, @raskad and myself had a short discussion about the state of #736, and we came to the conclusion that it would be a good time to implement our own string interner; partly because the `string-interner` crate is a bit unmaintained (as shown by Robbepop/string-interner#42 and Robbepop/string-interner#47), and partly because it would be hard to experiment with custom optimizations for UTF-16 strings. I still want to thank @Robbepop for the original implementation though, because some parts of this design have been shamelessly stolen from it 😅.

Having said that, this PR is a complete reimplementation of the interner, but with some modifications to (hopefully!) make it a bit easier to experiment with UTF-16 strings, apply optimizations, and whatnot :)
@Robbepop
Copy link
Owner

Robbepop commented May 1, 2024

I think I would like to keep the StringInterner implementation as simple as possible given that it had some UB findings in the past and I want to avoid to repeat that. If we are able to find a generic implementation that is kind of simple enough I am happy to re-evaluate this.

@Robbepop Robbepop closed this May 1, 2024
@jedel1043 jedel1043 deleted the generic_backends branch May 1, 2024 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants