
Question: For what functionality does bstr need regex-automata and lazy_static? #53

Closed · tbu- opened this issue Apr 28, 2020 · 6 comments

@tbu- commented Apr 28, 2020

unicode = ["lazy_static", "regex-automata"]

I'm considering using this crate, but it seems to have some fairly heavy dependencies by default (I'm aware that I can turn them off). What are these dependencies used for?

@BurntSushi (Owner)

For Unicode handling, as the feature name suggests. :-) It's also documented in the README.

lazy_static isn't particularly heavy. regex-automata might be, but bstr disables its default features, and its only non-optional dependency is byteorder. When regex-automata is compiled without its default features, it becomes quite lightweight: all it has is the DFA search runtime, and all the DFA building code falls away.

If you're looking for more specifics, then regex-automata is used to implement the grapheme/word/sentence segmentation algorithms. For example, here's the regex for grapheme segmentation. Those regexes are compiled into DFAs and embedded into the executable. They are then loaded via lazy_static.
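To make that mechanism concrete, here is a minimal sketch of the pattern, assuming regex-automata 0.1's unsafe SparseDFA::from_bytes and a hypothetical pre-generated grapheme_fwd.dfa file; it illustrates the approach rather than reproducing bstr's actual generated code.

```rust
use lazy_static::lazy_static;
use regex_automata::{SparseDFA, DFA};

lazy_static! {
    // Hypothetical file name: the DFA bytes are generated ahead of time and
    // embedded directly into the executable. Real code also has to pick the
    // serialization that matches the target's endianness.
    static ref GRAPHEME_FWD: SparseDFA<&'static [u8], u32> = unsafe {
        // No parsing and no allocation here: the embedded bytes are simply
        // reinterpreted as the DFA's transition table.
        SparseDFA::from_bytes(include_bytes!("grapheme_fwd.dfa"))
    };
}

/// Length in bytes of the leading grapheme of `haystack`, if there is one.
fn first_grapheme_len(haystack: &[u8]) -> Option<usize> {
    GRAPHEME_FWD.find(haystack)
}
```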

@BurntSushi (Owner) commented Apr 28, 2020

I don't see a ton of room for improvement here, to be honest. Pretty much any kind of Unicode handling is always going to require a bit of fat somewhere. In this case, regex-automata is no worse, and perhaps even better, than what unicode-segmentation does. Namely, there are no separate Unicode tables. Instead, everything is built right into the automaton, which is also minimized (via Hopcroft's algorithm). Combined with a sparse representation (which bstr uses for the bigger regexes), I'm fairly sure you're getting close to the minimal amount of space needed to implement these algorithms. The trade-off is that there's an extra dependency that you see compiling. I am generally sympathetic to that concern, which is why I've spent a lot of time keeping my dependency trees small, but it is not something that I optimize for at the expense of everything else.

I think moving forward, there is some potential for removing the lazy_static and byteorder dependencies. I'm already exploring the removal of the latter in the 0.2 release of regex-automata, since I will be bumping the MSRV to Rust 1.36 (which includes the endian/integer conversion routines added to std).
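For reference, the std routines being referred to are presumably the {from,to}_{le,be,ne}_bytes conversions on the integer types. A minimal before/after comparison (illustrative only, not regex-automata's actual code):

```rust
use std::convert::TryInto;

use byteorder::{ByteOrder, LittleEndian};

// With byteorder: read a little-endian u32 out of a byte buffer.
fn read_u32_byteorder(buf: &[u8]) -> u32 {
    LittleEndian::read_u32(buf)
}

// Dependency-free equivalent using only std; `from_le_bytes` has been
// stable since Rust 1.32, so it is available at an MSRV of 1.36.
fn read_u32_std(buf: &[u8]) -> u32 {
    u32::from_le_bytes(buf[..4].try_into().unwrap())
}
```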

lazy_static, I think, will be trickier to remove. In theory, a sufficiently expressive const fn feature should be enough, since loading a DFA into memory is by design simple, cheap and pure, with no allocation. The other possibility is that if lazy types get added to std, those could be used instead.
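For illustration, the "lazy types" route would look roughly like the sketch below, shown here with once_cell's Lazy (the design that was being proposed for std). This is an assumption about shape, not a committed plan, and it reuses the hypothetical embedded-DFA file from the earlier sketch.

```rust
use once_cell::sync::Lazy;
use regex_automata::{SparseDFA, DFA};

// Same embedded-DFA idea as before, but with a plain static and a `Lazy`
// cell instead of the lazy_static! macro. A std-provided equivalent would
// remove the extra dependency entirely.
static GRAPHEME_FWD: Lazy<SparseDFA<&'static [u8], u32>> =
    Lazy::new(|| unsafe { SparseDFA::from_bytes(include_bytes!("grapheme_fwd.dfa")) });

fn starts_with_grapheme(haystack: &[u8]) -> bool {
    GRAPHEME_FWD.is_match(haystack)
}
```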

In theory, memchr could also be made optional, likely at the cost of a significant performance decrease in almost all searching routines in the vast majority of common cases.
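To spell out that trade-off: the fallback to a SIMD-accelerated memchr would essentially be a plain byte scan, as in the sketch below (not an actual bstr code path).

```rust
// With the memchr crate: heavily optimized single-byte search, using SIMD
// where the platform supports it.
fn find_byte_fast(needle: u8, haystack: &[u8]) -> Option<usize> {
    memchr::memchr(needle, haystack)
}

// Without it: a straightforward linear scan. Correct, but typically much
// slower on long haystacks, which is where the performance decrease
// would show up.
fn find_byte_naive(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}
```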

@tbu- (Author) commented Apr 28, 2020

Thanks for the thorough answer.

I'm looking for a small crate that makes dealing with almost-UTF-8 strings nicer, so that I can work with them much like std's str type.

From the crate documentation:

This library bundles in a few more Unicode operations, such as grapheme, word and sentence iterators. More operations, such as normalization and case folding, may be provided in the future.

Take the following as coming from a not-yet user who isn't really informed about the history of this crate: could these perhaps be disabled by default so that this crate is more of a drop-in replacement for the standard library's str type?

@tbu- (Author) commented Apr 28, 2020

I went through the top ten reverse dependencies of bstr; the first one is your own crate, csv:

https://github.com/BurntSushi/rust-csv/blob/70c8600b29349f9ee0501577284d8300ae9c8055/Cargo.toml

Does it use the unicode features of bstr?

The ripgrep-related crates probably use those. Other than your own crates, I only see rlua (which does manage to disable the default features) and cargo-release (which does not, though I guess it doesn't use the Unicode data either).

@BurntSushi (Owner)

> Could these perhaps be disabled by default so that this crate is more of a drop-in replacement for the standard library's str type?

I think this is more of a philosophical stance, right? If I were to, say, embed the DFA search runtime from regex-automata into bstr and thereby remove the dependencies on regex-automata and byteorder (though I would probably still need to keep lazy_static), would you still be asking this question? If not, then why not?

Switching the defaults is something I'm possibly open to. (And this is good timing in particular, since I hope to put out a bstr 1.0 release sometime soonish, and changing this default is a breaking change.) The main problem I have with it is that I'd rather surface Unicode-aware APIs by default. While regex-automata is primarily used to implement the Unicode segmentation algorithms, it's also used in other places, such as the implementation of trim. I also anticipate that it may be used for other things, such as case conversion, although I'm not sure. If trim were the only concern, then it could be implemented without regex-automata.

Generally my view here was that things like graphemes, words and sentences should be available by default because---especially graphemes---they are often what you want for correctness reasons. Indeed, part of the motivation for bstr is to serve as a single one-stop-shop for these sorts of Unicode APIs. And this is interwoven with the idea of providing UTF-8-by-convention APIs, because most of the Unicode algorithms in the Rust ecosystem are implemented on &str and it's really hard to adapt or use them on &[u8].
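To make that concrete, here is a small example of the kind of Unicode-by-default, &[u8]-first API in question, sketched against bstr 0.2's ByteSlice trait; per the crate docs, invalid UTF-8 is substituted with the Unicode replacement codepoint rather than causing an error.

```rust
use bstr::ByteSlice;

fn main() {
    // Not valid UTF-8 as a whole: a flag emoji (two regional indicators)
    // followed by a stray 0xFF byte.
    let bytes: &[u8] = b"a\xF0\x9F\x87\xBA\xF0\x9F\x87\xB8\xFF";

    // Grapheme iteration directly on &[u8]; the invalid byte comes out as
    // the replacement codepoint instead of producing an error.
    for grapheme in bytes.graphemes() {
        println!("{:?}", grapheme);
    }

    // Unicode-aware trimming (the White_Space property), also on &[u8]:
    // U+2007 FIGURE SPACE is trimmed along with the ASCII spaces.
    let padded: &[u8] = b"  hello\xE2\x80\x87";
    assert_eq!(padded.trim(), b"hello");
}
```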

I think at a high level, I feel like "Unicode by default" is the philosophically better choice in general. regex makes the same choice: everything is Unicode-aware by default. This is because text is hard, and people get it wrong and consistently forget about corner cases. It means that folks who are aware of the trade-off and care about slimming their dependency tree will need to take explicit action, and I guess I kind of feel like that's OK. I'd rather that than have people who don't possess a deep understanding of text miss out. I grant that I'm being hand-wavy here, but it's because reasonable people can probably disagree about what the right default is in this case.

> Does it use the unicode features of bstr?

Yes, it uses bstr to trim whitespace via Unicode's White_Space property: https://github.com/BurntSushi/rust-csv/blob/70c8600b29349f9ee0501577284d8300ae9c8055/src/byte_record.rs#L374

(I have been considering removing the use of bstr from csv, since its dependency tree has gotten much bigger than I'd like, and I think the White_Space property is small enough that its handling can just be inlined.)
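As a rough sketch of what inlining that handling could look like (a hand-written check against the 25 White_Space code points instead of a pulled-in DFA; purely illustrative, not something csv actually does):

```rust
/// The code points with Unicode's White_Space=Yes property; the set is
/// small and very stable, so hand-maintaining it is plausible.
fn is_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'       // tab, line feed, vertical tab, form feed, carriage return
            | '\u{0020}'              // space
            | '\u{0085}'              // next line
            | '\u{00A0}'              // no-break space
            | '\u{1680}'              // ogham space mark
            | '\u{2000}'..='\u{200A}' // en quad through hair space
            | '\u{2028}'              // line separator
            | '\u{2029}'              // paragraph separator
            | '\u{202F}'              // narrow no-break space
            | '\u{205F}'              // medium mathematical space
            | '\u{3000}'              // ideographic space
    )
}

/// Trim White_Space from both ends of a field that happens to be valid
/// UTF-8; a byte-oriented version would decode at the boundaries instead.
fn trim_field(field: &str) -> &str {
    field.trim_matches(is_white_space)
}
```

In practice, char::is_whitespace already matches exactly this set, so the harder part is doing the trimming on &[u8] without assuming valid UTF-8, which is what bstr provides.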

@BurntSushi (Owner)

I'm going to close this out. I still feel largely the same as I did when I wrote my comments above, and I don't necessarily see that changing.
