Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: remove configuration knobs #57

Closed
BurntSushi opened this issue Mar 28, 2020 · 6 comments · Fixed by #99
Closed

RFC: remove configuration knobs #57

BurntSushi opened this issue Mar 28, 2020 · 6 comments · Fixed by #99

Comments

@BurntSushi
Copy link
Owner

BurntSushi commented Mar 28, 2020

This RFC is basically a mirror of BurntSushi/regex-automata#7

The idea here is that I would like to make premultiply=true, byte_classes=true be the only option that one can choose for DFAs. (It is currently the default for DFAs.) I think it was a mistake to expose these knobs, as disabling them doesn't really confer benefits (if any). But they are quite costly in terms of code size and clarity.

Are there any strenuous objections to removing these options?

@vks
Copy link

vks commented Apr 25, 2021

I'm not sure how many users of aho-corasick will read this here. Maybe just go ahead and deprecate it? If there are a lot of objections, you can always revert the deprecation.

@BurntSushi
Copy link
Owner Author

Yeah I should have updated this. I'm not holding myself up here as a result of the lack of feedback. My plan is to just to go forward with this. Just haven't got a chance to do it. It's not super high priority.

@BurntSushi
Copy link
Owner Author

Deprecating them sounds smart though. Might be a good forcing function.

@vks
Copy link

vks commented Apr 25, 2021

Sure, just removing it in the next breaking release also works.

I just thought that deprecating would be the most reliable way to get potential objections. FWIW, I cannot find any code on GitHub setting premultiply(false). For bytes_classes(false), I only found one use in tokei.

@BurntSushi
Copy link
Owner Author

@vks Yeah, there are some cases in which disabling byte classes can make the resulting search faster, but I've only ever seen slight improvements. (I've also seen regressions due to the increased size in the automaton.)

BurntSushi added a commit that referenced this issue Apr 30, 2021
These options aren't really carrying their weight. In a future release,
aho-corasick will make both options enabled by default all the time with
the impossibility of disabling them. The main reason for this is that
these options require a quadrupling in the amount of code in this crate.

While it's possible to see a performance hit when using byte classes, it
should generally be very small. The improvement, if one exists, just
doesn't see worth it.

Please see #57 for more
discussion.

This is meant to mirror a similar decision occurring in regex-automata:
BurntSushi/regex-automata#7.
BurntSushi added a commit that referenced this issue Apr 30, 2021
These options aren't really carrying their weight. In a future release,
aho-corasick will make both options enabled by default all the time with
the impossibility of disabling them. The main reason for this is that
these options require a quadrupling in the amount of code in this crate.

While it's possible to see a performance hit when using byte classes, it
should generally be very small. The improvement, if one exists, just
doesn't see worth it.

Please see #57 for more
discussion.

This is meant to mirror a similar decision occurring in regex-automata:
BurntSushi/regex-automata#7.
@BurntSushi
Copy link
Owner Author

In the next semver incompatible release, I am also planning to remove the ability to construct automatons using a smaller state ID representation. Instead, it will be fixed to u32 with no way to change it. See the corresponding comment on the regex-automata issue for more details: BurntSushi/regex-automata#7 (comment)

BurntSushi added a commit that referenced this issue Jan 5, 2023
This commit nearly wipes the slate clean with respect to the
Aho-Corasick implementations and rebuilds things from scratch. The main
thing that motivated this was actually to make it possible for
Aho-Corasick automata to support both unanchored and anchored searches,
without having to build two distinct automata for that purpose. (And the
reason why I wanted that was to support such a configuration from
within regex-automata.)

However, it quickly turned into something else entirely. This commit:

* Splits the old NFA into two: a noncontiguous NFA that always uses a
sparse representation and a contiguous NFA that uses both dense and
sparse representations. The old NFA was always noncontiguous but also
used dense and sparse representations for state transitions. This split
makes it possible to better optimize the NFA. Namely, its contiguous
nature has better cache effects and allows us to *substantially* reduce
memory usage (sometimes by an order of magnitude) when compared to the
old NFA.
* Both of the NFAs now unconditionally support both anchored and
unanchored searches with essentially no additional cost.
* The DFA remains mostly the same as before, except its stride is now
always a power of 2 and it can be made to support both unanchored and
anchored searches by copying the transition table and tweaking it. (The
copy is why only unanchored searches are supported by default.)
* We borrow a stripped down version of the 'Input' abstraction from my
work-in-progress on the 'regex-automata' crate. This reduces the
combinatorial explosion of different search routines while making common
cases ergonomic. For example, both 'anchored' and 'earliest' are now
configuration options on an 'Input'.
* The main AhoCorasick type now has fallible APIs in addition to
infallible APIs. It is still reasonable to use infallible APIs, as they
panic only based on incorrect combinations of build and search
configurations. Both of which may be static properties of a program, and
thus, an improper configuration is quite likely (but not necessarily) to
be a programmer error.
* The underlying Aho-Corasick implementations are now exposed through
sub-modules, and an 'Automaton' trait unifies their APIs. This now makes
it possible for callers to walk the automaton directly like any old
finite state machine.
* A no-std but with-alloc mode was added. This essentially just gets rid
of the streaming APIs (that require std::io) and the std::error::Error
trait impls.
* The old (and awkward) 'auto_configure' API is now gone. Instead,
automatic configuration is now the default.
* The old 'premultiply' option is now gone, and is now always enabled.
* The old 'byte_classes' option remains, but is enabled by default and
disabling it no longer confers any performance benefit. Instead, it only
provides a debuggability benefit by making state transitions easier to
understand.
* 'AhoCorasick' is no longer generic over the representation for its
state identifiers. All Aho-Corasick automatons now unconditionally use
'u32' for this. This is in practice big enough for most use cases, but
small enough to provide reasonable memory usage.

The real star of this show is the contiguous NFA. It not only
significantly reduces heap memory used, but is also quite a bit faster
than the old NFA while being only slightly slower to build. It doesn't
quite match the speed of a DFA, but it is quite close. These properties
combined make it an excellently balanced choice. Indeed,
'AhoCorasick::new' will use it for almost all cases, excepting when the
number of patterns is very small.

Other than that, the central type of this crate remains 'AhoCorasick',
and its most common use cases will work almost unchanged from
'aho-corasick 0.7'.

Closes #57, Closes #83
BurntSushi added a commit that referenced this issue Jan 5, 2023
This commit nearly wipes the slate clean with respect to the
Aho-Corasick implementations and rebuilds things from scratch. The main
thing that motivated this was actually to make it possible for
Aho-Corasick automata to support both unanchored and anchored searches,
without having to build two distinct automata for that purpose. (And the
reason why I wanted that was to support such a configuration from
within regex-automata.)

However, it quickly turned into something else entirely. This commit:

* Splits the old NFA into two: a noncontiguous NFA that always uses a
sparse representation and a contiguous NFA that uses both dense and
sparse representations. The old NFA was always noncontiguous but also
used dense and sparse representations for state transitions. This split
makes it possible to better optimize the NFA. Namely, its contiguous
nature has better cache effects and allows us to *substantially* reduce
memory usage (sometimes by an order of magnitude) when compared to the
old NFA.
* Both of the NFAs now unconditionally support both anchored and
unanchored searches with essentially no additional cost.
* The DFA remains mostly the same as before, except its stride is now
always a power of 2 and it can be made to support both unanchored and
anchored searches by copying the transition table and tweaking it. (The
copy is why only unanchored searches are supported by default.)
* We borrow a stripped down version of the 'Input' abstraction from my
work-in-progress on the 'regex-automata' crate. This reduces the
combinatorial explosion of different search routines while making common
cases ergonomic. For example, both 'anchored' and 'earliest' are now
configuration options on an 'Input'.
* The main AhoCorasick type now has fallible APIs in addition to
infallible APIs. It is still reasonable to use infallible APIs, as they
panic only based on incorrect combinations of build and search
configurations. Both of which may be static properties of a program, and
thus, an improper configuration is quite likely (but not necessarily) to
be a programmer error.
* The underlying Aho-Corasick implementations are now exposed through
sub-modules, and an 'Automaton' trait unifies their APIs. This now makes
it possible for callers to walk the automaton directly like any old
finite state machine.
* A no-std but with-alloc mode was added. This essentially just gets rid
of the streaming APIs (that require std::io) and the std::error::Error
trait impls.
* The old (and awkward) 'auto_configure' API is now gone. Instead,
automatic configuration is now the default.
* The old 'premultiply' option is now gone, and is now always enabled.
* The old 'byte_classes' option remains, but is enabled by default and
disabling it no longer confers any performance benefit. Instead, it only
provides a debuggability benefit by making state transitions easier to
understand.
* 'AhoCorasick' is no longer generic over the representation for its
state identifiers. All Aho-Corasick automatons now unconditionally use
'u32' for this. This is in practice big enough for most use cases, but
small enough to provide reasonable memory usage.

The real star of this show is the contiguous NFA. It not only
significantly reduces heap memory used, but is also quite a bit faster
than the old NFA while being only slightly slower to build. It doesn't
quite match the speed of a DFA, but it is quite close. These properties
combined make it an excellently balanced choice. Indeed,
'AhoCorasick::new' will use it for almost all cases, excepting when the
number of patterns is very small.

Other than that, the central type of this crate remains 'AhoCorasick',
and its most common use cases will work almost unchanged from
'aho-corasick 0.7'.

Closes #57, Closes #83
bestsoftwaretopappreviews21 added a commit to bestsoftwaretopappreviews21/aho-corasick that referenced this issue Sep 12, 2024
These options aren't really carrying their weight. In a future release,
aho-corasick will make both options enabled by default all the time with
the impossibility of disabling them. The main reason for this is that
these options require a quadrupling in the amount of code in this crate.

While it's possible to see a performance hit when using byte classes, it
should generally be very small. The improvement, if one exists, just
doesn't see worth it.

Please see BurntSushi/aho-corasick#57 for more
discussion.

This is meant to mirror a similar decision occurring in regex-automata:
BurntSushi/regex-automata#7.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants