Buffer data structure #17

Open
CeleritasCelery opened this issue Jan 14, 2023 · 18 comments
Labels: design needed (Items where more design help is needed)

Comments

@CeleritasCelery
Owner

CeleritasCelery commented Jan 14, 2023

The two main options for a buffer struct are a gap buffer or a rope. They both come with different tradeoffs. I really like this comment from the author of a Rust rope crate.

I don't think the tradeoff between gap buffers and ropes is really about file size. At small file sizes anything will work. You could just use a plain contiguous string if you wanted, and it would be totally fine. The whole point of using something like a gap buffer or rope or piece table or whatever is to avoid the shortcomings of something like a contiguous string as files get larger.

Gap buffers are actually great for large files, as long as the editing patterns are favorable. Specifically, for localized edits gap buffers are O(1), which is amazing. And for single-cursor editors, localized edits are pretty much the only case that needs to be handled interactively. So gap buffers are great (even with huge files) for such editors. However, gap buffers degrade to O(N) as edits get less and less localized. So if you want your editor to support non-localized editing patterns, gap buffers probably aren't great.

Ropes make a different performance trade-off, being a sort of "jack of all trades". They're not amazing at anything, but they're always solidly good with O(log N) performance. They're not the best choice for an editor that only supports local editing patterns, since they leave a lot of performance on the table compared to gap buffers in that case (again, even for huge documents). But for an editor that encourages non-localized edits, or just wants flexibility in that regard, they're a great choice because they always have good performance, whereas gap buffers degrade poorly with unfavorable editing patterns.

In other words, regardless of file size, gap buffers have both a better best case and worse worst case compared to ropes.

It's worth noting, however, that even multiple cursors are frequently quite localized. Most of the time I use them, for example, all cursors are still in view (or nearly so). It's really only with editors that are built around multiple cursors as the way to edit things that you commonly do interactive non-local edits.

So for something like emacs, I suspect any real arguments in favor of ropes or piece tables (or whatever else) aren't going to be about editing performance, but rather are going to be about secondary features like cheaply taking "snapshots" of a document to save it asynchronously, etc.

Like the author said, gap buffers have great performance, and they are much simpler than other data structures like ropes. They also make things like regex search much more performant.

Given all that, I think sticking with a gap buffer is the right choice. Most of the complaints about gap buffers stem from misunderstanding.
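To illustrate the simplicity, here is a minimal sketch of the core idea (names and layout are purely illustrative, not necessarily what this project will use): the text lives in one array with a movable gap at the edit point, so repeated insertions at the cursor are just writes into the gap.

struct GapBuffer {
    data: Vec<u8>,    // [ text before gap | gap | text after gap ]
    gap_start: usize, // first byte of the gap
    gap_end: usize,   // one past the last byte of the gap
}

impl GapBuffer {
    /// Move the gap so that it starts at text offset `pos`.
    /// Cost is O(distance moved), which is why localized edits are cheap.
    fn move_gap(&mut self, pos: usize) {
        if pos < self.gap_start {
            let len = self.gap_start - pos;
            self.data.copy_within(pos..self.gap_start, self.gap_end - len);
            self.gap_start = pos;
            self.gap_end -= len;
        } else if pos > self.gap_start {
            let len = pos - self.gap_start;
            self.data.copy_within(self.gap_end..self.gap_end + len, self.gap_start);
            self.gap_start += len;
            self.gap_end += len;
        }
    }

    /// Insert a byte at text offset `pos`. Once the gap is already at the
    /// cursor this is O(1). A real implementation would also grow the gap
    /// when it fills up and work in terms of UTF-8 strings, not raw bytes.
    fn insert(&mut self, pos: usize, byte: u8) {
        assert!(self.gap_start < self.gap_end, "gap is full; a real impl would reallocate");
        self.move_gap(pos);
        self.data[self.gap_start] = byte;
        self.gap_start += 1;
    }
}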

edit: fixed misquote

CeleritasCelery added the design needed label on Jan 14, 2023
@cessen

cessen commented Feb 8, 2023

Just a quick note. The quote is (accidentally) cut off at the beginning in a way that reverses the meaning of the first sentence. The full first sentence from the original is (the missing part being the opening "I don't think"):

I don't think the tradeoff between gap buffers and ropes is really about file size.

It's also worth noting that editing performance is not the only factor in choosing a buffer representation. For example, one of the benefits of things like ropes and piece tables is that you can quickly (in O(1)) take snapshots of the current state of the text, which is useful for things like saving the buffer to disk without blocking editing, doing syntax highlighting in another thread without blocking editing, etc. etc.
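To make the snapshot point concrete: if the tree nodes are reference counted and an edit only copies the path from the root down to the changed leaf, then a snapshot is just a clone of the root pointer. A toy sketch of the general persistent-tree idea, not any particular crate's internals:

use std::sync::Arc;

// Toy persistent rope node; a real rope also balances and stores metrics.
enum Node {
    Leaf(String),
    Branch { left: Arc<Node>, right: Arc<Node> },
}

// O(1) snapshot: cloning the Arc just bumps a reference count. The snapshot
// can be handed to another thread (saving, highlighting, ...) while the
// editor keeps mutating its own tree via path-copying edits.
fn snapshot(root: &Arc<Node>) -> Arc<Node> {
    Arc::clone(root)
}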

But yeah, as long as an editor isn't designed around multiple cursors/selections, I think purely from an editing performance standpoint a gap buffer is going to be really hard to beat. And it does have the benefit of being a beautifully simple data structure.

@CeleritasCelery
Owner Author

Thanks for catching that!

One other advantage of ropes is that they allow you to index by characters or line endings in O(log n) time. Without special support, that operation is O(n) with a gap buffer.
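For example, without a secondary index, jumping to the start of line N means scanning the text and counting newlines. A sketch, assuming the two contiguous halves of a gap buffer are available as byte slices:

/// Logical byte offset of the start of `line` (0-based), or None if the
/// buffer has fewer lines. O(n): every byte before the target is visited.
/// A rope that stores newline counts per node answers this in O(log n).
fn line_start(before_gap: &[u8], after_gap: &[u8], line: usize) -> Option<usize> {
    let mut remaining = line;
    for (i, &byte) in before_gap.iter().chain(after_gap).enumerate() {
        if remaining == 0 {
            return Some(i);
        }
        if byte == b'\n' {
            remaining -= 1;
        }
    }
    if remaining == 0 {
        Some(before_gap.len() + after_gap.len())
    } else {
        None
    }
}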

@cessen

cessen commented Feb 14, 2023

Yeah, that's definitely true. Ropes can easily accommodate secondary metrics like lines, etc. for indexing, whereas gap buffers need a separate secondary data structure to store that information if it's needed.

One thing to note, though:

they allow you to index by characters

I've been working in the text problem space for a while now, and I've come to the conclusion that indexing by char (a.k.a. scalar values, in Unicode terminology) isn't actually very useful in practice. I've written elsewhere about how I think Ropey's APIs probably should have been byte-index-based. It's not a problem that they're char-based, and I still think Ropey is an excellent rope library. But it's more like it was more effort than it was worth.

The short version is: since graphemes are actually the displayable "character" unit in unicode text, and since graphemes can be composed of multiple chars, in practice there isn't much difference between chars and bytes in terms of usefulness as an indexing metric. Neither actually represents a character as the end user understands it. And byte indexing is more efficient since you can just directly index into the text array.

(It's tempting to then say, "Well, let's index by grapheme then!" But that's a poor base indexing metric for a variety of reasons, such as inconsistency and performance.)
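A concrete example of the mismatch, using the unicode-segmentation crate (assumed as a dependency here) for the grapheme counting:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // 'e' followed by a combining acute accent: displays as a single "é".
    let s = "e\u{0301}";
    assert_eq!(s.len(), 3);                   // bytes
    assert_eq!(s.chars().count(), 2);         // chars (scalar values)
    assert_eq!(s.graphemes(true).count(), 1); // what the user sees
}

Neither the byte count nor the char count corresponds to the one "character" the user sees, so chars don't really buy you user-perceived characters either.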

@CeleritasCelery
Owner Author

CeleritasCelery commented Feb 14, 2023

Your point about character indexing is a good one. There is the famous post by Manish Goregaokar titled "stop ascribing meaning to codepoints" which essentially makes that same argument. However, I think char indexing has two advantages over byte indexing. I am curious what your take on this is.

1. Char indexing is always "valid"

Assuming you truncate char indices beyond the end of the text, char indexing will always return a valid UTF-8 string. It may not be semantically meaningful if your text begins with something like a combining mark, but it is technically still well-formed UTF-8. With byte indexing you have to decide how to handle indexes that don't land on a char boundary. Do you panic like Rust does? Do you round to the nearest boundary? If you do round, do you round up or down? I can see a lot of edge cases there. I think this is especially important in the context of interactive applications like text editors, where the user may give the bounds for substrings or slices.
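For what it's worth, the round-down policy is only a few lines on top of std's is_char_boundary. A sketch, though whether rounding is even the right policy is exactly the open question:

/// Round `idx` down to the nearest UTF-8 char boundary in `text`.
/// Rounding up would be symmetric; panicking is the other obvious policy.
fn round_down_to_char_boundary(text: &str, mut idx: usize) -> usize {
    if idx >= text.len() {
        return text.len();
    }
    while !text.is_char_boundary(idx) {
        idx -= 1; // a boundary is at most 3 bytes away in valid UTF-8
    }
    idx
}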

2. Indexing by bytes is a leaky abstraction

A byte index is only valid in the context of UTF-8. The unicode standard defines a codepoint, and all formats (UTF-8, UTF-16, UTF-32) agree on what a codepoint is. As we saw in the recent fasterthanlime article about LSP, it can be a problem when you expose implementation details, like indexing the way JavaScript does by UTF-16 surrogates. If the LSP protocol had decided to use Unicode code points instead, then JavaScript, Rust, and Emacs would all be talking the same language by default. This may not be as much of a problem because everything seems to be moving to UTF-8, but if someone ever creates UTF-10-ultra and it starts to become popular, then the leaky abstraction of operating over UTF-8 code units starts to show itself.

@cessen

cessen commented Feb 15, 2023

The reasons you've given are more-or-less why I chose char indexing for Ropey. And they're also the reasons that I no longer find compelling. I'll do my best to explain why at least semi-succinctly.

  1. Char indexing is always "valid"

While that's true, the reason I no longer find this argument compelling is that if you're working with byte indices and end up e.g. trying to insert text at a non-char boundary, that's a bug in the software. Moreover, it's likely a bug that would have happened with char-based indexing as well, but just would have manifested differently.

So the real question is how to handle such bugs. With char-based indexing, in some sense you're choosing to unconditionally follow through on those buggy, incorrectly-indexed operations, because even if they're logically incorrect they're at least guaranteed to be Unicode-valid. Whereas with byte-based indexing you need to choose how to handle when those bugs lead to Unicode-invalid operations as well.

In something like a text editor, you might choose to e.g. just not apply Unicode-invalid edits. Or you might choose to round to the nearest codepoint boundary in one or the other direction. But it's almost arbitrary which behavior you go with because you've already lost correctness at that point (and would have with char indexing as well) because the situation is due to a bug. So there's no right answer, it's just about doing the thing that's least likely to be unduly disruptive and/or lose data.

I think this is especially important in the context of interactive applications like text editors where the user may give the bounds for substrings or slices.

For user interaction, I can't think of any situation where the user should be directly specifying byte offsets. But this can certainly come up with e.g. scripting support, for sure.

But again, the situations where invalid edits (or slicing, or whatever) would be happening are bugs in the scripts anyway, so any reasonably graceful way of handling them is fine. Moreover, the scripting APIs should (ideally) make working in terms of grapheme boundaries straightforward, and have good examples, so that users can fall into a "pit of success" there.

  2. Indexing by bytes is a leaky abstraction

I actually feel the opposite, now. IMO, how you're actually storing the text (a.k.a. the encoding) is the real thing, and Unicode scalar values (a.k.a. chars in Rust, and what people usually actually mean when they say code point) are the abstraction. Granted, they're a solid abstraction. But an abstraction nonetheless.

The problems that people run into with indexing mismatches in e.g. LSP implementations and other similar situations are due to an abstraction mismatch, not an abstraction leak. People think LSP uses the scalar value abstraction when in fact it works in the lower-level utf16 code units.

For something like LSP, I think there's a good argument to be made that it probably should be working at the scalar value level of abstraction. But, unfortunately, it doesn't. And if anything, making people more conscious of the actual encodings that different systems use will help reduce mismatches across these kinds of system boundaries. Or at the very least, char indexing doesn't seem to help prevent those mismatches, because Helix (which uses Ropey, and thus char indexing) had a similar bug in its LSP implementation as the article you linked to.

Just generally, I think it makes sense for systems to work internally in terms of their native encodings and just translate at the system boundaries as needed. As opposed to always working in terms of an abstraction that may or may not match what's needed at the system boundary, and which might need to be translated anyway.

@CeleritasCelery
Owner Author

I think you have convinced me to use graphemes as an abstraction instead of codepoints. Emacs uses codepoints as its abstraction, and when testing it with multi-codepoint graphemes the incompatibility shows. The hard part is that Emacs already has the convention around indexing into strings via codepoint, or manipulating the cursor via arithmetic. I wrote some thoughts about how we might handle that here.

@cessen

cessen commented Feb 19, 2023

Thanks for the link!

(Note: I'm not an Emacs guy, so I could be speaking out of ignorance in some of my comments below.)

Ah! I didn't realize that Emacs already used char-based indexing. Due to its age, I kind of assumed it used byte-based indexing.

Given that that's the case, I think it probably would make sense to stick with char-based indexing for compatibility reasons...? There's nothing specifically wrong with char-based indexing, it's just less efficient than byte-based. But a little loss in efficiency is probably worth it for compatibility in this case, I would think.

Your finding that grapheme-based indexing is slow is right on point. But additionally, there are times when you really do need or want to manipulate things at a finer-grained level than graphemes. For example, a script author may want to remove just the accent off an "é". And having all indexing and operations work in terms of graphemes makes that kind of thing difficult. So having some way of working in terms of a lower-level indexing scheme (whether that be chars or bytes) is still important.

I think what you probably want to do is leave the existing elisp APIs as-is, but provide additional facilities for finding the prev/next grapheme boundary, incrementing/decrementing the cursor to the prev/next grapheme boundary, deleting up to the prev/next grapheme boundary, etc. etc. so that fixing incorrect scripts is trivial by just replacing the appropriate function calls and index arithmetic. This doesn't have quite the same "pit of success" effect as just changing the behavior of the existing char-based functions and index arithmetic, but I think is more sound from a technical perspective. And good documentation for aspiring script authors can guide people to use the grapheme-based functions where appropriate.
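For the boundary-finding helpers themselves, something like unicode-segmentation's GraphemeCursor keeps them small. A sketch over a plain &str, assuming `idx` already lies on a char boundary (a real buffer would feed the cursor chunks instead of the whole text):

use unicode_segmentation::GraphemeCursor;

/// Byte offset of the next grapheme boundary after `idx`, if any.
fn next_grapheme_boundary(text: &str, idx: usize) -> Option<usize> {
    let mut cursor = GraphemeCursor::new(idx, text.len(), true);
    // The whole text is passed as a single chunk to keep the sketch simple.
    cursor.next_boundary(text, 0).unwrap()
}

/// Byte offset of the previous grapheme boundary before `idx`, if any.
fn prev_grapheme_boundary(text: &str, idx: usize) -> Option<usize> {
    let mut cursor = GraphemeCursor::new(idx, text.len(), true);
    cursor.prev_boundary(text, 0).unwrap()
}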

Emacs already has the convention around indexing into strings via codepoint

A minor note about Unicode terminology: technically speaking, the correct term here is "scalar value" not "code point". It's not a big deal, and pretty much everyone misuses the term "code point" (including myself a handful of years back before Andrew Gallant of regex/ripgrep fame corrected me), so there's no shame here. But the technically correct term for "those things that have defined Unicode mappings (or are reserved for mappings) and which correspond to chars in Rust" is Unicode scalar value.

Code points are a bit broader, and include the values used for surrogate pairs in utf16.

The WTF-8 encoding spec provides some background about this, if you're curious. But the short version is: code points are effectively equivalent to scalar values in utf8, but are effectively equivalent to code units in utf16. So when talking about utf8, code point and scalar value are basically interchangeable. But the encoding-agnostic term is scalar value.

(Edit: actually, saying code point and code unit are equivalent in utf16 isn't quite right. Rather, all utf16 code units are also valid code points. But code points are, of course, a superset of utf16 code units since there's a lot more than 2^16 of them.)

@josephg

josephg commented Feb 27, 2023

Jumprope / diamond types (CRDT) author here. We ended up having a discussion about this in a recent reddit thread.

A few comments:

Given all that, I think sticking with a gap buffer is the right choice. Most of the complaints about gap buffers stem from misunderstanding.

In JumpRope I ended up putting a gap buffer in each leaf node in my skip list. This roughly doubled performance in real-world cases. It's really nice because:

  • Unlike a traditional gap buffer, each leaf ends up as a fixed size in memory, which makes them really fast.
  • Users don't move their cursors much, and when they do they often bounce out of that leaf node entirely anyway. This approach makes the common case (typing adjacent characters) super fast
  • The efficient size I found for the gap buffers (based on a lot of benchmarking) is quite large (~400 bytes). This reduces pressure on the rest of the tree, and makes the tree shallower & faster.
  • Gap buffers are super simple

It ends up a bit less memory-efficient (the gap buffers usually have a lot of unused space), but it's not such a big problem these days on modern computers.
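Roughly, the leaf layout looks like this (just the shape of the idea, not JumpRope's actual code; the size constant is the ballpark figure mentioned above):

const LEAF_CAPACITY: usize = 400; // ballpark size that benchmarked well

// Each leaf owns a fixed-size array with its own gap, so leaves never
// reallocate; the skip list above it only tracks per-leaf lengths.
struct Leaf {
    bytes: [u8; LEAF_CAPACITY],
    gap_start: u16,
    gap_end: u16,
}

impl Leaf {
    fn text_len(&self) -> usize {
        LEAF_CAPACITY - (self.gap_end - self.gap_start) as usize
    }
}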

It'd be fun (and pretty easy) to add a gap buffer to some of the benchmarks we have kicking around. I'd use my own from jumprope, but it doesn't support reallocating the gap buffer size.

The reasons you've given are more-or-less why I chose char indexing for Ropey. And they're also the reasons that I no longer find compelling.

I went into more detail in the reddit thread. In short, using codepoint indexes makes invalid state unrepresentable (insert positions within a character). When writing efficient CRDTs, it's really handy not to have to worry about this particular error state, especially over the network, where you have to expect malicious peers to show up sometimes.

I increasingly feel like ropes are essentially text buffers + indexes. The indexes you sometimes want are:

  • UTF8 bytes (maybe?)
  • Line + column positions
  • Grapheme cluster positions (?)
  • Codepoint indexes (for CRDTs)
  • Wchar / surrogate pair counts (for converting to / from cursor positions in other languages and environments)

Each index is somewhat expensive to maintain - adding wchar support to jumprope reduces performance by ~30% or so. But there's no "obvious" set of indexes. It really depends on the use case. Ropey seems to store a maximalist set of those indexes - which is honestly very convenient sometimes. And jumprope stores the minimum. (Just codepoint indexes, and wchar indexes if you enable a specific feature flag in the build). This is probably too minimal to build a text editor on top of.
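In the "text buffer + indexes" framing, each tree node just carries a summary of whichever metrics you decided to pay for, and those summaries sum up the tree. Something like this sketch, where the exact field set is the part that varies by use case:

// Per-node summary; an internal node stores the sum of its children, so
// converting between index spaces is a log-time descent. Every extra field
// is extra bookkeeping on every edit, which is where the cost comes from.
#[derive(Clone, Copy, Default)]
struct Metrics {
    bytes: usize,
    chars: usize,       // Unicode scalar values
    lines: usize,       // newline count
    utf16_units: usize, // "wchar" / surrogate-aware length
}

impl Metrics {
    fn combine(self, other: Metrics) -> Metrics {
        Metrics {
            bytes: self.bytes + other.bytes,
            chars: self.chars + other.chars,
            lines: self.lines + other.lines,
            utf16_units: self.utf16_units + other.utf16_units,
        }
    }
}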

I think you have convinced me to use graphemes as an abstraction instead of codepoints.

Ahhhh dies. I can see how this could make sense for a monolithic text editor, but it's a terrible idea as part of any public API, or for a networked system. The problem is that different components might be compiled with different versions of the unicode spec. Each version of unicode changes the rules for when codepoints merge into grapheme clusters. So position 560 today might be position 558 in the document after you update your unicode library and recompile.

@CeleritasCelery
Owner Author

It'd be fun (and pretty easy) to add a gap buffer to some of the benchmarks we have kicking around. I'd use my own from jumprope, but it doesn't support reallocating the gap buffer size.

I have actually been working on a gap buffer implementation, and I have been using your CRDT test data to both find bugs and benchmark. Here are the initial results from my unoptimized implementation on the realworld benchmarks.

realworld/JumpRope/automerge-paper
                        time:   [9.8875 ms 9.9132 ms 9.9393 ms]
realworld/GapBuffer/automerge-paper
                        time:   [8.1401 ms 8.1459 ms 8.1519 ms]
realworld/JumpRope/rustcode
                        time:   [1.7405 ms 1.7502 ms 1.7597 ms]
realworld/GapBuffer/rustcode
                        time:   [1.5364 ms 1.5383 ms 1.5405 ms]
realworld/JumpRope/sveltecomponent
                        time:   [654.96 µs 655.34 µs 655.71 µs]
realworld/GapBuffer/sveltecomponent
                        time:   [417.61 µs 418.12 µs 418.61 µs]
realworld/JumpRope/seph-blog1
                        time:   [4.0272 ms 4.0431 ms 4.0625 ms]
realworld/GapBuffer/seph-blog1
                        time:   [3.0263 ms 3.0385 ms 3.0550 ms]

As you would probably expect, it is slightly faster, but not in a major way. It also blows up pretty badly on some of the artificial benchmarks. For example, on the stable_ins_del/*/10000000 benchmark it is about 500x worse. I have some ideas about how to improve that, but I am not really concerned, because even at the 99th percentile the gap buffer is still under 200µs in that benchmark, which is well below human perception. For interactive editing, any of these data structures would be just fine.

My current plan is to store the data in a gap buffer and keep the metrics (char count, line count, maybe wchar?) in a rope. This will enable O(log n) jumping to an arbitrary line/char. Why not just keep everything in a rope then? For me it really comes down to searching. There is no good way to perform a regex search against a rope. For example, in xi-editor they copy the entire rope into a dedicated array, run the regex, and then throw it away. This adds O(n) overhead to every multi-line search. And unfortunately there is really no way around this. The Helix editor is also struggling with this limitation of ropes. They were trying to use the low-level regex-automata but it has a lot of performance holes and footguns, so they gave up. There is an open issue on the regex crate to allow regex on streams, but the maintainer has said that even if that is implemented it will not be able to match the performance of operating on slices. Gap buffers avoid the search problem because they already store data in an array. You just need to move the gap out of the search window.
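The gap-moving trick is tiny on top of the toy GapBuffer sketched in the first comment (assuming the buffer only ever holds valid UTF-8); a whole-buffer search is the worst case, but a bounded search only needs the gap pushed outside the window:

impl GapBuffer {
    /// Expose the entire text as one contiguous &str by pushing the gap to
    /// the end. O(n) worst case; searching a sub-range only requires moving
    /// the gap out of that range.
    fn as_contiguous(&mut self) -> &str {
        let text_len = self.data.len() - (self.gap_end - self.gap_start);
        self.move_gap(text_len); // gap now sits after all of the text
        std::str::from_utf8(&self.data[..text_len]).unwrap()
    }
}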

I can see how this could make sense for a monolithic text editor, but its a terrible idea as part of any public API, or for a networked system. The problem is that different components might be compiled with different versions of the unicode spec. Each version of unicode changes the rules for when codepoints merge into grapheme clusters. So position 560 today might be position 558 in the document after you update your unicode library and recompile.

That's a fair point, and I can see how that would be an issue for something that is talking to unknown clients. Though as you said that is not really an issue for monolithic applications.

Codepoints and graphemes don't have to be mutually exclusive. You can enable arbitrary indexing by codepoints while still preventing splitting graphemes. That is the direction I am leaning now. Though as you point out, this doesn't necessarily need to be "part of" the buffer. The grapheme logic could be separated, and that is probably the smartest way to handle it.

That being said, gap buffers have an extra incentive to use byte indexing that ropes do not: byte indexing is O(1) in a gap buffer, but it is still O(log n) in a rope. This probably doesn't matter as much in practice, but could show up in huge buffers.
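That O(1) lookup is just an offset adjustment around the gap; a sketch using the same toy layout as above:

impl GapBuffer {
    /// Translate a logical byte offset (as if the text were contiguous)
    /// into a physical index into `data`. O(1): no tree descent needed.
    fn physical_index(&self, logical: usize) -> usize {
        if logical < self.gap_start {
            logical
        } else {
            logical + (self.gap_end - self.gap_start)
        }
    }
}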

One thing that I do think is a mistake is making your indexing scheme graphemes. This is what was done by xi-editor (part of the reason I believe it is so slow) and Swift. The author of xi-editor has a retrospective looking at why he thinks grapheme indexing is a bad idea.

@cessen

cessen commented Feb 27, 2023

For me it really comes down to searching. There is no good way to perform a regex search against a rope. For example, in xi-editor (xi-editor/xi-editor#1192) they copy the entire rope into a dedicated array, run the regex, and then throw it away. This adds O(n) overhead to every multi-line search.
[...]
Gap buffers avoid the search problem because they already store data in an array. You just need to move the gap out of the search window.

To be fair, moving the gap also adds O(n) overhead in the general case. And I think it's probably pretty common for users to search the whole document, which would force the worst case. The constant factor is lower, of course, because the worst case is the gap being smack in the middle of the document, which would only require moving half the contents rather than the entire contents. But still.

(Edit: I just realized, you probably meant space overhead rather than time overhead. Indeed, gap buffers avoid the space overhead, which is a good point.)

For documents of typical size, I think xi-rope's approach is actually a reasonable stop-gap solution. But the real solution is a regex engine that can handle non-contiguous text in memory. That would even be preferable for a gap buffer based editor, I think.

They were trying to use the low-level regex-automata but it has a lot of performance holes and footguns, so they gave up.

IIRC, the main issue is that regex-automata currently only has a DFA engine that doesn't implement lazy building. DFA is great for raw search performance, but without lazy building the building itself can explode in time and memory for some regex patterns. And I think that's the issue Helix was running into.

My impression is that @BurntSushi plans to eventually migrate all the regex engines to regex-automata (presumably in a way that permits streaming/non-contiguous text), at which point using either a lazy DFA or an NFA would work well. I'm not aware of a specific timeline for that happening, though, which makes sense because despite all appearances @BurntSushi is a mere mortal like the rest of us and has limited time/energy. But yeah, it does mean that for the time being, short of building your own regex engine, these trade-offs/work-arounds are necessary.

One thing that I do think is a mistake is making your indexing scheme graphemes. This is what was done by xi-editor

I don't think this is true of xi-editor...? It definitely had facilities for finding and working with grapheme boundaries, but I don't think it indexed by them. Or at least the xi-rope component didn't: it used byte indexing.

@josephg

josephg commented Feb 27, 2023

There is an open issue (rust-lang/regex#425) on the regex crate to allow regex on streams, but the maintainer has said that even if that is implemented it will not be able to match the performance of operating on slices.

Interesting discussion. I have an instinct that there should be some trait API that could wrap a &str / &[u8] or wrap a rope / gap buffer library that would give the regex everything that it needs.

The simplest API would of course be:

trait StringAccessor {
  fn byte_at(&self, offset: usize) -> u8;
}

... That would be trivially implemented by slices (by doing array lookups) and probably wouldn't be too bad for ropes, if we cached the current rope chunk. But yeah, there's probably something higher performance that would allow the regex to drive forwards and backwards through the rope, like:

trait StringAccessor {
  fn read_chunk(&self, start_byte: usize, into: &mut [u8]) -> usize;
}

... But that would be a slower API to implement on top of a string slice.

Anyway, probably all thoughts that have already been said somewhere else in that monster thread.

I have actually been working on gap buffer implementation. And I have been using your crdt test data to both find bugs and benchmark. Here are the initial results from my unoptimized implementation on the realworld benchmarks.

That's super cool! I'm glad to see the performance improve on jumprope. The thing I like most about gap buffers is how simple the implementation can be. And we talk a lot about how O(n) performance is bad, but the memcpy operations that go on in gap buffers are consistently way faster than my intuition tells me they should be. Like, iterating a linked list and moving the gap in a gap buffer are theoretically the same time complexity. But iterating a linked list is (I think) hundreds of times slower than the memcpy operation because of how modern memory hierarchies work. And humans don't actually edit at literally random locations in a text document. So I don't think it even really matters what the performance in that case looks like.

I also think a gap buffer like that might be a great fit for something like a text editor in the browser, where you don't care about optimizing for 1 GB text documents, but you do want the compiled WebAssembly to stay tiny. (But maybe adding the index over the top would make it just as complex as jumprope anyway. Eh!)

@cessen

cessen commented Feb 28, 2023

I have an instinct that there should be some trait API that could wrap a &str / &[u8] or wrap a rope / gap buffer library that would give the regex everything that it needs.

The hard part isn't the API, unfortunately. The issue is that the internal engine implementations currently assume a contiguous buffer, and it's non-trivial to update them to work with non-contiguous buffers. (I made an attempt at doing so for the regex crate's pike vm a while back, so I'm speaking from experience.)

Designing the API is easy. It's updating all the low-level core code that's tricky.

@BurntSushi

My impression is that @BurntSushi plans to eventually migrate all the regex engines to regex-automata (presumably in a way that permits streaming/non-contiguous text), at which point using either a lazy DFA or an NFA would work well. I'm not aware of a specific timeline for that happening, though, which makes sense because despite all appearances @BurntSushi is a mere mortal like the rest of us and has limited time/energy. But yeah, it does mean that for the time being, short of building your own regex engine, these trade-offs/work-arounds are necessary.

I want to be clear here that searching streams is not really a problem I'm working on right now. My main goal is getting regex ported over to regex-automata and exposing the underlying regex engines. I think that work will ultimately lead to more fruitful experimentation with streams. For example, the full DFA and lazy DFA both have "byte at a time" APIs that permit walking the state machine. So in theory, you should be able to use either to implement whatever kind of search you want. The major caveat here is that both DFAs support everything except for Unicode word boundaries. So if you just try to use the lazy DFA for everything, you'd have to probably translate \b to (?-u:\b).

The NFA engines (PikeVM and backtracker) are harder. They do not currently expose "byte at a time" APIs. So it's not really possible to implement stream searching with how they currently work. Putting aside the backtracker because you probably don't want that anyway (it only works for small regexes/haystacks), the PikeVM is probably the answer to "execute any regex on a stream." But in order to do it, either the API needs to evolve to support streams, or y'all will need to build your own PikeVM. I strongly encourage you to do the latter if the lazy DFA is insufficient for your use case. For a few practical reasons:

  1. I am unlikely to have any bandwidth any time soon to work on the streaming API evolution required.
  2. Building your own PikeVM, once regex-automata 0.3 is out, should not be a daunting prospect. Namely, you'll be able to reuse regex-automata's NFA compiler, which is a monumental source of complexity (largely because of Unicode). The actual core bits of the PikeVM are probably only a few hundred lines or so, less if you can remove things you don't need (like capture groups?).

@josephg Yes, APIs like that have been proposed in that regex issue thread. Consider though that "stream" doesn't just mean "search on text in memory but that is not stored contiguously." It might also mean "search something that implements std::io::Read." The latter is a fundamentally different problem I think, and one that isn't solved by your API for example. It's plausible that we shouldn't try to solve it at all.

And indeed, updating the regex engines to support that API is a daunting prospect. A major part of the perf of a regex engine is literal prefilters, and it is crazy difficult to make these work with APIs such as yours (even adaptations that expose slices of bytes). There are many issues, but consider one: if you use a prefilter on a slice of bytes and it finds nothing, you move on to the next slice and find nothing. But oops, you might have just missed a match that was spread across the boundary of the slices. If you were just running a state machine, then your state machine would be in a specific state after the end of the first slice, and then pick up where it left off at the beginning of the next and find the match. But once you add prefilters to the mix, this conceptual model breaks down. Fundamentally, this is why I think a lot of people consider stream searching to be feasible: their mental model of how the regex engine actually works is extremely imprecise. Even something as simple as Aho-Corasick doesn't fit this mental model because it too has prefilters. And indeed, the aho-corasick crate supports stream searching, but it disables all but the most trivial of prefilters when doing so.
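To show what I mean by the state machine picking up where it left off, here's a sketch over a hypothetical byte-at-a-time automaton trait (deliberately not the real regex-automata API; it elides match kinds, anchoring, byte classes, EOI handling and so on):

// Hypothetical minimal automaton interface, only to illustrate resumability.
trait ByteMachine {
    type State: Copy;
    fn start(&self) -> Self::State;
    fn next(&self, state: Self::State, byte: u8) -> Self::State;
    fn is_match(&self, state: Self::State) -> bool;
}

/// Run the machine over non-contiguous chunks (e.g. the two halves of a gap
/// buffer) and return the absolute offset just past the first match.
fn find_across_chunks<'a, M, I>(machine: &M, chunks: I) -> Option<usize>
where
    M: ByteMachine,
    I: IntoIterator<Item = &'a [u8]>,
{
    let mut state = machine.start();
    let mut offset = 0;
    for chunk in chunks {
        for &byte in chunk {
            state = machine.next(state, byte);
            offset += 1;
            if machine.is_match(state) {
                return Some(offset);
            }
        }
        // Nothing special happens at a chunk boundary: the current state
        // carries everything the search needs. It's the prefilters that
        // don't survive this model.
    }
    None
}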

That Hyperscan manages to do this is no small miracle. If you look at its API reference and search for HS_MODE_SOM_HORIZON_LARGE, that might tip you off as to how this sort of thing actually gets implemented. That is, even though it's streaming, it actually needs to keep some chunk of the stream in memory for things to work. And if the chunk isn't big enough... well... too bad, I guess!

Then you have the problem of how you tie all this together. It's tempting to consider your API proposal (or one similar to it) as the "fundamental" API. But as you can see, it doesn't work well with prefilters. So does that mean the internal regex engines need two entirely different code paths? I dunno. I don't know how to square that circle.

It is for this reason that I think folks should go off and figure out how to make regex-automata 0.3 work with stream searching. Hopefully the lazy DFA will be somewhat simple to use, depending on what it is you need from it. On the other side of that is re-implementing the PikeVM for your specific use case, which I do hope will be reasonable. It's still tricky work, but you'll be standing on quite a bit of infrastructure to even get to that point. My guess is that, if you want to support Unicode word boundaries, you'll want to:

  1. Forget about prefilters.
  2. Use the lazy DFA whenever you can because it's much faster than the PikeVM. Usually by an order of magnitude.
  3. Use your own PikeVM to handle Unicode word boundaries.

@BurntSushi

Oh and yeah, I don't commit to time tables for releases. I do hope for regex-automata 0.3 to be out in the next few months.

@cessen

cessen commented Feb 28, 2023

Thanks for all the details, @BurntSushi!

I want to be clear here that searching streams is not really a problem I'm working on right now. My main goal is getting regex ported over to regex-automata and exposing the underlying regex engines.
[...]
The NFA engines (PikeVM and backtracker) are harder. They do not currently expose "byte at a time" APIs.
[...]
But in order to do it, either the API needs to evolve to support streams, or y'all will need to build your own PikeVM. I strongly encourage you to do the latter if the lazy DFA is insufficient for your use case.

That makes a lot of sense, thanks!

After regex-automata 0.3 is released would you still prefer that people implement their own PikeVM for that use case, or would you be open to PRs (with appropriate discussion about how best to do it) to change regex-automata's PikeVM to work a byte at a time? I don't expect to have the bandwidth for something like that super soon, but I might by the time 0.3 is released.

My gut feeling is that for text editors just a straight up PikeVM is probably the way to go. Granted, it's slower than a DFA, and then even slower again than a DFA + literal optimizations. But functionally it covers all the use cases, and I suspect is still more than fast enough since typically you aren't trying to crunch through an entire repo's worth of text in a text editor. (There's always the occasional huge file, of course, but I feel like those cases are best dealt with by making the search non-blocking.)

@BurntSushi

After regex-automata 0.3 is released would you still prefer that people implement their own PikeVM for that use case, or would you be open to PRs (with appropriate discussion about how best to do it) to change regex-automata's PikeVM to work a byte at a time?

Yes, that's what I meant with "I am unlikely to have any bandwidth any time soon to work on the streaming API evolution required." And in particular, I'd really want to see a real implementation working. That will help a lot with API design. And I should also say, it's not clear to me that I will ever evolve the APIs. It might just be something that needs to be done by someone else, or built bespoke for each.

With respect to using the PikeVM, I'll note these things:

  • Throughput is probably "low tens of MB/s", whereas a lazy DFA will be "hundreds of MB/s".
  • At least currently, I would expect streaming search with the lazy DFA to be easier to implement. Because in that case, you don't have to build your own regex engine, like you do for the PikeVM. You just need to use an existing DFA, which is in and of itself a regex engine.

But you are of course right that most searches in a text editor will likely be on small files.

@cessen

cessen commented Feb 28, 2023

That will help a lot with API design. And I should also say, it's not clear to me that I will ever evolve the APIs.

So, just to make sure we're on the same page: what I'm talking about is not the regex crate. I basically assume those APIs will stay the same. Particularly because I'm increasingly convinced that the applications that want to do things like non-contiguous or streaming searches would probably be better served by lower-level control via something like regex-automata anyway.

I'm only talking about changing the lower-level PikeVM in regex-automata. (That would of course have impact on the internal code of the regex crate to adjust to that, but not its public APIs.)

In any case, regardless, it sounds like implementing a separate PikeVM is a good way to go, for exploration etc. as you pointed out. Thanks!

@BurntSushi

Yes, that's what I'm talking about too.

With that said, I don't mean this in a "don't talk to me or ask me questions about this" way. I would very much welcome anyone working on this to open a new Discussion on the regex repo where we can chat. Just because I probably won't have the bandwidth any time soon to work on the streaming API use case doesn't mean I won't be amenable to talking about it or even doing smaller forms of API evolution. For example, you might get to a point where you're like, "well, I could do things this way, but only if that internal routine in regex-automata is exposed, otherwise I have to go write 1,000 extra lines of code." Definitely reach out to me if you wind up in a situation like that.
