Buffer data structure #17
Just a quick note. The quote is (accidentally) cut off at the beginning in a way that reverses the meaning of the first sentence. The full first sentence from the original is (missing part bolded):
It's also worth noting that editing performance is not the only factor in choosing a buffer representation. For example, one of the benefits of things like ropes and piece tables is that you can quickly (in O(1)) take snapshots of the current state of the text, which is useful for things like saving the buffer to disk without blocking editing, doing syntax highlighting in another thread without blocking editing, etc. etc. But yeah, as long as an editor isn't designed around multiple cursors/selections, I think purely from an editing performance standpoint a gap buffer is going to be really hard to beat. And it does have the benefit of being a beautifully simple data structure.
Thanks for catching that! One other advantage of ropes is that they allow you to index by characters or line endings in O(log n) time. Without special support that operation is O(n) with a gap buffer.
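For a concrete picture of that kind of indexing, here is a minimal sketch assuming Ropey's public `Rope` API (the string and indices are just a toy example):

```rust
// Sketch only: assumes the `ropey` crate is available as a dependency.
use ropey::Rope;

fn main() {
    let rope = Rope::from_str("Hello\nworld\nfrom a rope\n");

    // Both of these run in O(log n), because each node caches char and
    // line-break counts for its subtree.
    let line_of_char = rope.char_to_line(8);       // char 8 is the 'r' in "world"
    let first_char_of_line = rope.line_to_char(2); // start of "from a rope"

    println!("char 8 is on line {line_of_char}");
    println!("line 2 starts at char {first_char_of_line}");
}
```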
Yeah, that's definitely true. Ropes can easily accommodate secondary metrics like lines, etc. for indexing, whereas gap buffers need a separate secondary data structure to store that information if it's needed. One thing to note, though:
I've been working in the text problem space for a while now, and I've come to the conclusion that indexing by chars/code points isn't actually as meaningful as it seems. The short version is: since graphemes are actually the displayable "character" unit in unicode text, and since graphemes can be composed of multiple code points, char indexing doesn't really buy you the "character" semantics people expect over plain byte indexing.

(It's tempting to then say, "Well, let's index by grapheme then!" But that's a poor base indexing metric for a variety of reasons, such as inconsistency and performance.)
Your point about character indexing is a good one. There is the famous post by Manish Goregaokar titled "stop ascribing meaning to codepoints" which essentially makes that same argument. However, I think char indexing has two advantages over byte indexing. I am curious about your take on this.

**1. Char indexing is always "valid"**

Assuming you truncate char indices beyond the end of the text, char indexing will always return a valid UTF-8 string. It may not be semantically meaningful if your text begins with something like a combining mark, but it is technically still well-formed UTF-8. With byte indexing you have to decide how to handle indexes that don't land on a char boundary. Do you panic like Rust does? Do you round to the nearest boundary? If you do round, do you round up or down? I can see a lot of edge cases there. I think this is especially important in the context of interactive applications like text editors, where the user may give the bounds for substrings or slices.

**2. Indexing by bytes is a leaky abstraction**

A byte index is only valid in the context of UTF-8. The unicode standard defines a codepoint, and all formats (UTF-8, UTF-16, UTF-32) agree on what a codepoint is. As we saw in the recent fasterthanlime article about LSP, it can be a problem when you expose implementation details like indexing the way JavaScript does it, by UTF-16 surrogates. If the LSP protocol had decided to use Unicode code points instead, then JavaScript, Rust, and Emacs would all be talking the same language by default. This may not be as much of a problem because everything seems to be moving to UTF-8, but if someone ever creates UTF-10-ultra and it starts to become popular, then the leaky abstraction of operating over UTF-8 code units starts to show itself.
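For concreteness, here is a small sketch of the panic-vs-round question in Rust's standard library; the round-down policy shown is just one possible choice:

```rust
// Sketch: what an "invalid" byte index looks like with Rust's std, and one
// possible rounding policy (round down to the previous char boundary).
fn main() {
    let s = "naïve"; // 'ï' takes 2 bytes in UTF-8, so s.len() == 6

    // Byte index 3 falls in the middle of 'ï':
    assert!(!s.is_char_boundary(3));
    // `&s[..3]` would panic with "byte index 3 is not a char boundary".

    // One policy is to round the index down to the previous boundary:
    let mut i = 3;
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    assert_eq!(&s[..i], "na");
}
```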
The reasons you've given are more-or-less why I chose char indexing for Ropey in the first place.
While that's true, the reason I no longer find this argument compelling is that if you're working with byte indices and end up e.g. trying to insert text at a non-char boundary, that's a bug in the software. Moreover, it's likely a bug that would have happened with char indexing as well.

So the real question is how to handle such bugs. In something like a text editor, you might choose to e.g. just not apply Unicode-invalid edits. Or you might choose to round to the nearest codepoint boundary in one or the other direction. But it's almost arbitrary which behavior you go with, because you've already lost correctness at that point (and would have with char indexing as well) because the situation is due to a bug. So there's no right answer, it's just about doing the thing that's least likely to be unduly disruptive and/or lose data.
For user interaction, I can't think of any situation where the user should be directly specifying byte offsets. But this can certainly come up with e.g. scripting support. Then again, the situations where invalid edits (or slicing, or whatever) would be happening are bugs in the scripts anyway, so any reasonably graceful way of handling them is fine. Moreover, the scripting APIs should (ideally) make working in terms of grapheme boundaries straightforward, and have good examples, so that users can fall into a "pit of success" there.
I actually feel the opposite, now. IMO, how you're actually storing the text (a.k.a. the encoding) is the real thing, and Unicode scalar values (a.k.a. chars) are the abstraction over it.

The problems that people run into with indexing mismatches in e.g. LSP implementations and other similar situations are due to an abstraction mismatch, not an abstraction leak. People think LSP uses the scalar value abstraction when in fact it works in the lower-level utf16 code units. For something like LSP, I think there's a good argument to be made that it probably should be working at the scalar value level of abstraction. But, unfortunately, it doesn't. And if anything, making people more conscious of the actual encodings that different systems use will help reduce mismatches across these kinds of system boundaries.

Just generally, I think it makes sense for systems to work internally in terms of their native encodings and just translate at the system boundaries as needed. As opposed to always working in terms of an abstraction that may or may not match what's needed at the system boundary, and which might need to be translated anyway.
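As a sketch of what "translate at the system boundary" can look like in practice, here is one way to convert an LSP-style UTF-16 column into a UTF-8 byte offset within a line; the function name and the clamping policy are illustrative assumptions, not anything from an actual LSP library:

```rust
// Sketch: translating a UTF-16 column into a UTF-8 byte offset within a
// single line, at the boundary between an LSP-ish protocol and a UTF-8 buffer.
fn utf16_col_to_byte_offset(line: &str, utf16_col: usize) -> usize {
    let mut u16_seen = 0;
    for (byte_idx, ch) in line.char_indices() {
        if u16_seen >= utf16_col {
            return byte_idx;
        }
        u16_seen += ch.len_utf16(); // 1 or 2 code units per scalar value
    }
    line.len() // clamp past-the-end columns to the end of the line
}

fn main() {
    // '😀' is one scalar value, 2 UTF-16 code units, 4 UTF-8 bytes.
    let line = "a😀b";
    assert_eq!(utf16_col_to_byte_offset(line, 0), 0); // before 'a'
    assert_eq!(utf16_col_to_byte_offset(line, 1), 1); // before '😀'
    assert_eq!(utf16_col_to_byte_offset(line, 3), 5); // before 'b'
}
```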
I think you have convinced me to use graphemes as an abstraction instead of codepoints. Emacs uses codepoints as its abstraction, and when testing it with multi-codepoint graphemes the incompatibility shows. The hard part is that Emacs already has the convention around indexing into strings via codepoint, or manipulating the cursor via arithmetic. I wrote some thoughts about how we might handle that here.
Thanks for the link! (Note: I'm not an Emacs guy, so I could be speaking out of ignorance in some of my comments below.)

Ah! I didn't realize that Emacs already used codepoint indexing in its scripting APIs. Given that that's the case, I think it probably would make sense to stick with that.

Your finding that grapheme-based indexing is slow is right on point. But additionally, there are times when you really do need or want to manipulate things at a finer-grained level than graphemes. For example, a script author may want to remove just the accent off an "é". And having all indexing and operations work in terms of graphemes makes that kind of thing difficult. So having some way of working in terms of a lower-level indexing scheme (whether that be codepoints or bytes) is still important.

I think what you probably want to do is leave the existing elisp APIs as-is, but provide additional facilities for finding the prev/next grapheme boundary, incrementing/decrementing the cursor to the prev/next grapheme boundary, deleting up to the prev/next grapheme boundary, etc. etc. so that fixing incorrect scripts is trivial by just replacing the appropriate function calls and index arithmetic. This doesn't have quite the same "pit of success" effect as just changing the behavior of the existing char-based functions and index arithmetic, but I think is more sound from a technical perspective. And good documentation for aspiring script authors can guide people to use the grapheme-based functions where appropriate.
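As a rough sketch of what such a grapheme-boundary helper can look like in Rust, assuming the `unicode-segmentation` crate (the function name here is illustrative, not an existing API):

```rust
// Sketch: a "next grapheme boundary" helper, assuming `unicode-segmentation`.
use unicode_segmentation::UnicodeSegmentation;

/// Byte offset of the first grapheme boundary strictly after `pos`.
fn next_grapheme_boundary(text: &str, pos: usize) -> usize {
    text.grapheme_indices(true)
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .find(|&i| i > pos)
        .unwrap_or(text.len())
}

fn main() {
    // "é" written as 'e' + U+0301 (combining acute): one grapheme, 3 bytes.
    let text = "e\u{0301}x";
    // From inside that grapheme (byte 1), the next boundary is byte 3, not 2.
    assert_eq!(next_grapheme_boundary(text, 1), 3);
}
```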
A minor note about Unicode terminology: technically speaking, the correct term here is "scalar value" not "code point". It's not a big deal, and pretty much everyone misuses the term "code point" (including myself a handful of years back before Andrew Gallant of regex/ripgrep fame corrected me), so there's no shame here. But the technically correct term for "those things that have defined Unicode mappings (or are reserved for mappings) and which correspond to Rust's `char` type" is "scalar value".

Code points are a bit broader, and include the values used for surrogate pairs in utf16. The WTF-8 encoding spec provides some background about this, if you're curious. But the short version is: code points are effectively equivalent to scalar values in utf8, but are effectively equivalent to code units in utf16. So when talking about utf8, code point and scalar value are basically interchangeable. But the encoding-agnostic term is scalar value.

(Edit: actually, saying code point and code unit are equivalent in utf16 isn't quite right. Rather, all utf16 code units are also valid code points. But code points are, of course, a superset of utf16 code units since there's a lot more than 2^16 of them.)
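A tiny illustration of the distinction, using only Rust's standard library:

```rust
// Sketch: surrogate code points are not scalar values, so Rust's `char`
// (which is defined as a scalar value) refuses them.
fn main() {
    // U+D800 is a code point reserved for UTF-16 surrogates: not a scalar value.
    assert_eq!(char::from_u32(0xD800), None);

    // U+1F600 (😀) is a scalar value above the BMP: it needs a surrogate pair
    // in UTF-16 but is a single `char` in Rust.
    let smiley = char::from_u32(0x1F600).unwrap();
    assert_eq!(smiley.len_utf16(), 2);
    assert_eq!(smiley.len_utf8(), 4);
}
```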
Jumprope / diamond types (CRDT) author here. We ended up having a discussion about this in a recent reddit thread. A few comments:
In JumpRope I ended up putting a gap buffer in each leaf node in my skip list. This roughly doubled performance in real-world cases. It's really nice because:
It ends up a bit less memory-efficient (the gap buffers usually have a lot of unused space), but it's not such a big problem these days on modern computers. It'd be fun (and pretty easy) to add a gap buffer to some of the benchmarks we have kicking around. I'd use my own from jumprope, but it doesn't support reallocating the gap buffer size.
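To illustrate the general idea (this is not JumpRope's actual layout, just a sketch; the capacity and names are made up), a leaf that embeds a small gap buffer looks roughly like this, and edits that stay inside one leaf never touch the rest of the tree/skip list:

```rust
// Sketch of a leaf node that owns a fixed-capacity buffer with a movable gap.
const LEAF_CAPACITY: usize = 128;

struct Leaf {
    bytes: [u8; LEAF_CAPACITY],
    gap_start: usize, // bytes[..gap_start] is the text before the gap
    gap_end: usize,   // bytes[gap_end..] is the text after the gap
}

impl Leaf {
    fn len(&self) -> usize {
        LEAF_CAPACITY - (self.gap_end - self.gap_start)
    }

    /// Insert at the gap: O(1) when the gap is already at the edit point
    /// and there is room, which is the common case while typing.
    fn insert_at_gap(&mut self, byte: u8) -> bool {
        if self.gap_start == self.gap_end {
            return false; // leaf full: the tree layer would split it here
        }
        self.bytes[self.gap_start] = byte;
        self.gap_start += 1;
        true
    }
}

fn main() {
    let mut leaf = Leaf { bytes: [0; LEAF_CAPACITY], gap_start: 0, gap_end: LEAF_CAPACITY };
    assert!(leaf.insert_at_gap(b'a'));
    assert_eq!(leaf.len(), 1);
}
```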
I went into more detail in the reddit thread. In short, using codepoint indexes makes invalid state unrepresentable (insert positions within a character). When writing efficient CRDTs, it's really handy not to have to worry about this particular error state - especially over the network, where you have to expect malicious peers to show up sometimes. I increasingly feel like ropes are essentially text buffers + indexes. The indexes you sometimes want are:
Each index is somewhat expensive to maintain - adding wchar support to jumprope reduces performance by ~30% or so. But there's no "obvious" set of indexes. It really depends on the use case. Ropey seems to store a maximalist set of those indexes - which is honestly very convenient sometimes. And jumprope stores the minimum. (Just codepoint indexes, and wchar indexes if you enable a specific feature flag in the build). This is probably too minimal to build a text editor on top of.
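As a rough illustration of where that per-index maintenance cost comes from, here is a sketch of the kind of per-chunk summary a rope or skip-list node might aggregate up the tree; the exact field set is illustrative, not any particular crate's:

```rust
// Sketch: a per-chunk summary. Every extra field is one more thing to
// recompute on every edit and to keep consistent while rebalancing.
#[derive(Default, Clone, Copy)]
struct TextSummary {
    bytes: usize,
    chars: usize,       // Unicode scalar values
    utf16_units: usize, // "wchar" positions, for hosts that count UTF-16
    line_breaks: usize,
}

fn summarize(chunk: &str) -> TextSummary {
    let mut s = TextSummary { bytes: chunk.len(), ..Default::default() };
    for ch in chunk.chars() {
        s.chars += 1;
        s.utf16_units += ch.len_utf16();
        if ch == '\n' {
            s.line_breaks += 1;
        }
    }
    s
}

fn main() {
    let s = summarize("a😀\nb");
    assert_eq!((s.bytes, s.chars, s.utf16_units, s.line_breaks), (7, 4, 5, 1));
}
```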
Ahhhh dies. I can see how this could make sense for a monolithic text editor, but it's a terrible idea as part of any public API, or for a networked system. The problem is that different components might be compiled with different versions of the unicode spec. Each version of unicode changes the rules for when codepoints merge into grapheme clusters. So position 560 today might be position 558 in the document after you update your unicode library and recompile.
I have actually been working on a gap buffer implementation. And I have been using your crdt test data to both find bugs and benchmark. Here are the initial results from my unoptimized implementation on the real-world benchmarks.
As you would probably expect, it is slightly faster, but not in a major way. It also blows up pretty badly on some of the artificial benchmarks. My current plan is to store the data in a gap buffer and keep the metrics (char count, line count, maybe wchar?) in a rope. This will enable fast local edits while still keeping char and line lookups cheap.
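For anyone following along, here is a minimal, unoptimized sketch of the core gap buffer mechanics under discussion; it is byte-oriented only and the names and capacity are illustrative, not the actual implementation:

```rust
// A minimal gap buffer sketch. Real implementations also grow the gap and
// respect char/grapheme boundaries; this only shows the gap mechanics.
struct GapBuffer {
    data: Vec<u8>,
    gap_start: usize,
    gap_end: usize,
}

impl GapBuffer {
    fn new(capacity: usize) -> Self {
        GapBuffer { data: vec![0; capacity], gap_start: 0, gap_end: capacity }
    }

    /// Move the gap so that it starts at `pos`. O(distance moved), which is
    /// the cost you pay when the cursor jumps far away.
    fn move_gap(&mut self, pos: usize) {
        while self.gap_start > pos {
            self.gap_start -= 1;
            self.gap_end -= 1;
            self.data[self.gap_end] = self.data[self.gap_start];
        }
        while self.gap_start < pos {
            self.data[self.gap_start] = self.data[self.gap_end];
            self.gap_start += 1;
            self.gap_end += 1;
        }
    }

    /// Insert at `pos`. O(1) amortized when edits cluster near the gap.
    fn insert(&mut self, pos: usize, byte: u8) {
        assert!(self.gap_start < self.gap_end, "gap full; a real impl would grow here");
        self.move_gap(pos);
        self.data[self.gap_start] = byte;
        self.gap_start += 1;
    }
}

fn main() {
    let mut buf = GapBuffer::new(16);
    for (i, b) in b"hello".iter().enumerate() {
        buf.insert(i, *b);
    }
    buf.insert(0, b'>'); // forces the gap to move back to the front

    // The logical text is the part before the gap plus the part after it.
    let text: Vec<u8> = buf.data[..buf.gap_start]
        .iter()
        .chain(buf.data[buf.gap_end..].iter())
        .copied()
        .collect();
    assert_eq!(text, b">hello".to_vec());
}
```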
That's a fair point, and I can see how that would be an issue for something that is talking to unknown clients. Though as you said, that is not really an issue for monolithic applications. Codepoints and graphemes don't have to be mutually exclusive: you can enable arbitrary indexing by codepoints while still preventing splitting graphemes. That is the direction I am leaning now. Though as you point out, this doesn't necessarily need to be "part of" the buffer. The grapheme logic could be separated, and that is probably the smartest way to handle it.

That being said, gap buffers have an extra incentive to use byte indexing that ropes do not: byte indexing is O(1) in a gap buffer, whereas char indexing needs a scan or a secondary index.

One thing that I do think is a mistake is making your indexing scheme graphemes. This is what was done by xi-editor (part of the reason I believe it is so slow) and Swift. The author of xi-editor has a retrospective looking at why he thinks grapheme indexing is a bad idea.
To be fair, moving the gap also adds O(n) overhead in the general case. And I think it's probably pretty common for users to search the whole document, which would force the worst case. The constant factor is lower, of course, because the worst case is the gap being smack in the middle of the document, which would only require moving half the contents rather than the entire contents. But still. (Edit: I just realized, you probably meant space overhead rather than time overhead. Indeed, gap buffers avoid the space overhead, which is a good point.) For documents of typical size, I think xi-rope's approach is actually a reasonable stop-gap solution. But the real solution is a regex engine that can handle non-contiguous text in memory. That would even be preferable for a gap buffer based editor, I think.
IIRC, the main issue is that the regex crate requires a contiguous haystack. My impression is that @BurntSushi plans to eventually migrate all the regex engines to `regex-automata`.
I don't think this is true of xi-editor...? It definitely had facilities for finding and working with grapheme boundaries, but I don't think it indexed by them. Or at least the xi-rope component didn't: it used byte indexing.
Interesting discussion. I have an instinct that there should be some trait API that could wrap either a rope or a plain string slice, and let the regex engine search both. The simplest API would of course be:

```rust
trait StringAccessor {
    fn byte_at(&self, offset: usize) -> u8;
}
```

That would be trivially implemented by slices (by doing array lookups) and probably wouldn't be too bad for ropes, if we cached the current rope chunk. But yeah, there's probably something higher performance that would allow the regex to drive forwards and backwards through the rope, like:

```rust
trait StringAccessor {
    fn read_chunk(&self, start_byte: usize, into: &mut [u8]) -> usize;
}
```

But that would be a slower API to implement on top of a string slice. Anyway, probably all thoughts that have already been said somewhere else in that monster thread.
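For what it's worth, the "trivially implemented by slices" part might look like the sketch below. The `StringAccessor` trait here is the hypothetical one from the comment above (restated so the snippet is self-contained), not a real library API:

```rust
// Sketch: the chunk-style accessor implemented for a plain string slice.
trait StringAccessor {
    fn read_chunk(&self, start_byte: usize, into: &mut [u8]) -> usize;
}

impl StringAccessor for &str {
    fn read_chunk(&self, start_byte: usize, into: &mut [u8]) -> usize {
        // Copy as many bytes as fit into the caller's buffer, starting at
        // `start_byte`, and report how many were written.
        let bytes = &self.as_bytes()[start_byte..];
        let n = bytes.len().min(into.len());
        into[..n].copy_from_slice(&bytes[..n]);
        n
    }
}

fn main() {
    let text = "hello rope";
    let mut buf = [0u8; 4];
    let n = text.read_chunk(6, &mut buf);
    assert_eq!(&buf[..n], &b"rope"[..]);
}
```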
That's super cool! I'm glad to see the performance improve on jumprope. The thing I like most about gap buffers is how simple the implementation can be. I also think a gap buffer like that might be a great fit for something like a text editor in the browser, where you don't care about optimizing for 1 GB text documents, but you do want the compiled WebAssembly to stay tiny. (But maybe adding the index over the top would make it just as complex as jumprope anyway. Eh!)
The hard part isn't the API, unfortunately. The issue is that the internal engine implementations currently assume a contiguous buffer, and it's non-trivial to update them to work with non-contiguous buffers. (I made an attempt at doing so for the regex crate's PikeVM a while back, so I'm speaking from experience.) Designing the API is easy. It's updating all the low-level core code that's tricky.
I want to be clear here that searching streams is not really a problem I'm working on right now. My main goal is getting `regex-automata` out the door.

The NFA engines (PikeVM and backtracker) are harder. They do not currently expose "byte at a time" APIs, so it's not really possible to implement stream searching with how they currently work. Putting aside the backtracker because you probably don't want that anyway (it only works for small regexes/haystacks), the PikeVM is probably the answer to "execute any regex on a stream." But in order to do it, either the API needs to evolve to support streams, or y'all will need to build your own PikeVM. I strongly encourage you to do the latter if the lazy DFA is insufficient for your use case, for a few practical reasons:
@josephg Yes, APIs like that have been proposed in that regex issue thread. Consider though that "stream" doesn't just mean "search on text in memory but that is not stored contiguously." It might also mean "search something that implements `io::Read`."

And indeed, updating the regex engines to support that API is a daunting prospect. A major part of the perf of a regex engine is literal prefilters, and it is crazy difficult to make these work with APIs such as yours (even adaptations that expose slices of bytes). There are many issues, but consider one: you use a prefilter on a slice of bytes and it finds nothing. Now you move on to the next slice and find nothing. But oops, you might have just missed a match that was spread across the boundary of the slices. If you were just running a state machine, then your state machine would be in a specific state after the end of the first slice, and then pick up where it left off at the beginning of the next and find the match. But once you add prefilters to the mix, this conceptual model breaks down. Fundamentally, this is why I think a lot of people consider stream searching to be feasible: their mental model of how the regex engine actually works is extremely imprecise. Even something as simple as Aho-Corasick doesn't fit this mental model, because it too has prefilters.

That Hyperscan manages to do this is no small miracle (take a look at its API reference). Then you have the problem of how you tie all this together. It's tempting to consider your API proposal (or one similar to it) as the "fundamental" API. But as you can see, it doesn't work well with prefilters. So does that mean the internal regex engines need two entirely different code paths? I dunno. I don't know how to square that circle. It is for this reason that I think folks should go off and figure out how to make this work on their own first.
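To make the "pure state machine" half of that argument concrete, here is a toy sketch of searching across chunk boundaries by carrying only the DFA state forward. The `ByteDfa` trait and the toy DFA are hypothetical illustrations, not regex-automata's actual API; the point is that a plain state machine handles boundary-spanning matches naturally, and it's prefilters that break this model:

```rust
// Sketch: chunk-by-chunk search where the only state carried across
// boundaries is the current DFA state.
trait ByteDfa {
    type State: Copy;
    fn start(&self) -> Self::State;
    fn next(&self, state: Self::State, byte: u8) -> Self::State;
    fn is_match(&self, state: Self::State) -> bool;
}

/// True if the DFA reaches a match state anywhere in the concatenation of
/// `chunks`, without ever materializing the text contiguously.
fn search_chunks<D: ByteDfa>(dfa: &D, chunks: &[&[u8]]) -> bool {
    let mut state = dfa.start();
    for chunk in chunks {
        for &b in *chunk {
            state = dfa.next(state, b);
            if dfa.is_match(state) {
                return true;
            }
        }
    }
    false
}

// Toy DFA matching the literal "ab", to exercise the boundary case.
struct AbDfa;
impl ByteDfa for AbDfa {
    type State = u8; // 0 = start, 1 = saw 'a', 2 = matched "ab"
    fn start(&self) -> u8 { 0 }
    fn next(&self, state: u8, byte: u8) -> u8 {
        match (state, byte) {
            (_, b'a') => 1,
            (1, b'b') => 2,
            _ => 0,
        }
    }
    fn is_match(&self, state: u8) -> bool { state == 2 }
}

fn main() {
    // The match for "ab" straddles the chunk boundary and is still found.
    let spanning: [&[u8]; 2] = [b"xxa", b"bxx"];
    assert!(search_chunks(&AbDfa, &spanning));

    let no_match: [&[u8]; 2] = [b"xxx", b"xxx"];
    assert!(!search_chunks(&AbDfa, &no_match));
}
```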
Oh and yeah, I don't commit to timetables for releases. I do hope for
Thanks for all the details, @BurntSushi!
That makes a lot of sense, thanks!

My gut feeling is that for text editors just a straight-up PikeVM is probably the way to go. Granted, it's slower than a DFA, and slower still than a DFA + literal optimizations. But functionally it covers all the use cases, and I suspect it is still more than fast enough, since typically you aren't trying to crunch through an entire repo's worth of text in a text editor. (There's always the occasional huge file, of course, but I feel like those cases are best dealt with by making the search non-blocking.)
Yes, that's what I meant by "I am unlikely to have any bandwidth any time soon to work on the streaming API evolution required." And in particular, I'd really want to see a real implementation working. That will help a lot with API design. And I should also say, it's not clear to me that I will ever evolve the APIs. It might just be something that needs to be done by someone else, or built bespoke for each use case. With respect to using the PikeVM, I'll note these things:
But you are of course right that most searches in a text editor will likely be on small files.
So, just to make sure we're on the same page: what I'm talking about is not the higher-level regex crate API. I'm only talking about changing the lower-level PikeVM in `regex-automata`.

In any case, it sounds like implementing a separate PikeVM is a good way to go, for exploration etc. as you pointed out. Thanks!
Yes, that's what I'm talking about too. With that said, I don't mean this in a "don't talk to me or ask me questions about this" way. I would very much welcome anyone working on this to open a new Discussion on the regex repo where we can chat. Just because I probably won't have the bandwidth any time soon to work on the streaming API use case doesn't mean I won't be amenable to talking about it or even doing smaller forms of API evolution. For example, if you get to a point where you're like, "well I could do things this way, but only if that internal routine in `regex-automata` were exposed," that's exactly the kind of smaller change worth raising.
The two main options for a buffer struct are a gap buffer or a rope. They both come with different tradeoffs. I really like this comment from the author of a Rust rope crate.
Like the author said, gap buffers have great performance, and they are much simpler than other data structures like ropes. Keeping the text contiguous also makes things like regex search much more performant.
Given all that, I think sticking with a gap buffer is the right choice. Most of the complaints about gap buffers are based on misunderstandings.
edit: fixed misquote