`char_range()` function for indexing into source strings #457

davidanthoff · 2024-07-11T09:36:33Z

We have a crash from the VS Code extension that seems to originate from JuliaSyntax. Repo steps are:

pkg> dev Sunny
Run

using JuliaSyntax

src = read(joinpath(homedir(), ".julia/dev/Sunny/test/test_tensors.jl"), String)

tree = parseall(SyntaxNode, src, filename="foo.jl")

i = JuliaSyntax.last_byte(tree[2][end][end])

src[i]

That crashes with

ERROR: StringIndexError: invalid index [2142], valid nearby indices [2140]=>'′', [2143]=>'\r'

I assume that last_byte should always return a valid index, right? So this seems like a bug.

The rest of this issue is some speculation from my end, might all be wrong :)

I started looking a bit around in the JuliaSyntax code, and there seems to be a fair bit of code that assumes 1-byte characters, which strikes me as incorrect? Or maybe this is carefully only done when there is a guarantee that only 1-byte codepoints can appear? An example is

JuliaSyntax.jl/src/parse_stream.jl

Line 519 in 1d95081

span(t::FullToken) = 1 + last_byte(t) - first_byte(t)

Doesn't that (incorrectly) assume that the character at index position last_byte is only one code unit? Generally if I search the repo for - 1 or + 1 I see a fair bit of code where I would have assumed that a prevind or nextind would be needed?

The text was updated successfully, but these errors were encountered:

KristofferC · 2024-07-11T11:53:33Z

Why would indexing into a string with the last_byte make sense? Seems the same conceptually as doing "α"[sizeof("α")].

davidanthoff · 2024-07-11T12:10:37Z

Yeah, I also realized, I should probably be using range instead of using first_byte and last_byte. At least for range I assume it should return valid indices, right? My PR would fix that, the repo for the problem with range is:

using JuliaSyntax

src = read(joinpath(homedir(), ".julia/dev/Sunny/test/test_tensors.jl"), String)

tree = parseall(SyntaxNode, src, filename="foo.jl")

code_range = range(tree[2][end][end])

src[code_range]

which also errors.

DaniGlez · 2024-07-13T11:06:16Z

If anyone else is facing this issue, a way to bypass it until there's a permanent fix is to find your lines ending in non-ASCII Unicode characters and make that no longer the case.

MilesCranmer · 2024-07-15T19:54:31Z

Thanks @DaniGlez!

Here's a regexp (VSCode flavour) for finding non-ASCII characters that end lines:

[^\x00-\x7f]$

c42f · 2024-07-16T06:45:53Z

there seems to be a fair bit of code that assumes 1-byte characters

JuliaSyntax works on byte indices internally and it's (usually) quite careful to convert to valid character indices in the occasional cases where this is required.

So unicode generally works and there's a pile of tests for this in test/source_files.jl

However there must be a bug in an edge case for non-ascii at line end

c42f · 2024-07-16T07:25:28Z

Nice properties of byte ranges are that
(a) the green tree exactly covers the source buffer with unit ranges, with no gaps
(b) the ranges (in principle) require no further interpreting of unicode chars to pull out a string (not yet exploited in practice! as this bug shows!)

davidanthoff · 2024-07-16T09:20:29Z

I think all I'm looking for is some function that returns a range that has valid string indices and can be used directly with the underlying string. Does that exist? If I understand @c42f's comment in my PR correct, then Base.range (to be renamed to byte_range) is also not supposed to do that, so is there any other way?

As a user of this, I would also say that at the level of SyntaxNodes, users are probably thinking in terms of string indices, not byte indices. I understand that byte stuff makes sense at the green tree level, but isn't that an abstraction leakage if I have to deal with this stuff when dealing with SyntaxNodes? I think it would be nice if there was a (documented) API at the SyntaxNode level that just operates on string index level.

For now I'm going to hack around all of this downstream, we can't have this bug crash in the wild, but it would be much nicer if there was some "official" way to get these indices.

c42f · 2024-07-17T05:05:23Z

I think all I'm looking for is some function that returns a range that has valid string indices and can be used directly with the underlying string. Does that exist?

Not exactly. Here's what exists so far:

JuliaSyntax uses SyntaxNode ranges in conjunction with SourceFile which does the index conversion (as well as other useful stuff like line and column number computation).
If you just want the source text covered by the node you can use sourcetext(node).

If I understand @c42f's comment in my PR correct, then Base.range (to be renamed to byte_range) is also not supposed to do that

Correct. By design, JuliaSyntax works in terms of byte ranges and only does the conversion to character indices when necessary.

For now I'm going to hack around all of this downstream

Ok. If you can point me at the place you're using this downstream, I'm sure we can work out the right design to unhack things in the future :-)

davidanthoff · 2024-07-17T22:21:17Z

My hack is https://github.com/julia-vscode/TestItemDetection.jl/blob/e19b96801a10718ca39b2c15cad18d0aae123993/src/packagedef.jl#L2. Essentially I just want that :)

In terms of design, I generally just guess that "end-users" of this package that use SyntaxNode want to deal with string indices, not byte indices. From the consumer's point of view, the whole byte index stuff strikes me as an implementation detail that I shouldn't really be exposed to in terms of the API I deal with :)

I saw the SourceFile type. We have a very similar thing in JuliaWorkspaces, and it is a bit different now than I originally described. Would be nice if we could unify all of that at some point to use the same types.

c42f · 2024-07-18T02:02:12Z

Essentially I just want that :)

Right - this is pretty much what SourceFile does here:

JuliaSyntax.jl/src/source_files.jl

Line 114 in a63e8bb

j = prevind(source.code, last(rng) + 1 - source.byte_offset)

Except that SourceFile also supports a byte offset (for when we are logically only parsing a piece of the text, not the whole thing). By the way, SourceFile has prevind, etc, supported.

I feel the core issue here is the question "what's a good representation of source code?" SourceFile isn't the whole answer, and neither is a string (probably). Essentially we're back to wanting a proper answer to #190 ?

c42f · 2024-07-18T02:04:57Z

For now I guess a quick fix would be to add a char_range() function which would do what your our_range() does? It's not wonderful (not a solution to #190), but it would be an improvement.

For GreenNode, char_range would need additional file::SourceFile and position::Int arguments. Or some such.

davidanthoff · 2024-07-18T08:43:39Z

For now I guess a quick fix would be to add a char_range() function

Yes, I think that would be great.

For GreenNode, char_range would need additional file::SourceFile and position::Int arguments.

I think it would also be reasonable to say that the "char"-based interface is only available at the SyntaxNode level, if someone is digging down to the green node level, they just have to deal with this themselves? That is the more low level interface in any case.

This was referenced Jul 11, 2024

TestItemRunner v1.0.0 breaks compatibility with Julia 1.10 release branch? julia-vscode/TestItemRunner.jl#82

Open

Ensure range returns valid indices #458

Closed

c42f mentioned this issue Jul 31, 2024

Release 1.0.0 #472

Open

c42f changed the title ~~Incorrect unicode handling~~ char_range() function for indexing into source strings Jul 31, 2024

c42f mentioned this issue Aug 8, 2024

Implement char_range() to respect string indices #486

Merged

c42f closed this as completed in #486 Aug 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`char_range()` function for indexing into source strings #457

`char_range()` function for indexing into source strings #457

davidanthoff commented Jul 11, 2024

KristofferC commented Jul 11, 2024 •

edited

Loading

davidanthoff commented Jul 11, 2024

DaniGlez commented Jul 13, 2024

MilesCranmer commented Jul 15, 2024 •

edited

Loading

c42f commented Jul 16, 2024

c42f commented Jul 16, 2024

davidanthoff commented Jul 16, 2024

c42f commented Jul 17, 2024

davidanthoff commented Jul 17, 2024

c42f commented Jul 18, 2024

c42f commented Jul 18, 2024 •

edited

Loading

davidanthoff commented Jul 18, 2024

char_range() function for indexing into source strings #457

char_range() function for indexing into source strings #457

Comments

davidanthoff commented Jul 11, 2024

KristofferC commented Jul 11, 2024 • edited Loading

davidanthoff commented Jul 11, 2024

DaniGlez commented Jul 13, 2024

MilesCranmer commented Jul 15, 2024 • edited Loading

c42f commented Jul 16, 2024

c42f commented Jul 16, 2024

davidanthoff commented Jul 16, 2024

c42f commented Jul 17, 2024

davidanthoff commented Jul 17, 2024

c42f commented Jul 18, 2024

c42f commented Jul 18, 2024 • edited Loading

davidanthoff commented Jul 18, 2024

`char_range()` function for indexing into source strings #457

`char_range()` function for indexing into source strings #457

KristofferC commented Jul 11, 2024 •

edited

Loading

MilesCranmer commented Jul 15, 2024 •

edited

Loading

c42f commented Jul 18, 2024 •

edited

Loading