Fix the handling of multibyte characters #2051

NotFounds · 2024-05-15T16:43:36Z

Motivation

refs: #2021 (comment)

When a Ruby file contains multibyte characters (such as Japanese, Chinese, or emoji), the go to definition and hover features do not work correctly, because the document referencing logic does not properly handle multibyte characters when calculating offsets.

This commit fixes the issue by:

* Modifying document referencing to properly handle multibyte characters when mapping between positions and offsets
* Adding test cases to verify that go to definition works with multibyte characters

Implementation

Modified RubyLsp::Document and RubyLsp::Requests::Request to calculate locations with multibyte characters taken into account. This change uses the API added to Prism in "Add code unit APIs to location" (ruby/prism#2406).
By passing the encoding when creating the Index, the index is built with multibyte characters taken into account.

Automated Tests

I added some tests for the definition and RubyLsp::RubyDocument.

Manual Tests

The definition jump for the following code snippet from #1347 works in this branch, but does not function correctly in the main branch.

class Test
  TEST = 'test'

  def method1
    # テスト
    pp TEST
  end
end
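
To see why this snippet trips up offset-based lookups, here is a minimal sketch in plain Ruby (no Prism involved; the variable names are illustrative only). It assumes the file is UTF-8 and compares the character offset of the `TEST` reference with the byte offset of the same token.

```ruby
# Illustrative only: compares character and byte offsets for the snippet above.
source = <<~RUBY
  class Test
    TEST = 'test'

    def method1
      # テスト
      pp TEST
    end
  end
RUBY

# Character offset of the TEST reference on the `pp` line.
char_offset = source.index("TEST", source.index("pp TEST"))
# Byte offset of the same token: the byte size of everything that precedes it.
byte_offset = source[0, char_offset].bytesize

puts "character offset: #{char_offset}"
puts "byte offset:      #{byte_offset}"
puts "difference:       #{byte_offset - char_offset}"
```

Each of the three katakana characters in the comment takes three bytes in UTF-8, so the two offsets drift apart by six from that point on. Checking a character-based position against a byte-based range can therefore miss the node.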

andyw8 · 2024-05-15T17:15:38Z

cc @kddnewton in case you have any views on the Prism usage here.

andyw8 · 2024-05-15T17:18:19Z

lib/ruby_lsp/requests/request.rb

-      sig { params(location: Prism::Location, position: T.untyped).returns(T::Boolean) }
-      def cover?(location, position)
+      sig { params(location: Prism::Location, position: T.untyped, encoding: Encoding).returns(T::Boolean) }
+      def cover?(location, position, encoding)


Could we pass the encoding to the Request initializer, and save it as an instance variable, so that we don't need to pass it around in methods as much?

Thanks for your quick review.
For now, the requests have no common attributes. We only need encoding for Requests::Hover and Requests::Definition, perhaps. So, I think it's a bit too much for the request initializer to have an encoding attribute.

However, if it’s global_state, it might be okay since it’s already used by multiple requests. Alternatively, we can pass the global_state to the request initializer instead of encoding. If it is global_state, it wouldn't be odd to maintain the state within the request, in my opinion. What do you think?

~~@andyw8 Since the position is now pre-calculated using byte offsets, there is no longer a need to pass the encoding to each method.~~

~~ref: be648a6~~

@andyw8 @vinistock
Sorry for the delay. Upon reconsideration, I believe it might be better not to have the Request initializer hold encoding/global_state in this PR. Only two subclasses currently need these attributes, and making this change would considerably widen the scope of the modifications. What do you think?

kddnewton · 2024-05-15T18:01:54Z

lib/ruby_lsp/document.rb

@@ -142,16 +142,20 @@ def locate(node, char_position, node_types: [])

        # Skip if the current node doesn't cover the desired position
        loc = candidate.location
-        next unless (loc.start_offset...loc.end_offset).cover?(char_position)


start_offset and end_offset are a lot less costly to calculate than start_code_units_offset and end_code_units_offset. Could we instead convert char_position into a byte offset and keep the existing code the same?

Thanks for your comments.
In my understanding, the char_position is provided by the editor, so we can't convert it into a byte offset :(
Alternatively, we can memoize loc.start_code_units_offset and the related values to reduce the calculation cost.

We have the variable though, so we can convert it using our own transformation right? I think the general transformation would be:

UTF-8: source.slice(0, code_units_offset).length

UTF-16: code_units_offset / 2

UTF-32: code_units_offset / 4

I'm not 100% sure, but I think that's it.
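
One way to make the idea of converting the editor's position concrete, independent of the exact formulas above, is to walk the line character by character. The sketch below is a hypothetical helper (`code_units_to_byte_offset` is not part of ruby-lsp or Prism) and assumes the line is a UTF-8 string and that the column is counted in code units of the negotiated encoding.

```ruby
# Hypothetical helper, not part of ruby-lsp or Prism.
# Converts a column counted in code units of `encoding` into a byte offset
# within `line`, which is assumed to be a UTF-8 string.
# Use the LE/BE variants (e.g. Encoding::UTF_16LE) so that no BOM is emitted.
def code_units_to_byte_offset(line, code_units, encoding)
  unit_size = "a".encode(encoding).bytesize # bytes per code unit: 1, 2 or 4
  consumed = 0
  bytes = 0

  line.each_char do |char|
    break if consumed >= code_units

    # Code units this character occupies in the editor's encoding.
    consumed += char.encode(encoding).bytesize / unit_size
    # Bytes this character occupies in the UTF-8 source.
    bytes += char.bytesize
  end

  bytes
end

line = "x = 'テスト'; y"
puts code_units_to_byte_offset(line, 11, Encoding::UTF_16LE)
```

The walk is linear in the length of the line, which is the cost being weighed in this thread against calling the code unit APIs on every candidate node.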

@kddnewton
Apologies for the delayed response. I have changed the code to convert the position into a byte offset in this commit: be648a6. Please review it.

kddnewton · 2024-05-15T18:02:38Z

lib/ruby_lsp/requests/request.rb

        start_covered =
          location.start_line - 1 < position[:line] ||
          (
            location.start_line - 1 == position[:line] &&
-              location.start_column <= position[:character]
+              location.start_code_units_column(encoding) <= position[:character]


Same question as above: could we convert position[:character] into a byte offset before we check cover? here?

NotFounds · 2024-06-04T16:49:58Z

It appears that from 'be648a6' onwards, it no longer works correctly on VSCode.
I'll try to fix it. I would appreciate any advice you can provide.

vinistock

The idea to convert the editor's position to a byte offset is interesting from a performance perspective, but I'm not sure it's the right approach. It seems reversed in my opinion.

For example, if we fix the locations in the indexer to use code units (which is required for us to return selection ranges properly), then implementing features like find references and rename could prove a bit weird because we would need to compare a byte offset with a code unit.

Or we would need to not apply the same conversion only in those cases.

I think I'd rather favour consistency of using the code unit APIs everywhere for now and if we find performance issues we can try to address those.

It's also relevant to mention that the code unit APIs are only less performant when the source contains non-ASCII characters, since those require more complicated handling. For ASCII-only sources, the performance should be pretty much the same.
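
The ASCII-only point can be illustrated with a small sketch. This is a hypothetical helper (`code_units_length` is an illustrative name) and not Prism's actual implementation; it only shows why ASCII-only text needs no transcoding to count code units.

```ruby
# Hypothetical helper illustrating the ASCII fast path; not Prism's implementation.
# Returns the length of `text` in code units of `encoding`.
def code_units_length(text, encoding)
  # ASCII-only text occupies one byte and one code unit per character in
  # UTF-8, UTF-16 and UTF-32, so the byte size is already the answer.
  return text.bytesize if text.ascii_only?

  # Otherwise transcode and divide by the size of a single code unit.
  unit_size = "a".encode(encoding).bytesize
  text.encode(encoding).bytesize / unit_size
end

puts code_units_length("pp TEST", Encoding::UTF_16LE)
puts code_units_length("# テスト", Encoding::UTF_16LE)
puts code_units_length("# テスト", Encoding::UTF_8)
```

Only the second branch pays for transcoding, which matches the observation that the extra cost appears once non-ASCII characters are present.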

vinistock · 2024-06-04T17:19:37Z

lib/ruby_lsp/requests/request.rb

These changes look style related only. Can we remove them to make it easier to review?

NotFounds · 2024-06-05T00:51:06Z

@vinistock
Thanks for your comment!

> I think I'd rather favour consistency of using the code unit APIs everywhere for now and if we find performance issues we can try to address those.

It sounds good to me. I'll reset or revert the commits related to byte offsets (be648a6..ff4d196).
Additionally, I'll look into refactoring the code based on this review comment (#2051 (comment)).

github-actions · 2024-09-01T12:42:17Z

This pull request is being marked as stale because there was no activity in the last 2 months

NotFounds · 2024-09-16T00:29:52Z

@vinistock
I would appreciate it if you could kindly provide your comments on this PR. Should we wait for the implementation of ResourceUri?

vinistock · 2024-09-23T13:57:21Z

@NotFounds Hi! I'm sorry, I got caught up with other priorities and dropped the ball on this one. Let's move forward with this PR before the resource URI changes since this fixes actual issues and the resource URI approach is mostly a correctness refactor.

Can you please re-open this PR (or a new one, whatever is easier) and I'll help push it over the finish line?

Let's take the approach of saving the code units with the new Prism API in the index. So essentially, we need to:

1. Configure the index with the encoding that was negotiated between editor and server. After setting the global state encoding here, you want to grab the @index.configuration object and set the encoding there, so that we can check what is the encoding being used during indexing
2. You then need to pass the encoding (or maybe the entire config object?) to the declaration listener, where we will use the encoding to invoke the Prism location API location.start_code_units_column(encoding) to get the proper locations for multibyte characters
3. Finally, we should add a few tests to ensure that we don't accidentally regress. One test per entity type should be okay. These would be:

NotFounds · 2024-09-24T03:12:14Z

@vinistock
Thank you for your response. I'm glad we can move forward with the implementation of this feature! Since the changes seem significant, I'll create a new Pull Request.

I have a few questions:

1. In the declaration listener, should I retrieve the encoding using something like @index.configuration.encoding? I planned to include the encoding in the Configuration in step 1.
2. Is considering multibyte code units during indexing primarily for performance reasons? (Just curious)

vinistock · 2024-09-24T14:02:42Z

> In the declaration listener, should I retrieve the encoding using something like @index.configuration.encoding? I planned to include the encoding in the Configuration in step 1.

Yes, that's correct. Let's pass the entire configuration object to the declaration listener. That way, if more configurations are added, it will already be able to access all of them.
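
A rough sketch of that wiring, under loud assumptions: `SketchConfiguration` and `SketchDeclarationListener` are hypothetical stand-ins and not the actual ruby-lsp classes. The column is counted by hand here so the example runs without Prism, whereas the real listener would ask Prism via location.start_code_units_column(encoding).

```ruby
# Hypothetical stand-ins for the index configuration and the declaration
# listener; the real ruby-lsp classes have different, larger interfaces.
class SketchConfiguration
  attr_accessor :encoding

  def initialize
    @encoding = Encoding::UTF_8
  end
end

class SketchDeclarationListener
  def initialize(configuration)
    # Keep the whole configuration object so later settings are visible too.
    @configuration = configuration
  end

  # Column at which `name` starts in `line`, counted in code units of the
  # encoding currently stored in the configuration.
  def start_column(line, name)
    index = line.index(name)
    return nil unless index

    encoding = @configuration.encoding
    unit_size = "a".encode(encoding).bytesize
    line[0, index].encode(encoding).bytesize / unit_size
  end
end

configuration = SketchConfiguration.new
listener = SketchDeclarationListener.new(configuration)

# The server would set this once the encoding has been negotiated.
configuration.encoding = Encoding::UTF_16LE
puts listener.start_column("テスト = 1; FOO = 2", "FOO")
```

Because the listener holds the configuration object rather than a copy of the encoding, a value set after the listener is constructed is still picked up when locations are computed.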

> Is considering multibyte code units during indexing primarily for performance reasons? (Just curious)

It's unfortunately a trade-off between performance and correctness. If you have as much as a single multibyte character in a document, the entire source must be considered multibyte and computing the locations is significantly more expensive. If you don't have any multibyte characters, then Prism uses an ASCII-only optimization which is much faster.

However, from a correctness standpoint, we need to be able to index declarations made using multibyte characters, especially for languages that need these characters like Japanese.

Also, having the correct code unit locations means that all features that depend on index entries will just work (hover, definition, completion, signature help, workspace symbol).

NotFounds requested a review from a team as a code owner May 15, 2024 16:43

NotFounds requested review from andyw8 and vinistock May 15, 2024 16:43

NotFounds mentioned this pull request May 15, 2024

Fix go to definition and hover for files containing multibyte characters #2021

Closed

andyw8 reviewed May 15, 2024

andyw8 added the bugfix and server labels May 15, 2024

kddnewton reviewed May 15, 2024

NotFounds requested review from kddnewton and andyw8 June 4, 2024 16:06

NotFounds marked this pull request as draft June 4, 2024 16:26

vinistock reviewed Jun 4, 2024

Merge branch 'main' into fix-handling-of-multibyte-characters

052d674

NotFounds force-pushed the fix-handling-of-multibyte-characters branch from ff4d196 to 052d674 Compare June 5, 2024 14:18

NotFounds marked this pull request as ready for review June 18, 2024 08:15

Merge branch 'main' into fix-handling-of-multibyte-characters

f8381e0

NotFounds requested a review from vinistock July 3, 2024 06:29

vinistock mentioned this pull request Aug 9, 2024

Replace IndexablePath with ResourceUri concept #2423

Merged

github-actions bot added the Stale label Sep 1, 2024

github-actions bot closed this Sep 15, 2024

vinistock mentioned this pull request Sep 23, 2024

Definition jumps are not possible with files containing Japanese characters. #1347

Closed

NotFounds mentioned this pull request Sep 25, 2024

Handle multibyte characters in indexing #2619

Merged
