Fix the handling of multibyte characters #2051

Open
NotFounds wants to merge 3 commits into main

Motivation

refs: #2021 (comment)

When a Ruby file contains multibyte characters (such as Japanese, Chinese, or emoji), the go to definition and hover features do not work correctly, because the document referencing logic does not properly handle multibyte characters when calculating offsets.

This commit fixes the issue by:

  • Modifying document referencing to properly handle multibyte characters when mapping between positions and offsets
  • Adding test cases to verify that go to definition works with multibyte characters

Implementation

  1. Modified RubyLsp::Document and RubyLsp::Requests::Request to calculate locations considering multibyte characters. This change uses the code unit APIs added to Prism in ruby/prism#2406 (Add code unit APIs to location).
  2. By passing the encoding when creating the Index, the index will be built taking multibyte characters into account.

Automated Tests

I added some tests for the definition and RubyLsp::RubyDocument.

Manual Tests

The definition jump for the following code snippet from #1347 works in this branch, but does not function correctly in the main branch.

class Test
  TEST = 'test'

  def method1
    # テスト
    pp TEST
  end
end
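
To illustrate why this snippet trips up offset calculations, here is a minimal sketch (plain Ruby, not code from this PR) that measures the offset of the `TEST` reference in characters, UTF-8 bytes, and UTF-16 code units. The variable names are made up for the example, and the comment text is written with `\u` escapes so the script does not depend on the file's encoding.

```ruby
# The snippet above; "\u30C6\u30B9\u30C8" is the three-character katakana comment.
source = <<~RUBY
  class Test
    TEST = 'test'

    def method1
      # \u30C6\u30B9\u30C8
      pp TEST
    end
  end
RUBY

# Offset of the TEST reference on the `pp TEST` line, measured three ways.
char_offset = source.index("pp TEST") + 3
prefix = source[0, char_offset]

byte_offset = prefix.bytesize                                # UTF-8 bytes
utf16_units = prefix.encode(Encoding::UTF_16LE).bytesize / 2 # UTF-16 code units

puts "characters:        #{char_offset}"
puts "UTF-8 bytes:       #{byte_offset}"
puts "UTF-16 code units: #{utf16_units}"

# Slicing by bytes only finds TEST when the byte offset is used.
puts source.byteslice(byte_offset, 4)
```

Each katakana character takes three bytes in UTF-8 but one UTF-16 code unit, so the byte offset is 6 larger than the other two counts. Comparing an editor position directly against a byte offset therefore points a few characters away from the intended node.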

Contributor

andyw8 commented May 15, 2024

cc @kddnewton in case you have any views on the Prism usage here.

- sig { params(location: Prism::Location, position: T.untyped).returns(T::Boolean) }
- def cover?(location, position)
+ sig { params(location: Prism::Location, position: T.untyped, encoding: Encoding).returns(T::Boolean) }
+ def cover?(location, position, encoding)
Contributor

@andyw8 andyw8 May 15, 2024

Could we pass the encoding to the Request initializer, and save it as an instance variable, so that we don't need to pass it around in methods as much?

Author

Thanks for your quick review.
For now, the requests have no common attributes. We only need encoding for Requests::Hover and Requests::Definition, perhaps. So, I think it's a bit too much for the request initializer to have an encoding attribute.

However, if it’s global_state, it might be okay since it’s already used by multiple requests. Alternatively, we can pass the global_state to the request initializer instead of encoding. If it is global_state, it wouldn't be odd to maintain the state within the request, in my opinion. What do you think?

Author

@NotFounds NotFounds May 23, 2024

@andyw8 Since the position is now pre-calculated using byte offsets, there is no longer a need to pass the encoding to each method.

ref: be648a6

Author

@andyw8 @vinistock
Sorry for the delay. Upon reconsideration, I believe it might be better not to have the Request initializer hold encoding/global_state in this PR. Only two subclasses currently need these attributes. Making this change would lead to a large scope of modifications. What do you think?

@andyw8 andyw8 added the bugfix and server labels May 15, 2024
@@ -142,16 +142,20 @@ def locate(node, char_position, node_types: [])

# Skip if the current node doesn't cover the desired position
loc = candidate.location
next unless (loc.start_offset...loc.end_offset).cover?(char_position)
Contributor

start_offset and end_offset are a lot less costly to calculate than start_code_units_offset and end_code_units_offset. Could we instead convert char_position into a byte offset and keep the existing code the same?

Author

Thanks for your comments.
In my understanding, the char_position is provided by the editor, so we can't convert it into a byte offset :(
Alternatively, we can memoize loc.start_code_units_offset and the related values to reduce the amount of calculation.

Contributor

We have the variable though, so we can convert it using our own transformation right? I think the general transformation would be:

  • UTF-8: source.slice(0, code_units_offset).length
  • UTF-16: code_units_offset / 2
  • UTF-32: code_units_offset / 4

I'm not 100% sure, but I think that's it.
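
As a way to sanity-check how code units relate to bytes in each encoding, here is a small sketch in plain Ruby. `code_unit_length` is a hypothetical helper written for this note, not part of ruby-lsp or Prism. It encodes the text and divides the byte size by the width of one code unit.

```ruby
# Hypothetical helper: number of code units `text` occupies in `encoding`.
def code_unit_length(text, encoding)
  unit_size =
    case encoding
    when Encoding::UTF_8 then 1
    when Encoding::UTF_16LE, Encoding::UTF_16BE then 2
    when Encoding::UTF_32LE, Encoding::UTF_32BE then 4
    else raise ArgumentError, "unsupported encoding: #{encoding}"
    end

  # The LE/BE variants are used so that no byte order mark is counted.
  text.encode(encoding).bytesize / unit_size
end

# One character each of 1, 2, 3 and 4 bytes in UTF-8.
sample = "a\u00E9\u3042\u{1F600}"

[Encoding::UTF_8, Encoding::UTF_16LE, Encoding::UTF_32LE].each do |encoding|
  puts "#{encoding}: #{code_unit_length(sample, encoding)} code units"
end
```

For this four-character sample the counts are 10 (UTF-8), 5 (UTF-16) and 4 (UTF-32): only UTF-32 code units match the character count, and only UTF-8 code units match the byte count of a UTF-8 source.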

Author

@kddnewton
Apologies for the delayed response. I have changed the code to convert positions to byte offsets in this commit: be648a6. Please review it.

start_covered =
location.start_line - 1 < position[:line] ||
(
location.start_line - 1 == position[:line] &&
- location.start_column <= position[:character]
+ location.start_code_units_column(encoding) <= position[:character]
Contributor

Same question as above here, could we convert position[:character] into a byte offset before we check cover? here?

when Encoding::UTF_8
source.slice(0, char_position).bytesize
when Encoding::UTF_16, Encoding::UTF_16LE, Encoding::UTF_16BE
char_position * 2
Author

Do we need to consider surrogate pairs?

Contributor

I think we're okay. It's not actually a character position coming back from vscode, it's the number of code units. In UTF-16, a code point can be represented using 1 code unit (regular) which is 2 bytes or 2 code units (surrogate) which is 4 bytes. So in theory this should always work.

I think I may have been incorrect about the code units for UTF-8 though. I think maybe it's just the same as the number of bytes. We'll need to verify this though. We'll definitely need some testing for UTF-8 and UTF-32.

Since this logic is duplicated in 2 places, can you make a new object where this code can be shared?
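
For reference, one possible shape for such a shared conversion is sketched below. This is an illustration only, not the PR's implementation: `code_units_to_byte_offset` is a hypothetical name, and the sketch assumes the source string is held in memory as UTF-8 and that the incoming offset counts code units of the negotiated position encoding.

```ruby
# Hypothetical helper, not the PR's implementation: convert an offset counted in
# code units of `encoding` into a byte offset within the UTF-8 string `source`.
def code_units_to_byte_offset(source, code_units, encoding)
  case encoding
  when Encoding::UTF_8
    code_units # a UTF-8 code unit is exactly one byte
  when Encoding::UTF_32LE, Encoding::UTF_32BE
    source[0, code_units].bytesize # one code unit per code point
  when Encoding::UTF_16LE, Encoding::UTF_16BE
    remaining = code_units
    bytes = 0
    source.each_char do |char|
      break if remaining <= 0

      # Code points above U+FFFF need a surrogate pair (two code units).
      remaining -= char.ord > 0xFFFF ? 2 : 1
      bytes += char.bytesize
    end
    bytes
  else
    raise ArgumentError, "unsupported encoding: #{encoding}"
  end
end

# "# <emoji> TEST": the emoji is 4 bytes in UTF-8 and 2 code units in UTF-16.
line = "# \u{1F600} TEST"

utf16_offset = code_units_to_byte_offset(line, 5, Encoding::UTF_16LE)
utf32_offset = code_units_to_byte_offset(line, 4, Encoding::UTF_32LE)

puts line.byteslice(utf16_offset, 4)
puts line.byteslice(utf32_offset, 4)
```

In this sketch the UTF-16 branch walks the characters instead of multiplying by a fixed width, because the number of UTF-8 bytes per UTF-16 code unit varies from character to character.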

Author

Thank you for answering. I will work on making the logic common and adding tests.
I'm not very familiar with the directory structure and rules of ruby-lsp, but is it okay if I add the implementation to lib/ruby_lsp/utils.rb?

Author

I have refactored it and added some tests.
3cbb882 ff4d196

server.global_state.apply_options({
capabilities: {
general: {
positionEncodings: [LanguageServer::Protocol::Constant::PositionEncodingKind::UTF8],
Author

In this pull request, the server is initialized in tests.
So we explicitly use UTF-8 for testing.

Author

In the previous implementation, I couldn't find any place where UTF-8 was explicitly specified.
However, given that the tests have been passing so far, it seems that UTF-8 was being used.

Member

The encoding was set in the initialize method to UTF-8, but indeed if we invoke apply_options with no arguments, it will change to UTF-16.

module_function

sig { params(source: String, char_position: Integer, encoding: Encoding).returns(Integer) }
def convert_to_byte_offset_position(source, char_position, encoding)
Author

I was unsure whether to implement this in utils.rb.
Please let me know if you think there is a more appropriate place.

@NotFounds NotFounds marked this pull request as draft June 4, 2024 16:26
@NotFounds
Author

It appears that from 'be648a6' onwards, it no longer works correctly on VSCode.
I'll try to fix it. I would appreciate any advice you can provide.

Member

@vinistock vinistock left a comment

The idea to convert the editor's position to a byte offset is interesting from a performance perspective, but I'm not sure it's the right approach. It seems reversed in my opinion.

For example, if we fix the locations in the indexer to use code units (which is required for us to return selection ranges properly), then implementing features like find references and rename could prove a bit weird because we would need to compare a byte offset with a code unit.

Or we would need to not apply the same conversion only in those cases.

I think I'd rather favour consistency of using the code unit APIs everywhere for now and if we find performance issues we can try to address those.

It's also relevant to mention that the code unit APIs are only less performant when there are unicode characters, since they require more complicated handling. For ASCII only sources, the performance should be pretty much the same.
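
One property of ASCII-only text that is relevant here: byte counts, character counts and UTF-16 code unit counts all coincide, so no conversion work is needed. The sketch below checks that in plain Ruby. `offsets_coincide?` is a hypothetical helper for illustration and says nothing about how Prism implements its code unit APIs.

```ruby
# Hypothetical check: do byte, character and UTF-16 code unit counts agree?
def offsets_coincide?(text)
  text.bytesize == text.length &&
    text.length == text.encode(Encoding::UTF_16LE).bytesize / 2
end

ascii_line = "pp TEST"
katakana_line = "# \u30C6\u30B9\u30C8"

puts offsets_coincide?(ascii_line)    # ASCII only
puts offsets_coincide?(katakana_line) # contains multibyte characters
puts ascii_line.ascii_only?
```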

Member

These changes look style related only. Can we remove them to make it easier to review?

@NotFounds
Author

@vinistock
Thanks for your comment!

> I think I'd rather favour consistency of using the code unit APIs everywhere for now and if we find performance issues we can try to address those.

That sounds good to me. I'll reset or revert the commits related to byte offsets (be648a6..ff4d196).
Additionally, I'll look into refactoring the code based on this review comment (#2051 (comment)).

@NotFounds NotFounds force-pushed the fix-handling-of-multibyte-characters branch from ff4d196 to 052d674 Compare June 5, 2024 14:18
@NotFounds NotFounds marked this pull request as ready for review June 18, 2024 08:15