Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the handling of multibyte characters #2051

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 10 additions & 4 deletions lib/ruby_lsp/document.rb
Original file line number Diff line number Diff line change
Expand Up @@ -146,16 +146,20 @@ def locate(node, char_position, node_types: [])

# Skip if the current node doesn't cover the desired position
loc = candidate.location
next unless (loc.start_offset...loc.end_offset).cover?(char_position)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start_offset and end_offset are a lot less costly to calculate than start_code_units_offset and end_code_units_offset. Could we instead convert char_position into a byte offset and keep the existing code the same?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments.
In my understanding, the char_position is provided by the editor, so we can't convert it into a byte offset :(
Alternatively, we can memorize the loc.start_code_units_offset and others to reduce the calculation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the variable though, so we can convert it using our own transformation right? I think the general transformation would be:

  • UTF-8: source.slice(0, code_units_offset).length
  • UTF-16: code_units_offset / 2
  • UTF-32: code_units_offset / 4

I'm not 100% sure, but I think that's it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kddnewton
Apologies for the delayed response. I have fixed the issue to convert byte offsets in this commit: be648a6. Please review it.

next unless (loc.start_code_units_offset(encoding)...loc.end_code_units_offset(encoding)).cover?(char_position)

# If the node's start character is already past the position, then we should've found the closest node
# already
break if char_position < loc.start_offset
break if char_position < loc.start_code_units_offset(encoding)

# If the candidate starts after the end of the previous nesting level, then we've exited that nesting level and
# need to pop the stack
previous_level = nesting_nodes.last
nesting_nodes.pop if previous_level && loc.start_offset > previous_level.location.end_offset
if previous_level
start_offset = loc.start_code_units_offset(encoding)
end_offset = previous_level.location.end_code_units_offset(encoding)
nesting_nodes.pop if start_offset > end_offset
end

# Keep track of the nesting where we found the target. This is used to determine the fully qualified name of the
# target when it is a constant
Expand All @@ -182,7 +186,9 @@ def locate(node, char_position, node_types: [])

# If the current node is narrower than or equal to the previous closest node, then it is more precise
closest_loc = closest.location
if loc.end_offset - loc.start_offset <= closest_loc.end_offset - closest_loc.start_offset
target_length = loc.end_code_units_offset(encoding) - loc.start_code_units_offset(encoding)
closest_length = closest_loc.end_code_units_offset(encoding) - closest_loc.start_code_units_offset(encoding)
if target_length <= closest_length
parent = closest
closest = candidate
end
Expand Down
3 changes: 2 additions & 1 deletion lib/ruby_lsp/requests/definition.rb
Original file line number Diff line number Diff line change
Expand Up @@ -75,9 +75,10 @@ def initialize(document, global_state, position, dispatcher, typechecker_enabled
target,
parent,
position,
global_state.encoding,
)
elsif target.is_a?(Prism::CallNode) && target.name != :require && target.name != :require_relative &&
!covers_position?(target.message_loc, position)
!covers_position?(target.message_loc, position, global_state.encoding)
# If the target is a method call, we need to ensure that the requested position is exactly on top of the
# method identifier. Otherwise, we risk showing definitions for unrelated things
target = nil
Expand Down
3 changes: 2 additions & 1 deletion lib/ruby_lsp/requests/hover.rb
Original file line number Diff line number Diff line change
Expand Up @@ -52,9 +52,10 @@ def initialize(document, global_state, position, dispatcher, typechecker_enabled
T.must(target),
T.must(parent),
position,
global_state.encoding,
)
elsif target.is_a?(Prism::CallNode) && target.name != :require && target.name != :require_relative &&
!covers_position?(target.message_loc, position)
!covers_position?(target.message_loc, position, global_state.encoding)

target = nil
end
Expand Down
27 changes: 17 additions & 10 deletions lib/ruby_lsp/requests/request.rb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look style related only. Can we remove them to make it easier to review?

Original file line number Diff line number Diff line change
Expand Up @@ -18,19 +18,19 @@ def perform; end
private

# Checks if a location covers a position
sig { params(location: Prism::Location, position: T.untyped).returns(T::Boolean) }
def cover?(location, position)
sig { params(location: Prism::Location, position: T.untyped, encoding: Encoding).returns(T::Boolean) }
def cover?(location, position, encoding)
Copy link
Contributor

@andyw8 andyw8 May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we pass the encoding to the Request initializer, and save it as an instance variable, so that we don't need to pass it around in methods as much?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your quickly review.
For now, the requests have no common attributes. We only need encoding for Requests::Hover and Requests::Definition, perhaps. So, I think it's a bit too much for the request initializer to have an encoding attribute.

However, if it’s global_state, it might be okay since it’s already used by multiple requests. Alternatively, we can pass the global_state to the request initializer instead of encoding. If it is global_state, it wouldn't be odd to maintain the state within the request, in my opinion. What do you think?

Copy link
Author

@NotFounds NotFounds May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyw8 Since the position is now pre-calculated using byte offsets, there is no longer a need to pass the encoding to each method.

ref: be648a6

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyw8 @vinistock
Sorry for the delay. Upon reconsideration, I believe it might be better not to have the Request initializer hold encoding/global_state in this PR. Only two subclasses currently need these attributes. Making this change would lead to a large scope of modifications. What do you think?

start_covered =
location.start_line - 1 < position[:line] ||
(
location.start_line - 1 == position[:line] &&
location.start_column <= position[:character]
location.start_code_units_column(encoding) <= position[:character]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as above here, could we convert position[:character] into a byte offset before we check cover? here?

)
end_covered =
location.end_line - 1 > position[:line] ||
(
location.end_line - 1 == position[:line] &&
location.end_column >= position[:character]
location.end_code_units_column(encoding) >= position[:character]
)
start_covered && end_covered
end
Expand All @@ -50,15 +50,16 @@ def cover?(location, position)
target: Prism::Node,
parent: Prism::Node,
position: T::Hash[Symbol, Integer],
encoding: Encoding,
).returns(Prism::Node)
end
def determine_target(target, parent, position)
def determine_target(target, parent, position, encoding)
return target unless parent.is_a?(Prism::ConstantPathNode)

target = T.let(parent, Prism::Node)
parent = T.let(T.cast(target, Prism::ConstantPathNode).parent, T.nilable(Prism::Node))

while parent && cover?(parent.location, position)
while parent && cover?(parent.location, position, encoding)
target = parent
parent = target.is_a?(Prism::ConstantPathNode) ? target.parent : nil
end
Expand All @@ -67,17 +68,23 @@ def determine_target(target, parent, position)
end

# Checks if a given location covers the position requested
sig { params(location: T.nilable(Prism::Location), position: T::Hash[Symbol, T.untyped]).returns(T::Boolean) }
def covers_position?(location, position)
sig do
params(
location: T.nilable(Prism::Location),
position: T::Hash[Symbol, T.untyped],
encoding: Encoding,
).returns(T::Boolean)
end
def covers_position?(location, position, encoding)
return false unless location

start_line = location.start_line - 1
end_line = location.end_line - 1
line = position[:line]
character = position[:character]

(start_line < line || (start_line == line && location.start_column <= character)) &&
(end_line > line || (end_line == line && location.end_column >= character))
(start_line < line || (start_line == line && location.start_code_units_column(encoding) <= character)) &&
(end_line > line || (end_line == line && location.end_code_units_column(encoding) >= character))
end
end
end
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
{
"result": [
{
"uri": "file:///fixtures/multibyte_characters.rb",
"range": {
"start": {
"line": 2,
"character": 2
},
"end": {
"line": 8,
"character": 5
}
}
}
],
"params": [
{
"line": 1,
"character": 7
}
]
}
2 changes: 2 additions & 0 deletions test/fixtures/multibyte_characters_reference.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
# テスト
A動物::C貓咪.叫
9 changes: 9 additions & 0 deletions test/requests/definition_expectations_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,15 @@ def run_expectations(source)
),
),
)
index.index_single(
RubyIndexer::IndexablePath.new(
"#{Dir.pwd}/lib",
File.expand_path(
"../../test/fixtures/multibyte_characters.rb",
__dir__,
),
),
)

server.process_message(
id: 1,
Expand Down
17 changes: 17 additions & 0 deletions test/ruby_document_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -558,6 +558,23 @@ def baz
assert_equal(["Foo", "Bar"], node_context.nesting)
end

def test_locate_when_multibyte_characters
document = RubyLsp::RubyDocument.new(source: <<~RUBY, version: 1, uri: URI("file:///foo/bar.rb"))
module A動物
class B猫
def 鳴く
puts "にゃー"
end
end
end
RUBY

node_context = document.locate_node({ line: 1, character: 8 })
found = node_context.node
assert_equal("B猫", T.cast(found, Prism::ConstantReadNode).location.slice)
assert_equal(["A動物", "B猫"], node_context.nesting)
end

def test_reparsing_without_new_edits_does_nothing
text = "def foo; end"

Expand Down
Loading