Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ruby] Allow non-ASCII character as heredoc delimiter initial character. #234

Closed
zufuliu opened this issue Apr 3, 2024 · 2 comments
Closed
Labels
committed Issue fixed in repository but not in release ruby Caused by the ruby lexer

Comments

@zufuliu
Copy link
Contributor

zufuliu commented Apr 3, 2024

The heredoc test case in issue #233 in UTF-8 mode is not highlighted correctly.

# encoding: utf-8
# currently OK
puts <<A
#{1+2}
A中

# currently BAD
puts <<中
#{1+2}

follow the comment "heredoc_identifier routine" at

lexilla/lexers/LexRuby.cxx

Lines 948 to 952 in cd5d866

if (!(strchr("\"\'`_-~", chNext2) || isSafeAlpha(chNext2))) {
// It's definitely not a here-doc,
// based on Ruby's lexer/parser in the
// heredoc_identifier routine.
// Nothing else to do.

static inline int
is_identchar(struct parser_params *p, const char *ptr, const char *MAYBE_UNUSED(ptr_end), rb_encoding *enc)
{
    return rb_enc_isalnum((unsigned char)*ptr, enc) || *ptr == '_' || !ISASCII(*ptr);
}

Here is the patch, changed SCE_RB_WORD to use the new isSafeAlphaOrHigh() function to match above is_identchar().
ruby-here-doc-0403.zip

diff --git a/lexers/LexRuby.cxx b/lexers/LexRuby.cxx
index d4bf314c..f5e1aec3 100644
--- a/lexers/LexRuby.cxx
+++ b/lexers/LexRuby.cxx
@@ -48,6 +48,10 @@ inline bool isSafeAlpha(char ch) noexcept {
     return (isSafeASCII(ch) && isalpha(ch)) || ch == '_';
 }
 
+inline bool isSafeAlphaOrHigh(char ch) noexcept {
+	return isHighBitChar(ch) || isalpha(ch) || ch == '_';
+}
+
 inline bool isSafeAlnum(char ch) noexcept {
     return (isSafeASCII(ch) && isalnum(ch)) || ch == '_';
 }
@@ -613,7 +617,7 @@ bool sureThisIsNotHeredoc(Sci_Position lt2StartPos, Accessor &styler) {
         j += 1;
     }
 
-    if (isSafeAlnum(styler[j])) {
+    if (isSafeAlnumOrHigh(styler[j])) {
         // Init target_end because some compilers think it won't
         // be initialized by the time it's used
         target_start = target_end = j;
@@ -622,7 +626,7 @@ bool sureThisIsNotHeredoc(Sci_Position lt2StartPos, Accessor &styler) {
         return definitely_not_a_here_doc;
     }
     for (; j < lengthDoc; j++) {
-        if (!isSafeAlnum(styler[j])) {
+        if (!isSafeAlnumOrHigh(styler[j])) {
             if (target_quote && styler[j] != target_quote) {
                 // unquoted end
                 return definitely_not_a_here_doc;
@@ -877,7 +881,7 @@ void ColouriseRbDoc(Sci_PositionU startPos, Sci_Position length, int initStyle,
                 styler.ColourTo(i - 1, state);
                 state = SCE_RB_NUMBER;
                 is_real_number = true;
-            } else if (isHighBitChar(ch) || iswordstart(ch)) {
+            } else if (isSafeAlphaOrHigh(ch)) {
                 styler.ColourTo(i - 1, state);
                 state = SCE_RB_WORD;
             } else if (ch == '#') {
@@ -945,7 +949,7 @@ void ColouriseRbDoc(Sci_PositionU startPos, Sci_Position length, int initStyle,
                 chNext = chNext2;
                 styler.ColourTo(i, SCE_RB_OPERATOR);
 
-                if (!(strchr("\"\'`_-~", chNext2) || isSafeAlpha(chNext2))) {
+                if (!(strchr("\"\'`_-~", chNext2) || isSafeAlphaOrHigh(chNext2))) {
                     // It's definitely not a here-doc,
                     // based on Ruby's lexer/parser in the
                     // heredoc_identifier routine.

Note: with the patch applied, when the test case is in DBCS code page, heredoc is not terminated due to DBCS characters been skipped, something like following screenshot:
image

@nyamatongwe nyamatongwe added the ruby Caused by the ruby lexer label Apr 5, 2024
@nyamatongwe
Copy link
Member

Hunk 4 starting line 877/881 appears unrelated and to have no effect.

@zufuliu
Copy link
Contributor Author

zufuliu commented Apr 5, 2024

Yes, it's unrelated, just a convenient change that matches parse_ident() at https://github.com/ruby/ruby/blob/master/parse.y#L10669

nyamatongwe pushed a commit that referenced this issue Apr 6, 2024
@nyamatongwe nyamatongwe added the committed Issue fixed in repository but not in release label Apr 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
committed Issue fixed in repository but not in release ruby Caused by the ruby lexer
Projects
None yet
Development

No branches or pull requests

2 participants