Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Bad behaviour with combination of: StringScanner#scan, multiline regexp, multiple lines #2616

Open
hmdne opened this issue Nov 25, 2023 · 2 comments · May be fixed by #2610
Open

Bug: Bad behaviour with combination of: StringScanner#scan, multiline regexp, multiple lines #2616

hmdne opened this issue Nov 25, 2023 · 2 comments · May be fixed by #2610
Labels

Comments

@hmdne
Copy link
Member

hmdne commented Nov 25, 2023

Describe the bug

When giving StringScanner#scan a multiline (//m) regexp, instead of scanning just the next character, it scans characters after a newline too.

In Ruby, multiline modifier changes semantics of . to mean "any character", not "any character except \n".

In JavaScript instead, it changes semantics of ^ and $ to how they work in Ruby even without //m, so they match beginning/ending either of string or line. Except... there's no \A or \z.

Opal's StringScanner#scan implementation adds ^ to the beginning of the supplied regexp and copies the flags.

All combined cause a hard-to-discover bug. I don't have simple solutions for that issue, maybe except of using lookbehind to simulate \A in Opal's clone of StringScanner, but that only recently has been implemented in Safari.

Opal version: master

To Reproduce

[user@localhost mnt]# opal -rstrscan -e 'StringScanner.new("hello\nhello").tap { |i| p 10.times.map { i.scan(/h/m) } }'
["h", "h", "h", "h", "h", "h", "h", nil, nil, nil]
[user@localhost mnt]# ruby -rstrscan -e 'StringScanner.new("hello\nhello").tap { |i| p 10.times.map { i.scan(/h/m) } }'
["h", nil, nil, nil, nil, nil, nil, nil, nil, nil]
[user@localhost mnt]# 
@hmdne hmdne added the bug label Nov 25, 2023
@takaram
Copy link
Member

takaram commented Nov 26, 2023

The root cause seems to be confusing m flag of Ruby and JavaScript.

Actually m flag of Ruby only changes the semantics of . and does not change ^ or $.
Equivalent flag in JavaScript is s.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions#advanced_searching_with_flags

@hmdne
Copy link
Member Author

hmdne commented Nov 26, 2023

Yes @takaram, I was also mistaken about this in the first version of the post. Later I edited it to clarify. But I wasn't aware of the JS s flag. #2610 fixes this issue by transpiling the regexps (but still needs some work due to performance impact). As of now, . is transpiled to [\s\S] if m flag is present. I foresee an issue with things like Regexp.union if we chose to depend on the flags.

@hmdne hmdne linked a pull request Nov 28, 2023 that will close this issue
hmdne added a commit to plurimath/plurimath-js that referenced this issue Nov 28, 2023
This patch cherry-picks updated version of opal/opal#2610. This
update cleaned up Opal's regexp implementation quite a lot to fix
opal/opal#2616 which in turn caused that we got 2 more tests passed
on Plurimath and only the obvious failures remain on Parslet (that
is after re-enabling some previously disabled tests, most of which
pass now). This patch also removes the fix from #22, as it's not
needed anymore.

This PR has been sponsored by Ribose Inc.
@hmdne hmdne reopened this Nov 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants