Avoid ambiguity in regexp-based extraction #211

nicolo-ribaudo · 2025-06-11T15:58:53Z

This PR changes:

the non-regexp-based logic to only allow single-line comments (matching browser implementations)
the regexp-based approach to start from the end and bail out whenever there is either code or a comment containing `, ', ", or /.

The notes still need to be updated.
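
As a rough illustration of the second bullet, here is a minimal TypeScript sketch of a scan that starts from the last line and bails out on code or on an ambiguous comment. It is not the spec algorithm: the function name `extractFromEnd`, the directive pattern, and the choice to apply the character check to every comment line (including the one carrying the URL) are assumptions made for this example.

```typescript
// Sketch only: these names and patterns are assumptions, not spec text.
const DISALLOWED = /[`'"\/]/; // the four characters listed in the PR description
const DIRECTIVE = /^[#@]\s*sourceMappingURL=(\S+)\s*$/; // assumed directive shape

function extractFromEnd(source: string): string | null {
  const lines = source.split(/\r\n|[\n\r\u2028\u2029]/);
  for (let i = lines.length - 1; i >= 0; i--) {
    const line = lines[i].trim();
    if (line === "") continue;                 // blank line: keep scanning upward
    if (line.slice(0, 2) !== "//") return null; // code or a block comment: bail out
    const comment = line.slice(2);
    if (DISALLOWED.test(comment)) return null; // possibly ambiguous comment: bail out
    const match = DIRECTIVE.exec(comment);
    if (match !== null) return match[1];       // first directive found from the end
    // Otherwise this is an unambiguous ordinary comment; try the previous line.
  }
  return null;
}
```

Under these assumptions, a trailing `//# sourceMappingURL=app.js.map` line yields `app.js.map`, while a directive followed by code, or a directive line that ends with a backtick, yields `null`. Read literally, the check also rejects any URL that contains `/`.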

sokra · 2025-06-11T16:32:04Z

spec.emu

@@ -1350,7 +1350,7 @@
        <emu-alg>
          1. Let _tokens_ be the List of tokens obtained by parsing _source_ according to <emu-xref href="#sec-ecmascript-language-lexical-grammar">ECMA-262's lexical grammar</emu-xref>.
          1. For each nonterminal _token_ in _tokens_, in reverse order, do
-            1. If _token_ is not |SingleLineComment| or |MultiLineComment|, return *null*.
+            1. If _token_ is not |SingleLineComment|, return *null*.


Also check for disallowed chars here so that both implementations match up.

sokra · 2025-06-11T16:32:36Z

spec.emu

@@ -1376,25 +1376,17 @@
                1. Set _position_ to _position_ + 1.
                1. If _second_ is U+002F (SOLIDUS), then
                  1. Let _comment_ be the substring of _lineStr_ from _position_ to _lineLength_.
+                  1. If _comment_ contains the code point U+0022 (QUOTATION MARK), U+0027 (APOSTROPHE), U+002F (SOLIDUS), or U+0060 (GRAVE ACCENT), then


/ -> */
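
To make the difference concrete, the following sketch compares the check as drafted in the hunk with one possible reading of the suggestion above, namely rejecting only the two-character sequence `*/` instead of any SOLIDUS. Both function names are hypothetical, and whether the quote characters stay in the list under the suggestion is an assumption.

```typescript
// Illustrative predicates; neither name appears in the spec.

// As drafted: reject ", ', / and ` anywhere in the comment.
function containsDisallowedAsDrafted(comment: string): boolean {
  return /["'\/`]/.test(comment);
}

// One reading of "/ -> */": keep rejecting the quote characters,
// but reject a SOLIDUS only when it closes a block comment ("*/").
function containsDisallowedWithSuggestion(comment: string): boolean {
  return /["'`]/.test(comment) || comment.indexOf("*/") !== -1;
}
```

With these definitions, the comment text `# sourceMappingURL=maps/app.js.map` is rejected by the first predicate and accepted by the second, while a comment containing `*/` or a quote character is rejected by both.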

sokra · 2025-06-11T16:36:24Z

spec.emu

@@ -1350,7 +1350,7 @@
        <emu-alg>
          1. Let _tokens_ be the List of tokens obtained by parsing _source_ according to <emu-xref href="#sec-ecmascript-language-lexical-grammar">ECMA-262's lexical grammar</emu-xref>.
          1. For each nonterminal _token_ in _tokens_, in reverse order, do
-            1. If _token_ is not |SingleLineComment| or |MultiLineComment|, return *null*.


Does it have to skip over whitespace tokens here?

jkup · 2025-06-11T16:41:38Z

Some context (cc @DanielRosenwasser)

When we were first exploring the //# sourceMappingURL concept last year, we proposed adding to the official specification that you must parse the file, find the last comment, and check its URL. We quickly got feedback from the VSCode team that this approach would be too slow for them and that they need to quickly extract the comment URL without parsing the entire source code.

So when we presented the specification to TC39 last year, we pitched offering both approaches. We got some pushback that the regex approach should have the same functionality as the parsing approach (never returning different results), or at least that one approach should be a subset of the other. This PR is @nicolo-ribaudo's attempt at updating the regex approach to align with that goal, but we wanted to make sure that this approach still works well for TypeScript and VSCode.

Avoid ambiguity in regexp-based extraction

f53df0b

nicolo-ribaudo marked this pull request as draft June 11, 2025 15:58

sokra reviewed Jun 11, 2025
