[html] prevent that embedded grammars can corrupt syntax highlighting of rest of document #20488

octref · 2017-02-12T00:15:53Z

Currently, when writing embedded language, a lot of the times the embedded grammar can cause disruption to the rest of DOCUMENT:

This is html grammar.

Imagine editing css in a <style> at the <head> of a long html file. The rest of document will be constantly re-tokenized and (wrongly) syntax highlighted. This doesn't look good UX-wise and will probably have some perf impact.

For the following grammar and setting:

- name: source.css.embedded.html
   begin: <style>
   end: </style>
   patterns:
   - include: source.css

"embeddedLanguages": {
  "source.css": "css"
}

I hope VSCode can only apply CSS grammar to the section between <style> and </style>, so when I'm typing out my incomplete CSS VSCode would just re-tokenize and highlight until </style>, but not rest of the document.

/cc @alexandrudima

Also would like to hear @aeschli's opinion, as you are writing the html grammar.

The text was updated successfully, but these errors were encountered:

rebornix · 2017-02-14T19:18:41Z

At the moment I'm not sure if it's possible to modify the grammar to make this happen, but the whole tokenization relies on it. The grammar is really fragile, like even if you write a complete CSS style, right now we are still seeing the closing angle bracket <style> as part of the CSS (it can be fixed by #20610).

@aeschli 's HTML extension doesn't rely on the TM Grammar to figure out the embedded language boundary, which is not supported yet in the extension host as well. The extension parses the file and calculate the boundaries itself and it's perfectly correct. That's why the intellisense works reasonably even though the color is wrong. So the general feature request can be, if the extension itself knows about the tokens better than Grammar file does (very likely if there is a language service behind it), can the extension feed Code with the tokens information ?

octref · 2017-02-14T19:46:23Z

So the general feature request can be, if the extension itself knows about the tokens better than Grammar file does (very likely if there is a language service behind it), can the extension feed Code with the tokens information ?

Not really, for this issue I just want:

Embedded grammar shouldn't tokenize anything after end.
So when I have autoclose off, this kind of flicking color wouldn't happen. This especially bothers me on large html files.

aeschli · 2017-02-15T09:20:49Z

This has to do with the grammars involved here:
https://github.com/textmate/html.tmbundle and https://github.com/atom/language-css.

I just checked that TextMate behaves the same.

octref · 2017-02-15T16:39:17Z

@aeschli The grammar can behave the same. But I believe embedded grammar support differs from editor to editor (Sublime seems to have its own implementation). So I'm wondering if VSCode can change embedded grammar to this way:

For a rule with end, match end first
If end is found, tokenize section between start and end with embedded grammar
If end is not found, match for next rule

Now it's like:

Match the start of a rule
If that rule indicates embedded grammar, use embedded grammar to tokenize forward on
When no rule match, return to match the rule's end

However, some grammars when embedded could get "stuck" in step 2 and cause this flicking.

@alexandrudima Correct me if I'm wrong...

aeschli · 2017-02-20T09:48:51Z

Remember that the text mate grammar sees a line at a time. Mostly it can't see the </style> tag at the time it enters <style>.
Once it starts matching tokens in between and new states are entered, the end rule is no longer active, that's just how TextMate is designed. I don't think we should try to 'improve' that. We want to be 100% TextMate compatible.
The HTML language server has a proper parser that reads the whole file and knows the exact rules on when embedded ranges are ended. But running the parser is more expensive and runs asynchronously in its own process. That doesn't work well with tokenizing, which runs in the renderer and needs to update quickly, as you type.

What could be done is writing a HTML grammar that contains rules for embedded css and JavaScript, in particular the appropriate end rules for </style> and </script> tags.

alexdima · 2017-02-20T11:57:39Z

@aeschli 👍

TM grammars are parsed according to TM semantics.

TM has no notion of an embedded language, it simply has the notion of including rules from other scopes (from other files).

There are TM grammars split into multiple files that do not represent embedded languages (there is one I think for perl regular expressions or one for C platform functions).

Our only option to stay TM grammar compatible is to do what @aeschli suggests, where we have a special CSS grammar that expects it to be embedded inside HTML and has end rules for </style> everywhere. The alternative would be to fork TM parsing semantics, in which case I would do something entirely different...

octref · 2017-02-20T20:33:26Z

That's what sublime have:

https://www.sublimetext.com/docs/3/syntax.html

As an alternative, is it possible for us to have such a "pop" rule?

What could be done is writing a HTML grammar that contains rules for embedded css and JavaScript, in particular the appropriate end rules for </style> and </script> tags.

Our only option to stay TM grammar compatible is to do what @aeschli suggests, where we have a special CSS grammar that expects it to be embedded inside HTML and has end rules for </style> everywhere.

It's possible to modify the CSS grammar to make it expects itself to be embedded. But I don't know any way to do that by only modifying HTML grammar now in VSCode.

And I believe this is a problem shared by embedded language support. I don't think it's reasonable to ask each language's grammar to expect itself to be embedded. For example in Vue's single file component, one can embed css/sass/scss/less/stylus/postcss. I hope there is a way to mark the end of embedded region in HTML, instead of adding a negative lookahead to each grammar rule in these 6 languages.

onixie · 2017-03-29T14:00:52Z

The problem bothers me too. It is so easy to corrupt the syntax highlighting of the whole rest document when embedded language involves.

mjbvz · 2017-05-02T17:28:51Z

Markdown had a similar problem for fenced codeblocks. The solution was to switch from a begin +end rule to begin+while rule. Here's an example rule: https://github.com/Microsoft/vscode/blob/e105d5cc998b18e35ec8b2ca8b340a2bf2bdb3db/extensions/markdown/syntaxes/markdown.tmLanguage#L584

This case in HTML is a bit more complicated as you can also have single line style / script blocks. I believe these would have to be handled separately since a while rule would not work in that case

octref · 2017-05-03T17:27:31Z

@mjbvz Thanks I'll give it a try.

BTW I think a better way to write TM grammar is to write in JSON: https://github.com/octref/vetur/blob/master/syntaxes/vue.YAML

Then use https://github.com/SublimeText/PackageDev to compile them to XML.
Just in case you get tired of editing XML.

aeschli · 2017-09-12T19:50:53Z

Closing this issue as we don't want to diverge from TextMate.

aeschli · 2017-09-13T08:53:06Z

Reopening to collect issues around this issue.

SDP190 · 2018-06-03T18:48:02Z

To my opinion if VSCode cannot make TextMate grammars work as expected for html , VSCode team should consider other workarounds, or even use a different grammar at all.

That is, any html coder would expect to see syntax highlighting as a main feature of an editor, it is sometimes the easiest and simplest way to debug the code, breaking the highlight when using embedded grammar is a tough issue.

octref · 2018-06-06T22:16:29Z

Just constant reminder why this is indispensable for writing embedded grammars:

vuejs/vetur#811
vuejs/vetur#812

andremacola · 2018-08-17T02:33:44Z

VsCode always had serious problems with syntax highlight not only in PHP/HTML. This is annoying issue that prevent me to switch from ST3.

KamasamaK · 2018-08-21T19:47:12Z

I see Sublime's solution mentioned, but fyi Atom is addressing this with their new Tree-sitter grammar, which supports injection.

~~EDIT: Actually, it doesn't seem that their implementation solves this issue.~~

michaelblyons · 2018-08-25T17:48:24Z

Working on an HTML embedded syntax of my own, I have now had the (mis?)fortune to discover this ticket. I think the most compelling¹ example is the incomplete CSS one from #20488 (comment):

<style>
h1 {
</style>

Any eager rule in the injected syntax definition disregards the end-embed tag. This is also true for C#, where <% this.worksFine(); %> but opening a brace clobbers the end-injection tag: <% if(you.haveProblems()) { %> <p>Sorry!</p>

There are at least two ways to fix this.

In Sublime Text, there are two constructs² that make an effective end regex take precedence over any included patterns. This means that it's injected to the left in all nested patterns until the stack recedes to uncover it. I understand why this is annoying for the engine design, but it is critical for injectable languages.

This does have a limitation, though, wherein contexts where the end pattern should not apply still jump back to the parent: <%= foo["uh-oh%>"] %>
In my specific case, if there were meta (or other) scopes specific to C# brace blocks, I could set up injections to push a grandchild text.html.basic when encountering the closing tag %> in text.html meta.embedded source.cs meta.block. If the code is valid, the closing braces <% } %> would all eventually rectify themselves.

In a sense, this is a more accurate representation of the scope stack, but it doesn't help the folks with dangling braces like the CSS above. (It also doesn't help me since the C# syntax definition doesn't put curly-brace blocks on the scope stack.)

The Markdown syntax takes advantage of the fact that it knows the end tag will be the only thing on the line for its while escape.

@KamasamaK Can you elaborate on why you thought tree-sitter would solve this problem and why it turns out not to?

I suspect that the PHP one is less significant. It might be possible to avert it by adding a L:text.html.php source.css scope to the end of line 112. ↩
with_prototype and embed/escape ↩

KamasamaK · 2018-08-27T21:46:41Z

@michaelblyons Sorry, it actually does work as intended for what is implemented. I was testing in Atom v1.29.0 but it wasn't implemented until v1.30, which is still in beta. Also, embedded CSS still does not work at all.

The reason I thought it would work is because the tree-sitter grammars are used to actually parse the full code into a syntax tree. For embedded languages, it would be using distinct grammars to parse those languages and (I think) maintaining those trees independently which should mean that they do not affect each other directly. Also, tree-sitter has inherent error recovery.

octref · 2018-08-28T20:31:37Z

Here is what Atom provides: https://flight-manual.atom.io/hacking-atom/sections/creating-a-grammar/#language-injection. The embedded language can specify an injectionRegex that's being read by the outer language grammar.

msftrncs · 2018-09-21T02:37:27Z

I want to add, that this problem is not unique to embedding grammers, such as JS or CSS in HTML.

It also occurs with complex and quirky languages such as Batch Script. Its related in that some commands, like the comment, consume the rest of the line. This can be handled with an exception to the line end, but usually its a hack that doesn't fully conform with the language.

Instance:

IF DEFINED A (
	echo hello
	echo a=012) else echo no a
echo no next

The echo can consume the entire line, up to certain characters, but only up to the ) character if the echo appears inside a block as shown. Notice how the following doesn't color correctly here:

echo a=012) else echo no a

and to demonstrate no )

echo a=012 else echo no a

The colorizer here has no way to know if its in a block, but assumes it is, and that else echo are keywords. Myself I would prefer the scoping of the entire contents of the echo as 'string.unquoted' because that is what it is. This statement outside of a block (and not part of an IF block) would echo:

a=012) else echo no a

The following one actually tripped me up, and proper handling (coloring or a syntax checker, but even a syntax checker in batchfile is difficult) in the editor would have saved me some trouble:

:: * Check if already performed copy today, if so, no more copies, else clean-up any previous day flags
IF EXIST "%~5\Replication.Flags\%YEARMONTHDAY%" (
    ECHO:%DATE% %TIME%-Replication already completed for today! (%YEARMONTHDAY%)>>"%LOGFILE%"
    GOTO EXITFLAG
)

The closing ) in the ECHO statement closed the block, and the script always jumped to EXITFLAG regardless of the condition. This isn't the only nearly identical ECHO statement in the script, but its the only one in a block, and they all looked exactly alike. (again, my preference for making ECHO scope as an unquoted string, it just might have been more apparent if the block ending ) would have appeared differently than the ones treated as echo text)

And to make matters worse, this doesn't happen everywhere. a REM cannot be broken by any of the special characters, but REM: (among others) can be (unless they're in double quotes).

The SET statement is even better. It also consumes an entire line, and you can start string or block constructs (with /A), but they always terminate at the end of the line, whether the end construct came or not, except lack of a block close causes an error, which can cause confusion if you are already in a block. Not hard to handle, since you just use SET specific block and quoting rules, but handling of outer blocks is the same problem.

I tried using such a rule as to have the block rule capture all the line, up to but not including another open block, or string quote, or the group end, parsing the capture, and then including only additional group or string items, but this hasn't worked so far, because it will prevent other block constructs (IF or FOR) from scoping their whole block. Even if I could get this to work right, it will probably break line continuation, if not other things.

msftrncs · 2018-09-25T06:37:10Z

Having spent a few more days trying to make the current batchfile (batch script) syntax closer to how 'CMD' actually reacts, and I am pretty sure this is what TextMate and the tmLanguage system needs:

a new block construct meant for the purpose of capturing multiple lines (still processed one line at a time) for sub-tokenizing together in a specified scope(s).
allow for inclusion of patterns that further capture lines, preventing the matching the blocks end point.
Then tokenize this captured content via the specified scope(s) all the captured document, from end of the beginning match to the beginning of the end match, again 1 line at a time.

Maybe use BEGIN_SUBPARSE/END_SUBPARSE/SUBPARSE_PATTERNS with just PATTERNS being the exception patterns that consume the document to prevent the end from matching, and SUBPARSE_PATTERNS being the scopes to apply to the content.

Standard repository items could be used for the inclusion patterns, but any scoping they would do would be ignored.

This actually matches how CMD processes input, specifically where grouping is involved. It normally reads in 1 line a time, unless in that line it finds a group that hasn't ended, then it reads ahead further lines only checking for exceptions to finding the group end, and then returns to processing the collected lines one command at a time. A command on a line by itself may process its line completely different than when a preceding group consumes it.

For instance, by itself:

SET "avar=" hello) & echo !!!"

the SET command tokenizes the line using the outer quotes, stripping them, resulting in
avar, =, and " hello) & echo !!!

But

IF DEFINED avar THEN (echo Yes) ELSE (
SET "avar=" hello) & echo !!!"

will, if 'avar' is not defined, cause 'avar' to be erased (but its not even defined) as the set is restricted to tokenizing avar, =, and then echoing !!!"

Now, if you take the arithmetic option, you get a new issue, as parenthesis are permitted for arithmetic grouping. But these are not to be confused with the code groups, and in fact, CMD only accepts opening of groups at certain places, so the arithmetic grouping of the SET /A command do not count, and thus the first closing parenthesis will be taken as the closing of the command group, rather than the closing of an arithmetic set.

tomByrer · 2019-05-02T05:06:40Z

Any progress on this please?

HugoGranstrom · 2021-06-25T14:34:43Z

I'll chip in that I have had huge problems regarding this as well and for the life of me haven't understood why. Disappointingly enough, it seems to boil down to the fact that embedded languages are inherently broken because the devs don't want to make additions on top of the TextMate grammar. That is a real shame because it turns embedded languages into a random mess, sometimes it works, sometimes it doesn't, you never know which. There have been several valid proposals (copying pop from sublime for example) that just seems to be ignored just to "conform to Textmate grammar no matter what even if it means having a broken feature"...

joaomoreno assigned octref, alexdima, aeschli and rebornix and unassigned octref Feb 14, 2017

aeschli added this to the Backlog milestone Feb 20, 2017

aeschli added the feature-request Request for new features or functionality label Feb 21, 2017

alexdima removed their assignment Mar 28, 2017

mjbvz added the languages-basic Basic language support issues label May 2, 2017

mjbvz self-assigned this May 2, 2017

octref mentioned this issue May 3, 2017

Update grammar to better accommodate embedded languages vuejs/vetur#174

Closed

ezekg mentioned this issue Aug 20, 2017

.tmLanguage & VS Code Support mde/ejs#293

Open

aeschli closed this as completed Sep 12, 2017

aeschli changed the title ~~Apply embedded language grammar only to section between begin and end~~ [html] embedded grammars corrupt syntax highlighting of rest of document Sep 13, 2017

aeschli reopened this Sep 13, 2017

aeschli added the debt Code quality issues label Sep 13, 2017

aeschli changed the title ~~[html] embedded grammars corrupt syntax highlighting of rest of document~~ [html] prevent that embedded grammars can corrupt syntax highlighting of rest of document Sep 13, 2017

aeschli mentioned this issue Sep 13, 2017

[html] syntax highlighting broken if there is a syntax error in the style tag #13112

Closed

rebornix removed their assignment Jun 4, 2018

michaelblyons mentioned this issue Aug 29, 2018

Certain HTML file crashes Atom (w/ Tree Sitter) atom/language-html#211

Closed

1 task

aeschli mentioned this issue Oct 4, 2018

Incorrect comment format for HTML #59854

Closed

myfonj mentioned this issue Feb 5, 2019

JSDoc @example tag breaks syntax highlighting #67916

Closed

octref mentioned this issue Feb 15, 2019

Element attribute break highlighting when using "`" on single line reference values. vuejs/vetur#1095

Closed

3 tasks

aeschli mentioned this issue Mar 27, 2019

[html] embedded java script grammar doesn't end because of missing semicolon #70614

Open

Xanewok mentioned this issue Apr 7, 2019

Highlight code snippets in doc comments rust-lang/vscode-rust#393

Open

aeschli mentioned this issue Apr 23, 2019

larger HTML with mixed JS mode forces all after to be JS #72010

Closed

tomByrer mentioned this issue May 2, 2019

JSX bracket improvement johnpapa/vscode-winteriscoming#22

Closed

aeschli mentioned this issue May 2, 2019

Ctrl+/ no longer properly comments out elements in HTML #73143

Closed

matter123 mentioned this issue May 6, 2019

meta.processor and meta.struct are never closed jeff-hykin/better-cpp-syntax#30

Closed

dannymcgee mentioned this issue Dec 12, 2019

Certain valid expression fragments in Angular attributes break syntax highlighting dannymcgee/vscode-ng-html#4

Closed

ayazhafiz mentioned this issue Jan 11, 2020

Template Syntax Highlighting broken angular/vscode-ng-language-service#541

Closed

KamasamaK mentioned this issue Feb 19, 2020

HTML syntax highlighting broken by inline script tag #90942

Closed

bjorhn mentioned this issue Apr 8, 2021

Syntax highlighting support for embedded C# snippets in daf files data-automation-framework/daf-core#10

Open

aeschli mentioned this issue Dec 19, 2022

HTML: some part of big file is parsed wrong #144479

Closed

flowchartsman mentioned this issue Nov 2, 2023

Apostrophes (single quotes) triggering weird highlighting. asciidoctor/asciidoctor-vscode#804

Open

ruschaaf mentioned this issue May 15, 2024

Edge case when SQL comment followed by end of string ruschaaf/extended-embedded-languages#8

Closed

ruschaaf mentioned this issue Sep 13, 2024

Embedded languages syntax highlighting stopped working in 1.92.0 #224978

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[html] prevent that embedded grammars can corrupt syntax highlighting of rest of document #20488

[html] prevent that embedded grammars can corrupt syntax highlighting of rest of document #20488

octref commented Feb 12, 2017

rebornix commented Feb 14, 2017

octref commented Feb 14, 2017 •

edited

Loading

aeschli commented Feb 15, 2017

octref commented Feb 15, 2017

aeschli commented Feb 20, 2017 •

edited

Loading

alexdima commented Feb 20, 2017

octref commented Feb 20, 2017

onixie commented Mar 29, 2017 •

edited

Loading

mjbvz commented May 2, 2017 •

edited

Loading

octref commented May 3, 2017

aeschli commented Sep 12, 2017

aeschli commented Sep 13, 2017

SDP190 commented Jun 3, 2018

octref commented Jun 6, 2018

andremacola commented Aug 17, 2018

KamasamaK commented Aug 21, 2018 •

edited

Loading

michaelblyons commented Aug 25, 2018 •

edited

Loading

KamasamaK commented Aug 27, 2018

octref commented Aug 28, 2018 •

edited

Loading

msftrncs commented Sep 21, 2018

msftrncs commented Sep 25, 2018

tomByrer commented May 2, 2019

HugoGranstrom commented Jun 25, 2021

[html] prevent that embedded grammars can corrupt syntax highlighting of rest of document #20488

[html] prevent that embedded grammars can corrupt syntax highlighting of rest of document #20488

Comments

octref commented Feb 12, 2017

rebornix commented Feb 14, 2017

octref commented Feb 14, 2017 • edited Loading

aeschli commented Feb 15, 2017

octref commented Feb 15, 2017

aeschli commented Feb 20, 2017 • edited Loading

alexdima commented Feb 20, 2017

octref commented Feb 20, 2017

onixie commented Mar 29, 2017 • edited Loading

mjbvz commented May 2, 2017 • edited Loading

octref commented May 3, 2017

aeschli commented Sep 12, 2017

aeschli commented Sep 13, 2017

SDP190 commented Jun 3, 2018

octref commented Jun 6, 2018

andremacola commented Aug 17, 2018

KamasamaK commented Aug 21, 2018 • edited Loading

michaelblyons commented Aug 25, 2018 • edited Loading

Footnotes

KamasamaK commented Aug 27, 2018

octref commented Aug 28, 2018 • edited Loading

msftrncs commented Sep 21, 2018

msftrncs commented Sep 25, 2018

tomByrer commented May 2, 2019

HugoGranstrom commented Jun 25, 2021

octref commented Feb 14, 2017 •

edited

Loading

aeschli commented Feb 20, 2017 •

edited

Loading

onixie commented Mar 29, 2017 •

edited

Loading

mjbvz commented May 2, 2017 •

edited

Loading

KamasamaK commented Aug 21, 2018 •

edited

Loading

michaelblyons commented Aug 25, 2018 •

edited

Loading

octref commented Aug 28, 2018 •

edited

Loading